FOREWORD

Between  the  idea 

And  the  reality... 

Falls  the  Shadow 

The Hollow Men, T.S. Eliot

To paraphrase Eliot, between the IU research and the applications falls
the shadow. In recognition of this, the theme for the 1992 DARPA Image
Understanding Workshop is “Image Understanding Applications.” We are
back to the theme stressed by Bob Simpson in the 1985 DARPA Workshop
Proceedings when he warned, “... We need to identify real defense
applications that have been made possible by the basic IU research. Otherwise
you will have difficulty justifying the continuation of IU.” Indeed, there is
even more need today to show how the IU research of the past decade or more
can be made to pay off in applications.

In response to this need, we have scheduled a Current Applications
session, with presentations on the Unmanned Ground Vehicle (UGV), RADIUS,
the Image Understanding Environment (IUE), and the Cartographic
Modeling Environment Enhanced (CMEE). The Future Applications session has
presentations on SAR, Automatic Target Recognition, the Rapid Geographic
Database, and emerging applications in Bomb Damage Assessment, Image
Dissemination, and Flexible Template Generation.

The first robotic vehicle in the history of the United States will be fielded
as a result of the Joint Unmanned Ground Vehicles program. This program
has three major thrusts: 1) building and early user testing of a
Surrogate Teleoperated Vehicle; 2) engineering manufacturing and development
of a Tactical Unmanned Ground Vehicle; and 3) Unmanned Ground Vehicle
technology enhancement and exploitation. The Technology Enhancement
Program is a demonstration-driven effort that includes Demo-I and Demo-II.
Demo-II, directed by LTC Erik Mettala, builds on the work of the
Autonomous Land Vehicle and research on the NAVLAB at Carnegie Mellon
University. Advances in image understanding, neural networks, and computers
now make it possible to use a Humvee military vehicle as the research
platform, enabling research in the realm of high-performance autonomous
navigation. The Demo-II program will transition the results of research
developed on Humvees at CMU, University of Massachusetts, University of
Michigan, Jet Propulsion Laboratory and Martin Marietta Corporation onto
the Surrogate vehicle, building four ‘Surrogate Semiautonomous Vehicles’
to be demonstrated in the fourth quarter of FY 1994. In a recent CMU
demonstration, a Humvee operated autonomously in traffic at 63 MPH over
a 4-mile stretch of freeway, and ran for 22 miles at more legal speeds.

My predecessor, Rand Waltzman, devoted much time and effort to
RADIUS, Research and Development for Image Understanding Systems, a
system for the image analyst that focuses on operational intelligence
applications. The objective of the study, begun in June 1991, is to provide IU
technology through research and demonstrations within 5 years that can
be deployed operationally within 10 years. RADIUS is to improve imagery
analyst (IA) productivity and the timeliness and quality of exploitation
products. The CMEE, based on the SRI Cartographic Modeling Environment,
will provide a baseline environment for evaluation by the RADIUS
contractor.

The Image Understanding Environment (IUE) was another of Rand’s
projects. The IUE will facilitate the exchange of research results within
the IU community, and provide a platform for demonstrations and tools for
DARPA applications. We expect the IUE to serve as a conceptual
standard for IU data models and algorithms. The IUE will facilitate transfer
of research results to the applications development community by making
available the latest experimental programs in a standardized format and a
standardized environment. The design specs for the IUE, a 400-page
document, are nearing completion.

With all of this emphasis on applications, I don’t want to overlook the
crucial IU science base of technical papers. There are 72 technical papers,
divided into the following categories: (a) physics-based and low-level vision
(12%); (b) stereo (7%); (c) motion (17%); (d) shape recovery and
analysis (13%); (e) object recognition and scene analysis (17%); (f) aerial photo
matching and interpretation (13%); (g) active vision and visual strategies
(5%); (h) robotics applications (12%); and (i) IU software and parallel
processing (4%).

Finally, I have to mention funding. In 1990 there were about 80 responses
to BAA 90-15 dealing with image research, of which at least 20 deserved
funding. Unfortunately, there was money available for only half of these. This
points out a basic problem in the field: major university research centers are
turning out Ph.D.s who, after a few years of post-doc work, start IU centers
at other universities. Ten years ago there were about half a dozen IU research
centers, and now there are at least double that number, with new ones being
formed every year. In addition, the cost of such centers continues to increase
as more powerful workstations and computers are required to carry out the
work. I offer no solution to the funding problem (of course, applications of
IU that “knocked people’s socks off” would help); I merely point out that
there are a lot of empty ricebowls out there and it will take a wide breadth
of applications to fill them.

Oscar  Firschein,  DARPA  SISTO 
Program  Manager 
Image  Understanding 
SECTION I

Principal  Investigator  Reports 


USC IMAGE UNDERSTANDING RESEARCH: 1990-1991


R.  Nevatia,  K.  Price  and  G.  Medioni* 
Institute  for  Robotics  and  Intelligent  Systems 
University  of  Southern  California 
Los  Angeles,  California  90089-0273 


Abstract 

This paper summarizes the USC Image Understanding
research projects and provides references
to more detailed sources of information.

Our work has focussed on the topics of 3-D vision
(including range data processing, stereo,
shape from contour and object recognition),
aerial image analysis, motion analysis (including
spatio-temporal analysis, 3-D motion estimation,
detection of moving objects and an integrated
motion system), and parallel processing
(including mapping algorithms onto specific
or flexible architectures, and processor-time
tradeoffs).

1  INTRODUCTION 

This paper summarizes our research projects since the
last image understanding workshop. Some of this work is
described in more detail in other papers in these proceedings
[Chung and Nevatia, 1992; Stein and Medioni, 1992;
Reinhart and Nevatia, 1992; Ulupinar and Nevatia, 1992;
Rom and Medioni, 1992; Kim and Price, 1992; Franzen,
1992; Khokhar and Prasanna, 1992]; this work is covered
only briefly in this summary. We also provide references
to details for work not described elsewhere in these
proceedings. We first give a brief overview of these separate
research areas.

1.1  Descriptions  from  Range  Data 

We  are  continuing  our  work  in  the  acquisition  of  models 
from  multiple  unregistered  views. 

•  In  one  approach,  the  raw  images  are  rough- 
registered  using  the  TOSS  system  [Stein  and 
Medioni,  1991],  then  accurately  overlapped  by  the 
method  described  in  [Chen  and  Medioni,  1991].  It  is 
then  possible  to  view  the  integrated  data  from  any 
viewpoint. 


*This research was supported by the Defense Advanced
Research Projects Agency under contract F49620-90-C-0078,
monitored by the Air Force Office of Scientific Research. The
United States Government is authorized to reproduce and
distribute reprints for governmental purposes notwithstanding
any copyright notation hereon.


•  In another approach, we instead segment each
individual view into simple patches, and describe the
data by a graph whose nodes are the patches and
whose links are adjacency relations between patches
[Fan et al., 1989]. The graphs corresponding to
multiple views are matched using a constraint
satisfaction network [Parvin and Medioni, 1991b]. The
integration proceeds in two steps: first, we generate
a composite graph, then intersect surfaces to
compute edges and vertices which are part of the B-rep
representation.

Finally, given 3D data points, we are investigating the
approximation of the underlying surface by a global
deformable model. This is achieved by minimizing an
energy which expresses the smoothness of the
approximation and the closeness to the original data points. Our
formalism is an extension to surfaces in 3D of the
B-snakes [Menet et al., 1990] for curves.

1.2  Shape  Inference  and  Description  from 
Images 

•  Perceptual Grouping: Most high level vision
algorithms, such as shape from contour [Ulupinar and
Nevatia, 1988] or line drawing interpretation, require
perfect data as input, but it is impossible to
generate such features with low level algorithms such as
edge detectors. Here, we try to bridge this gap by
transforming an edge image into a saliency map.
We present a non-iterative method based on a field
associated with each edge. This field encodes the
notions of simplicity, curvature constancy and
co-curvilinearity. The results on commonly used test
images correspond to “natural” groupings.

•  Description of Planar Shapes: We study the
problem of planar shape description, and suggest a
method to produce an axial representation based
on the hierarchical decomposition of the objects into
parts. Such a representation is robust with respect
to scale, noise and occlusion.

•  3-D Shape Inference from Stereo: We have been
interested in descriptions from stereo images for a long
time. Our current focus is on development of a
hierarchical stereo system where features are matched
at multiple levels of abstraction. Another aspect of
this work is in computing object descriptions from
the stereo matches, which tend to be sparse. We
expect our methods to work for objects found indoors
as well as man-made objects outdoors.

•  3-D Shape from Contour: In this project, we are
developing techniques for inferring 3-D shape
descriptions given only object contours. We have
developed a theory that can infer the shape of a class of
objects, namely zero-Gaussian curvature surfaces,
straight homogeneous generalized cylinders and
planar, right, constant cross-section generalized
cylinders. One of the recent advances here is the extension
of our techniques to infer the shape of objects made of
multiple curved surfaces. We have also started to
develop techniques to make our theory work with
real images, where contours are likely to be
fragmented and distracting contours such as markings
and shadows are present. Our early results are
encouraging and are described briefly later in this paper.

1.3  Object  Recognition 

•  Complexity Analysis of the TOSS system: We have
defined a methodology based on efficient coding and
hash tables to recognize 3D objects given 3D data,
even when the number of models is large [Stein
and Medioni, 1991]. We have performed a detailed
complexity analysis of the method, which results in
O(n) < O_recognition < O(nm^), where n is the
number of matching primitives and m is the number of
models in the database. The worst case occurs when
the models hypothesized to be in the scene are very
similar.

•  Recognition of 3D objects in images: The more
interesting problem is to recognize 3D objects from
grey level images. The previous methodology
becomes very inefficient, as the number of generated
hypotheses increases drastically. We propose
instead to generate high level groupings and to use
these as matching primitives. The groupings we are
using are based on parallel and skew symmetry,
U-shapes and closures. Furthermore, we show how to
compute these groupings efficiently from segments,
and how we keep the number of groupings small.
We have obtained encouraging initial results on real
images.

•  The “Drop-off” problem: As an application of
our matching methodology, we study the “drop-off”
problem, in which an observer is given a topographic
map, and is dropped off at an unknown location.
We select as the matching feature the panoramic
horizon curve (corresponding to the sky-ground
boundary from a given viewpoint). The polygonal
approximation of this curve is compared with
precomputed ones using our hash based scheme [Stein and
Medioni, 1990]. We have obtained accurate results
from real data.

1.4  Motion  Analysis 

We have a number of projects in the area of motion
analysis, with autonomous navigation providing the context
for most of the work, though these techniques have a
much broader utility.

•  Spatio-temporal Analysis: Early work in
spatio-temporal analysis assumed restricted paths for the
moving camera. We developed a slice-based
analysis technique that computes dense optical flow
estimates for arbitrary observer motion in closely
spaced data [Peng, 1991].

•  Motion  Estimation:  We  have  continued  our  study 
of  multi-frame  motion  estimation  techniques  with 
the  development  of  a  system  to  find  the  three- 
dimensional  motion  and  structure  estimates  for 
a  class  of  motion,  called  chronogeneous  motion, 
that  includes  the  standard  constant  motions  and 
some  accelerated  motions  [Franzen,  1991b;  Franzen, 
1992]. 

•  Integrated system for Motion: In order to use our
motion estimation system more fully, we have
developed an integrated system that includes
hierarchical feature extraction and matching and feedback of
the 3-D motion estimation to the feature matching
process. This enables the system to tolerate errors
and differences in the feature extraction and
matching process by removing these inconsistent feature
points from the later analysis [Kim and Price, 1992].

•  Mobile Platform: We use the domain of autonomous
navigation to unify our motion work. To this end
we have a small project in vision based navigation
with a trinocular stereo system for reliable 3-D
descriptions of the environment.

1.5  Aerial  Image  Analysis 

Our work in aerial image analysis consists of two major
components. The first is the transfer of technology funded
by DARPA to the RADIUS program. Specifically, we are
focussing on the transfer of our techniques for building
detection in aerial images [Huertas and Nevatia, 1988a;
Mohan and Nevatia, 1989b]. The second project is the
continuation of our long range effort of analyzing complex
cultural domains. We have chosen large commercial
airports as a test domain. In previous work [Huertas et al.,
1990a; Huertas et al., 1989a], we have shown good
results on the detection of runways and taxiways. In this
report, we describe our recent work on aircraft detection.

1.6  Parallel  Processing 

We are investigating parallel implementations of various
vision algorithms developed in our group and elsewhere.
We have studied algorithms for stereo and image
matching, graph algorithms, sorting on the reconfigurable mesh
and VLSI architectures for image transforms. In another
project, we have studied parallel implementations of
various symbolic algorithms. Here, we are interested not
only in the efficiency of hardware utilization but also in
programmer efficiency and software maintainability.
We find that these goals can be achieved simultaneously
by choosing an appropriate architecture. Some of these
investigations are described later in this paper.
2  DESCRIPTIONS FROM RANGE
IMAGES

Range imagery differs from intensity imagery in that the
input directly relates to the geometric shape of the
objects in the scene. The issues to be addressed are the
familiar ones: boundary detection, segmentation,
representation, recognition, and pose estimation. How can
we represent models to perform matching? In our
previous work, we have used multiple views [Fan et al., 1989;
Stein and Medioni, 1991; Parvin and Medioni, 1991b],
but it is sometimes necessary to also generate an
integrated model from multiple unregistered views. The
model building procedure to perform the integration at
the data level described in [Chen and Medioni, 1991] is
nonlinear and therefore requires an initial guess. We
have used the TOSS system [Stein and Medioni, 1991]
to provide such a guess. The current limitation is that
the function describing the surface of the object must be
single valued in some coordinate system. We are working
on the segmentation of the object into parts to overcome
this limitation.

2.1  Symbolic  Level  Merging  into  a  B-Rep 

The alternative approach consists of generating a
symbolic description, such as an attributed graph, for each
view, and then merging the different descriptions at this
high level. Each view is represented by a graph whose
nodes are the individual surface patches and whose links
are the relationships between adjacent patches. These
patches are inferred from the bounding contours,
detected using local operations. We overcome the
limitations inherent in this local scheme by forcing the
junctions to correspond to possible objects. The
implementation is performed using a dynamic network, as explained
in [Parvin and Medioni, 1991a]. The rigid
transformation between any two views is obtained by matching
the graphs describing the views. In our implementation
[Parvin and Medioni, 1991b], this is achieved through a
two level constraint satisfaction network.

The issue we now address is the integration of multiple
views into a single volumetric description. In Geometric
Modeling [Requicha, 1980], we construct an object based
on geometric operations such as union and intersection of
primitive features, i.e. cone, cylinder, etc. This strategy
can also be extended to multiple view integration for 3D
surfaces; that is, as more area of a given surface becomes
visible from other views, its bounding contours are
updated to accommodate the additional information.
This approach, however, requires precise knowledge of
errors in segmentation and transformation. Because of
these errors, surfaces are not placed at their exact
locations when they are transformed to the reference
coordinate system. As a result, the merging can become quite
complicated, and several tolerances must be specified for
managing the various types of errors. Our approach avoids the low
level updating strategy, and concentrates first on
building a composite attributed graph of the object. Once
the composite graph is constructed, adjacent surfaces
are intersected and the location of edges and vertices is
computed.


To begin with, the composite graph is initialized to be
the attributed graph of a given view. As each additional
view reveals more surfaces, they are added to the
composite graph. Besides adding new links to the
composite graph, the union operation updates the
attributes of surfaces that are more occluded in one view
than in the other.
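The union step described above can be sketched roughly as follows. This is a minimal illustration under our own assumptions, not the implementation used in the paper: the dictionary layout, the `visible_area` attribute used as a stand-in for "less occluded", and the `("new", n)` identifier scheme are all hypothetical, and the correspondence `match` between view nodes and composite nodes is assumed to come from the graph matching stage.

```python
def merge_view(composite, view, match):
    """Merge one view's attributed graph into the composite graph.

    composite, view: {"nodes": {id: {"visible_area": float, ...}},
                      "links": set of frozenset({id_a, id_b})}
    match: maps view node ids to composite node ids for surfaces that are
           already present; unmatched view nodes are treated as new surfaces.
    """
    rename = {}
    for vid, attrs in view["nodes"].items():
        cid = match.get(vid)
        if cid is None:
            # New surface revealed by this view (hypothetical id scheme).
            cid = ("new", len(composite["nodes"]))
            composite["nodes"][cid] = dict(attrs)
        elif attrs["visible_area"] > composite["nodes"][cid]["visible_area"]:
            # The surface is less occluded in this view: take its attributes.
            composite["nodes"][cid] = dict(attrs)
        rename[vid] = cid
    for link in view["links"]:
        a, b = tuple(link)
        composite["links"].add(frozenset((rename[a], rename[b])))
    return composite
```

Keeping the attributes of the less occluded observation is only one possible update rule; the text does not say which attributes are updated or how.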

The  next  issue  is  surface  intersection.  For  polyhedral 
scenes,  it  is  rather  straightforward,  so  we  only  explain 
the  process  for  curved  surfaces.  For  details,  we  refer  the 
reader  to  the  technical  report  [Parvin,  1991]. 

In each view, a curved surface patch is described by a
quadratic equation, and the surface patch is bounded by
the corresponding limb and/or orientation discontinuity
segments from that view. We adopt the implicit form
F(X, Y, Z) = 0, which is given in matrix form as follows:
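One common way to write a general quadric in matrix form, offered here as an assumption rather than as the authors' exact notation, uses a symmetric 4-by-4 matrix with ten independent entries. The entry names $q_1, \dots, q_{10}$ are our own labels.

```latex
F(X, Y, Z) =
\begin{pmatrix} X & Y & Z & 1 \end{pmatrix}
\begin{pmatrix}
q_1 & q_4 & q_5 & q_7 \\
q_4 & q_2 & q_6 & q_8 \\
q_5 & q_6 & q_3 & q_9 \\
q_7 & q_8 & q_9 & q_{10}
\end{pmatrix}
\begin{pmatrix} X \\ Y \\ Z \\ 1 \end{pmatrix}
= 0
```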
The coefficients of this matrix are approximated by
the conventional least-squares technique, forming the
10-by-10 scatter matrix and searching for the eigenvector
corresponding to the minimum eigenvalue. This scatter
matrix is constantly updated as data points from other
views become visible. These data points are first
transformed into the reference viewpoint, where their
contributions are added to the scatter matrix. Once the scatter
matrix contains the contribution from all viewpoints, the
coefficients above are estimated for surface description.
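The incremental least-squares fit described above can be illustrated with a short NumPy sketch. It is written under our own assumptions: the ordering of the ten monomials, the function names, and the representation of the inter-view transformation as a rotation matrix plus a translation vector are hypothetical choices, not taken from the paper.

```python
import numpy as np

def monomials(points):
    """Map N x 3 points to N x 10 quadric monomials (ordering is our choice)."""
    x, y, z = points[:, 0], points[:, 1], points[:, 2]
    return np.stack([x * x, y * y, z * z, x * y, x * z, y * z,
                     x, y, z, np.ones_like(x)], axis=1)

def update_scatter(scatter, points, rotation=np.eye(3), translation=np.zeros(3)):
    """Transform points into the reference frame and add their contribution."""
    ref = points @ rotation.T + translation
    m = monomials(ref)
    return scatter + m.T @ m

def quadric_coefficients(scatter):
    """Unit eigenvector of the scatter matrix with the smallest eigenvalue."""
    eigenvalues, eigenvectors = np.linalg.eigh(scatter)  # ascending eigenvalues
    return eigenvectors[:, 0]
```

Because only the 10-by-10 scatter matrix is stored, points from a new view can be folded in without revisiting earlier views; the coefficients are re-estimated from the accumulated matrix whenever needed.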

In addition to updating the scatter matrix, the
corresponding bounding surface orientation boundaries are
transformed into the current viewpoint. Once the
composite surface is completed, adjacent surfaces are
intersected so that the edge information of the B-rep is
computed. In general, surface intersection between quadratic
surfaces is a difficult problem. However, additional
information maintained in the attributed graph simplifies this
task. This includes the list of boundary points along the
adjacent surfaces. Of course, this information is not
accurate due to segmentation errors, transformation errors
and quantization errors. As a result, the surfaces are
not placed at the exact boundary location. We propose
to use the location of the crease segments as the initial
estimate of the surface intersection. Two cases exist for
tracing of surface intersection. In both cases, the crease
segment dividing the adjacent surfaces can define an
auxiliary plane normal to the crease boundary. The first case
involves intersection of a quadratic surface and a plane.
In this case, the solution to the intersecting curve can
be computed in closed form, since the nonlinear system
of equations contains two linear equations. The second
case involves intersection of two quadratic surfaces. In
this case the nonlinear system of equations is solved by the
Levenberg-Marquardt algorithm, which is supported by
the IMSL package. The basic tracing algorithm is as
follows:

Assume that adjacent surfaces are given by $F(X, Y, Z) = 0$
and $G(X, Y, Z) = 0$ with a point $P_i$ on their
intersection. The following procedure will trace the intersecting
boundary:

1.  compute vector $d = \nabla F \times \nabla G$ at point $P_i$,

2.  estimate $P_{i+1} = P_i + \alpha d$ for some small step size $\alpha$,

3.  estimate the auxiliary plane, $U$, normal to the line
$P_i P_{i+1}$ passing through point $P_{i+1}$,

4.  refine the estimate of $P_{i+1}$ by solving a nonlinear
system of equations given by $F$, $G$ and $U$ and their
Jacobian.
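The four tracing steps can be sketched in NumPy as a predictor-corrector loop. This is a hedged illustration only: it substitutes a plain Newton iteration for the Levenberg-Marquardt solver named in the text, and the function name, argument names, step size and iteration counts are our own assumptions.

```python
import numpy as np

def trace_intersection(F, gradF, G, gradG, p0, step=0.05, n_steps=100, newton_iters=10):
    """Trace the curve F = G = 0 starting from a point p0 on the curve."""
    points = [np.asarray(p0, dtype=float)]
    for _ in range(n_steps):
        p = points[-1]
        d = np.cross(gradF(p), gradG(p))        # step 1: tangent direction
        d /= np.linalg.norm(d)
        q = p + step * d                        # step 2: predicted next point
        anchor = q.copy()                       # step 3: plane U through the
                                                # prediction, with normal d
        for _ in range(newton_iters):           # step 4: solve F = G = U = 0
            residual = np.array([F(q), G(q), d @ (q - anchor)])
            jacobian = np.vstack([gradF(q), gradG(q), d])
            q = q - np.linalg.solve(jacobian, residual)
        points.append(q)
    return np.array(points)
```

For example, tracing the intersection of a unit sphere with a cylinder of radius 0.5 from a point on their common circle should return points that stay on both surfaces. Newton's method is used here only to keep the sketch dependency-free; a damped solver is more forgiving of poor initial estimates.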

The performance of our system for B-rep generation is
illustrated on an example with curved surfaces, as shown
in Figure 1. Here, 6 views of the object are used to
construct the composite graph of the object. These views
are matched, the inter-view transformation is computed,
and the necessary attributes are transformed from one
view into the next one. The segmented results for each
view of the object, along with the resulting B-rep, are
shown in Figure 2. Note that the union of the object has
a gap in the creased curved segment that is completed
during surface intersection.


Figure 1: Intensity (a) and shaded (b) image


2.2  Surface  Approximation  by  Deformable 
Patches 

Overview  We present an implementation of
deformable models to approximate 3-D surfaces. It is an
extension of our previous work on “B-snakes” [Menet
et al., 1990], which approximate curves using B-splines.
The user provides an initial simple surface, such as a
cylinder or a sphere, which is subject to internal forces
(describing implicit continuity properties such as
tension and bending) and external forces which attract it
toward features. The problem is cast in terms of energy
minimization, and is solved iteratively; our choice of
basis functions leads to reasonable complexity and good
numerical stability. We show results on real range
images to illustrate the applicability of our approach. The
advantages of this approach are that it provides a
compact representation of the approximated data, it gives a
$C^1$ continuous (for quadratic spline) analytical
description of the data, which allows computation of differential
properties, and it lends itself to applications such as
non-rigid motion tracking and object recognition.


Figure 2: B-rep from multiple views of the object: (a-f)
view 1 to 6; (g) union of crease segments; (h) corrected
crease segments after surface intersection; (i) B-rep.


Surface Fitting  The idea of fitting data by a
deformable model can be found in the work of Kass
et al. [Kass et al., 1987] in 2D. Such models were
generalized to 3D by the same authors [Terzopoulos et al., 1987;
Terzopoulos, 1988] for a surface of revolution. More
recently, Cohen [Cohen et al., 1991], Terzopoulos
[Terzopoulos and Metaxas, 1991] and Pentland [Pentland and
Sclaroff, 1991] have formulated new methods for 3D
data. Our approach is closely related to [Cohen et al.,
1991], except that we use finite differences as opposed to
a finite element method. The formalism we are about to
establish amounts to deforming the initial surface to
conform as closely as possible to the given 3D data points.
This is achieved by defining an attraction force field
around these points to bring the initial surface closer
to them, and is solved by introducing an energy
dissipation functional to dissipate the kinetic energy during the
motion. In our case, since the entire surface is defined in
terms of control points, the positions of all control points
get updated at each iteration. We are given an initial
surface (such as a cylinder), defined with $MN$ control
vertices. Any point on the surface is defined by the patch
$(i, j)$ to which it belongs ($0 \le i < M$, $0 \le j < N$) and
its coordinates $(u, v)$ on the patch ($0 \le u, v \le 1$).
Therefore we write the position $(V_x, V_y, V_z)$ of each point as
$V(i, j, u, v)$. The total energy of the surface is the sum
of the energies at each point.

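The internal energy expresses the smoothness of the first
and second derivatives of V, and can be written as

$$E_{int} = w_a \left[ \left\| \frac{\partial V}{\partial u} \right\|^2 + \left\| \frac{\partial V}{\partial v} \right\|^2 \right] + w_b \left[ \left\| \frac{\partial^2 V}{\partial u^2} \right\|^2 + 2 \left\| \frac{\partial^2 V}{\partial u \, \partial v} \right\|^2 + \left\| \frac{\partial^2 V}{\partial v^2} \right\|^2 \right]$$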


The external energy is a potential energy which attracts
the surface toward the data points. In our
implementation, we precompute in a digitized volume (typically
64 × 64 × 64) the distance of each voxel to the closest
data point. The distance from a point on the
approximating surface is simply that of the voxel the point falls
in. The energy for each patch is computed at K × K
equally sampled points on the surface, therefore $E_{ext}$ is

$$E_{ext}(i, j) \approx \mathrm{area}(i, j) \times w_{ext} \times \sum_{k_u=0}^{K-1} \sum_{k_v=0}^{K-1} G\left( V\left( i, j, \frac{k_u}{K}, \frac{k_v}{K} \right) \right)$$

where $G$ denotes the precomputed distance value and
area(i,j) is approximated by the sum of eight triangles.
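The precomputed distance volume and the per-patch external energy can be sketched as follows. This is a minimal illustration under stated assumptions: the data are assumed to lie in a cube `[lo, hi]^3`, the nearest-point search is a brute-force loop (slow at full 64 × 64 × 64 resolution with thousands of points), and all function and parameter names are hypothetical.

```python
import numpy as np

def distance_volume(data, size=64, lo=0.0, hi=1.0):
    """Distance from each voxel centre to the closest data point (brute force)."""
    centres = lo + (np.arange(size) + 0.5) * (hi - lo) / size
    gx, gy, gz = np.meshgrid(centres, centres, centres, indexing="ij")
    grid = np.stack([gx, gy, gz], axis=-1).reshape(-1, 3)
    best = np.full(len(grid), np.inf)
    for p in data:                      # one pass per data point keeps memory small
        best = np.minimum(best, np.linalg.norm(grid - p, axis=1))
    return best.reshape(size, size, size)

def lookup(volume, point, lo=0.0, hi=1.0):
    """Distance value of the voxel that contains `point`."""
    size = volume.shape[0]
    idx = np.floor((np.asarray(point) - lo) / (hi - lo) * size).astype(int)
    idx = np.clip(idx, 0, size - 1)
    return volume[tuple(idx)]

def external_energy(volume, samples, area, w_ext):
    """area * w_ext * sum of looked-up distances over a patch's sample points."""
    return area * w_ext * sum(lookup(volume, s) for s in samples)
```

The lookup makes each energy evaluation a constant-time array access, at the price of quantizing distances to the voxel grid.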

For a quadratic B-spline surface, a control point
$C(i, j)$ only affects 9 patches $P(k, l)$ ($i - 1 \le k \le i + 1$,
$j - 1 \le l \le j + 1$). So, to minimize the energy with
respect to the position of each control point, we only
consider these patches. For the X component, we have
the following equation:

$$\frac{\partial E_{total}}{\partial X(i, j)} = \sum_{k=i-1}^{i+1} \sum_{l=j-1}^{j+1} \frac{\partial E(k, l)}{\partial X(i, j)} = 0$$

For a quadratic spline, this expression is a function of 25
control points $C(k, l)$, $i - 2 \le k \le i + 2$, $j - 2 \le l \le j + 2$, so
we end up, for each control point $C(i, j)$, with an equation
of the type

$$\sum_{h=i-2}^{i+2} \sum_{k=j-2}^{j+2} C_{hk} \, X(h, k) + D(i, j) = 0$$

This can be expressed in matrix form as $AX + D = 0$,
where $A$ is an $MN$ by $MN$ matrix, and $X$ and $D$ are $MN$ by
1 matrices. This system of equations is solved iteratively:

$$X_{t+1} = (A + \gamma I)^{-1} \cdot (\gamma X_t - D(x, y, z)),$$
where $\gamma$ is the Euler step size.
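The update rule corresponds to a semi-implicit Euler step: the linear term $AX$ is evaluated at the new iterate and $D$ at the current one. A minimal NumPy sketch follows, assuming a small dense matrix; the function name `relax`, the callable `D_of`, and the default values are our own illustrative choices.

```python
import numpy as np

def relax(A, D_of, X0, gamma=0.05, n_iter=25):
    """Iterate X <- (A + gamma I)^(-1) (gamma X - D(X)) for a fixed number of steps."""
    X = np.asarray(X0, dtype=float)
    # A does not change between iterations, so the inverse is computed once.
    M = np.linalg.inv(A + gamma * np.eye(A.shape[0]))
    for _ in range(n_iter):
        X = M @ (gamma * X - D_of(X))
    return X
```

When $D$ is constant and $A$ is positive definite, the iteration converges to the solution of $AX + D = 0$. In the surface-fitting setting $D$ depends on the current surface position, so it is passed here as a function of the current iterate.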


Results  We present an example with real data to
demonstrate our approach. The data is a 360° laser
range finder image of a human head scanned by
Visual Computing Group (courtesy of Vision and
Modeling group, the Media Laboratory, MIT). There are about
2500 points, as shown in Figure 3(a). The original
surface is defined by 22 × 22 control points equally spaced
on a cylinder, as shown in Figure 3(b). The final
position of the control points is shown in Figure 3(c) after
25 iterations, and the corresponding surface is displayed
in Figure 3(d). The running time for this example is 30
minutes on a Sparc 2 workstation. We used the following
parameters:

$\gamma = 0.05$, $w_a = 0.001$, $w_b = 0.05$, $w_{ext} = 10$


Figure 3: Fitting a B-spline surface: (a) original range
data; (b) starting surface; (c) final position of control
points; (d) final surface.


Future  This  method  is  a  direct  extension  to  3D  of  our 
work  on  B-snakes.  We  are  interested  in  reducing  the 
computing  time  by  an  order  of  magnitude,  and  believe 
that  this  can  be  achieved  using  an  adaptive  approach. 

3 SHAPE INFERENCE AND DESCRIPTION FROM IMAGES

3.1  Perceptual  Grouping 

Perceptual grouping is an area that is likely to improve results in computer vision. It can be classified as a mid-level issue, directed toward closing the gap between what is produced by state-of-the-art low-level algorithms and what is desired as input to high-level algorithms. Because of that gap, many researchers resort to using synthetic data as their input. Figure 4(a) depicts an example of perceptual grouping easily experienced by the human visual system. The circle in the middle is easily distinguishable from its noisy background. Furthermore, we tend to fill the gaps and accept the fragmented circle as a complete one. Figure 4(b) shows a more complex case with overlapping features. A striking example, often used in the psychology literature, is that of the Kanizsa illusion (Figure 4(c)) [Kanizsa, 1976]. Here we perceive edges which have no physical support whatsoever in the original signal. Lowe [Lowe, 1987] discusses the Gestalt notions of co-linearity, co-curvature and simplicity as important in perceptual grouping. Ullman [Sha’ashua and Ullman, 1988] suggests the use of a saliency measure to guide the grouping process, and to eliminate erroneous features in the image.

Figure 4: Examples of perceptual groupings. (a) A perceived circle; (b) overlapping structures; (c) Kanizsa square.

The approach described here is closely related to Ullman’s, in the sense that a saliency map is first constructed from an edge image, and higher-level features are inferred later. The proposed approach is capable of “highlighting” structures which are salient, interpolating gaps in a smooth manner, and removing noisy edgels in a given image, all in a unified non-iterative scheme. The underlying goal is to keep the interpretation as simple as possible in the “Gestalt” sense. This translates into three major constraints:

1. Co-curvilinearity In the absence of other cues, continuation is the only interpretation, and so is co-curvilinearity.

2.  Constancy  of  curvature  We  tend  to  extend  a  curve 
of  some  constant  curvature  with  the  same  curvature. 

3.  Favoring  low  curvatures  over  large  ones  Humans 
seem  to  connect  fragmented  line  segments  in  a  way 
that  the  increase  in  total  curvature  is  minimum  (see 
Ullman  [Sha’ashua  and  Ullman,  1988]). 

With that in mind, we have devised a technique that implicitly imposes the above constraints in the form of an Extension field emanating from each edge segment, as described next. The computational complexity of the saliency map is linear in the number of edge elements in the image and thus bounded by the size of the given image. Ideally, a saliency map should assign large probability values along illusory lines such as those of Figure 4(c), and also specify the most probable direction of continuation of the segment. We will show how our scheme treats examples of this nature.

Extension Fields An Extension field is a non-normalized directional probability field describing the contribution of a single edge element to its environment in terms of length and direction. In other words, for every point it votes on the preferred direction and on the probability that the point shares a curve with the original segment.

The field is of infinite extent, although in practice it vanishes beyond a predefined distance from the edge. Figure 5 depicts such a field. Since we favor low and constant curvature, the field direction at a given point in space is chosen to be tangent to the circle passing through the edge segment and that point, while its strength is proportional to the radius of that circle. Also, the strength decays square-exponentially with the distance from the origin (the edge segment).

Figure 5: The elementary extension field


r(x, 

0 


if  |y|  <  |x| 
otherwise 


where  0(x,y)  =  <an-‘(j|^), 
and  r(x,  y)  =  Ae~^ 

and  D  is  the  arc  length  along  the  circle  linking  the  origin 
to  (x,y). 
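The geometry of such a field can be sketched in Python as follows. This is an illustration under stated assumptions, not the formula used in the paper. The edge element sits at the origin along the x axis, and the orientation is derived from the tangent-chord property of the circle tangent to the edge at the origin. The strength is a constant amplitude times exp(-(D/sigma)^2), with D the arc length; the dependence on the circle radius is not modeled. The function name `extension_field`, the parameter `sigma`, and the inclusion of the boundary |y| = |x| are hypothetical choices.

```python
import math

def extension_field(x, y, amplitude=1.0, sigma=10.0):
    """Return (strength, orientation) voted at (x, y) by an edge element
    at the origin aligned with the x axis.

    Assumptions: constant amplitude, Gaussian decay in the arc length D,
    and zero field where |y| > |x|.
    """
    if x == 0.0 and y == 0.0:
        return amplitude, 0.0
    if abs(y) > abs(x):
        return 0.0, 0.0
    # Angle between the edge direction and the chord to (x, y).
    phi = math.atan2(abs(y), abs(x))          # in [0, pi/4]
    chord = math.hypot(x, y)
    # Arc length along the circle tangent to the edge at the origin.
    arc = chord if phi == 0.0 else chord * phi / math.sin(phi)
    strength = amplitude * math.exp(-(arc / sigma) ** 2)
    # Tangent of that circle at (x, y), as an orientation in [0, pi).
    orientation = (2.0 * math.atan2(y, x)) % math.pi
    return strength, orientation
```

Points on the axis of the edge receive a vote parallel to the edge, while points at 45 degrees receive a vote perpendicular to it, since the connecting arc there is a quarter circle.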

The assignment of actual probabilities to the field is performed as follows. We consider two short edge segments, perpendicular to each other and apart. We assign probabilities to the field in such a way that all paths connecting the two segments get roughly the same saliency, so that no single best path exists between the two. This seems to be in agreement with human perception.

The  uncertainty  with  regard  to  where  an  isolated  edge 
should  be  extended  (if  at  all)  is  very  high  to  begin  with. 
When  several  co-linear  edge  segments  are  combined,  the 
resulting  field  becomes  more  and  more  directed  and  the 
uncertainty  grows  smaller.  For  instance,  when  two  co- 
linear  line  segment  fields  are  combined,  the  probability 
for  that  longer  segment  to  extend  to  a  very  curved  line 
becomes  smaller,  while  the  weights  in  the  direction  of  the 
lines  grow  larger.  The  same  intuitive  result  is  produced 
when  two  segments  form  an  obtuse  angle.  Furthermore, 
the gaps are bridged in a smooth fashion. The whole
process  can  be  thought  of  as  a  directional  convolution 
with  the  above  field  (mask).  The  resulting  map  is  then  a 
superposition  of  a  collection  of  fields  each  oriented  along 
a  corresponding  short  segment. 

Combining  individual  field  elements  is  of  great  impor¬ 
tance.  Ideally,  we  would  want  an  averaged  majority  vote 
regarding  the  preferred  orientation  of  a  given  position. 
In  practice  we  used  a  pair-wise  vector  addition  of  field 
elements.  The  addition  favors  the  direction  which  makes 
the resulting vector the largest. In tie situations (2 vectors are perpendicular) an arbitrary choice is made.
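One way to read this pair-wise addition is sketched below in Python. It assumes that each vote is a 2D vector whose sign is irrelevant (an orientation and its opposite describe the same line), so the second vector is flipped whenever that makes the sum longer. The function name `add_votes` and the tuple representation are hypothetical.

```python
def add_votes(a, b):
    """Add two orientation votes, choosing the sign of b that makes
    the resulting vector the largest.

    Assumption: votes are (x, y) tuples and only their orientation
    matters, not their sign.
    """
    ax, ay = a
    bx, by = b
    # A negative dot product means that -b is better aligned with a.
    if ax * bx + ay * by < 0:
        bx, by = -bx, -by
    # A zero dot product is the tie case (perpendicular vectors);
    # b is kept as given, which is an arbitrary choice.
    return (ax + bx, ay + by)
```

Because the sign choice depends on the running sum, the result of accumulating many votes this way can depend on the order in which they are added.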

The whole scheme is resolution dependent. This means that the actual “primitive” line segments are arbitrarily defined to have a specific length. From this point on, they are considered to be of unit length. It is thus desirable in general to process an image starting with a coarse resolution to extract large features, and then go on to finer features. In our current system, only the coarse resolution is considered.

It  is  interesting  to  note  that,  although  the  process  is 
local  in  essence,  a  global  percept  emerges  as  a  result. 

The  addition  of  random  noise  to  an  image  is  expected 
to  create  a  distributed  map  of  votes,  and  thus  not  to  in¬ 
terfere  with  truly  salient  patterns.  When  an  accidental 
formation  of  random  segments  does  give  rise  to  high  val¬ 
ues  in  the  map,  that  formation  is  perceived  as  significant 
to  humans  as  well. 

In  most  cases,  applying  the  convolution  once  is  suffi¬ 
cient  to  achieve  a  meaningful  saliency  map.  However,  in 
situations  where  the  noise  levels  are  particularly  large, 
we  threshold  the  saliency  map  and  recalculate  the  con¬ 
volution  on  that  map. 

Extraction of high-level features Once a saliency map is acquired, a process which iteratively “removes” salient groups is started. This process first removes the most salient group, recalculates the saliency map, and then proceeds to remove the next most salient group. The complexity of this process is thus proportional to the number of features in the image. It is guaranteed to terminate since at each iteration the overall power of the field is strictly reduced by removing a feature.

We  use  a  directional  roof-top  following  algorithm.  The 
linking  process  starts  at  the  point  of  largest  saliency  and 
advances  in  the  general  direction  dictated  by  the  orien¬ 
tation  of  the  current  position. 
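A much simplified version of such a linking step is sketched below in Python with NumPy. It is an illustration under assumptions, not the algorithm used for the results. The saliency map is given as two arrays (strength and orientation in radians), the path is followed in one direction only, and no test for a local maximum across the ridge is made. The function name `follow_ridge` and the parameters `threshold` and `max_steps` are hypothetical.

```python
import math
import numpy as np

def follow_ridge(strength, orientation, threshold=0.1, max_steps=1000):
    """Link pixels starting from the point of largest saliency,
    stepping to the neighbor indicated by the local orientation."""
    rows, cols = strength.shape
    r, c = (int(v) for v in np.unravel_index(np.argmax(strength), strength.shape))
    path = [(r, c)]
    visited = {(r, c)}
    heading = None
    for _ in range(max_steps):
        theta = float(orientation[r, c])
        dx, dy = math.cos(theta), math.sin(theta)
        # Orientations are sign-free: keep moving in the current general direction.
        if heading is not None and dx * heading[0] + dy * heading[1] < 0:
            dx, dy = -dx, -dy
        heading = (dx, dy)
        nr, nc = r + int(round(dy)), c + int(round(dx))
        if not (0 <= nr < rows and 0 <= nc < cols):
            break
        if (nr, nc) in visited or strength[nr, nc] < threshold:
            break
        r, c = nr, nc
        path.append((r, c))
        visited.add((r, c))
    return path
```

A fuller version would also follow the ridge in the opposite direction from the starting point and join the two half-paths.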

Results We have tested our approach on the three examples shown in Figure 4. The saliency map produced is shown (strength only) as an intensity image (Figure 6). Following the path of highest saliency produces the expected groupings.


Figure 6: Intensity of saliency maps, panels (a), (b), (c)


Conclusions and future directions The above scheme was shown to achieve useful results for some subset of problems in perceptual grouping. We intend to incorporate two additional kinds of fields. The first will enhance perpendicular relations between lines, such as junctions and edges which perceptually terminate at a vertex. The second will enhance parallel relations in the image. When all these fields are combined, we expect to get a much more powerful saliency measure.

Figure 7: Decompositions obtained from real data. (c) The decomposition of an “F4” shape. (d) The decomposition of an “F18” shape.

3.2 Planar Shape Description and Decomposition

Shape  description  is  a  major  problem  in  machine  percep¬ 
tion,  and  is  the  basis  for  recognition.  Many  approaches 
have  been  suggested,  but  none  provide  a  complete  and 
natural  solution.  We  have  developed  a  method  for  pro¬ 
ducing  an  axial  representation  of  a  shape  based  on  a 
hierarchical  decomposition  of  the  shape  into  its  parts. 
The  novelty  of  our  approach  lies  in  the  combination  of 
several  competing  approaches  and  tools,  into  a  unified 
scheme  and  an  efficient  implementation  producing  natu¬ 
ral  descriptions.  The  details  are  given  elsewhere  in  these 
proceedings  [Rom  and  Medioni,  1992],  and  we  have  ob¬ 
tained  natural  decompositions  on  shapes  obtained  from 
real  data,  as  shown  in  Figure  7. 

3.3  3-D  Shape  Inference  from  Stereo 

Stereo is commonly used to recover the 3-D structure of a scene from multiple images. Traditionally, the major problem in stereo has been considered to be that of finding correspondences, i.e. finding the images of the same physical entity in the multiple images. It has been thought that surface description and segmentation processes follow, and are largely independent of, the correspondence problem. We argue that this view is not completely correct. While finding a good solution to the correspondence problem is critical, we can typically only expect to get a sparse depth map from this step. Thus, finding surface descriptions still requires grouping operations. We argue that the grouping operations may, in fact, help the correspondence process and in this sense the two are not totally independent.

We have been developing a stereo system that constructs a hierarchy of descriptions in each image and matches these descriptions. Our first experience with this approach was in the description of buildings that are composed of rectangular structures [Mohan and Nevatia, 1989b] from a stereo pair of aerial images. Our current system uses the following feature groupings: edges, curves, junctions and ribbons. Matches are made at the highest available level first and propagated down to lower levels. A block diagram of this system is shown in Figure 8. A major component of this method consists of a careful analysis of occlusion properties in stereo, which gives strong constraints on matching of junctions of contours and allows us to distinguish limb boundaries from surface discontinuity boundaries (creases). This system is new since our last progress report; however, it has been published in the open literature elsewhere [Chung and Nevatia, 1991], so to save space we omit the details here.


Figure  8:  Overview  of  our  stereo  system. 


One application of this stereo system we are investigating is shape description of Straight Homogeneous Generalized Cylinders (SHGCs). Many curved objects, including SHGCs, produce boundaries where significant sections are view-point dependent (limb) boundaries. These boundaries cannot be matched in the two views without producing significant errors. Worse yet, incorrectly segmenting the surface at these boundaries produces a poor description. This problem was first addressed by Lim and Binford [Lim and Binford, 1988]. Their system makes some assumptions that are not always correct, and they provided no methods for separating the limb boundaries from the other boundaries. We have developed a method that is mathematically sound and also uses our junction analysis technique to identify the limbs. This method is described in detail elsewhere in these proceedings [Chung and Nevatia, 1992]. Results for a real example using this method are shown in Figure 9.


Figure 9: Results of hierarchical stereo matching and volumetric shape recovery for a scene of a lamp. (a) Left image; (b) right image; (c) recovered volumetric description.


3.4  3-D  Shape  from  Contour 

In  some  cases,  only  a  single  image  of  a  scene  is  available 
and  we  need  to  infer  the  3-D  structure  from  this  sin¬ 
gle  image.  Humans  are  usually  very  good  at  this  task, 
but  the  process  we  use  is  not  well-understood.  Of  the 
many  cues  that  are  available,  such  as  shading,  shadows 
and  texture,  we  believe  that  the  most  significant  one  is 
the  shape  of  the  2-D  contours  themselves.  Inferring  3-D 
surface  from  2-D  contours  is  a  well-known  and  difficult 
problem. We have developed a theory for this that we believe is applicable to a large class of scenes that we encounter and that yields results in agreement with human perception.

A set of 2-D contours can, of course, arise from infinitely many 3-D shapes, though we typically perceive only one (or two). To make such inferences, some assumptions about the nature of the scene must be made. Our objective in constructing our theory was to minimize the number of assumptions made and to generate results that agree with human perception. (In a certain sense, the only test for shape from contour algorithms can be comparison with human perception, since even in those cases where the 3-D object projecting the 2-D shape is known to us, the same 2-D contour could be produced by another 3-D object as well.)

Figure 10: The needle image obtained from the computed surface normals.

Our  technique  is  based  on  observing  certain  kinds 
of  symmetry  relations  among  image  curves.  We  show 
mathematically  the  conditions  under  which  certain 
classes  of  surfaces  exhibit  these  symmetries.  Though 
the  reverse  is  not  always  guaranteed,  we  assume  that 
the  symmetries  do  indicate  the  presence  of  the  corre¬ 
sponding  classes  of  surfaces.  Further,  the  contours  and 
the  symmetries  allow  us  to  formulate  some  constraints 
on  the  quantitative  shape  of  the  surfaces  being  viewed. 
The  constraints  that  derive  purely  from  the  geometry  of 
the  surface  are,  however,  not  sufficient  to  compute  the 
precise  shape  of  the  surface  and  leave  some  degrees  of 
freedom  unspecified.  These  remaining  degrees  of  free¬ 
dom  are  fixed  by  using  some  simple  perceptual  proper¬ 
ties. 

Our  approach  has  so  far  been  developed  for  the  fol¬ 
lowing  classes  of  surfaces: 

•  Zero-Gaussian  Curvature  (ZGC)  surfaces 

•  Straight  Homogeneous  Generalized  Cylinders 
(SHGCs) 

•  Planar,  Right,  Constant  Cross-section  Generalized 
Cylinders  (PRCGCs) 

We  believe  that  this  class  of  objects  is  broad  and  cov¬ 
ers  a  wide  range  of  objects,  particularly  in  man-made 
environments.  In  previous  papers  [Ulupinar  and  Neva- 
tia,  1990a;  Ulupinar  and  Nevatia,  1990b;  Ulupinar  and 
Nevatia,  1991]  we  have  described  these  methods  in  detail 
and  shown  results  on  a  number  of  examples.  In  recent 
work,  we  have  extended  our  technique  to  work  on  ob¬ 
jects  consisting  of  several  curved  surfaces  (restricted  to 
be  zero-Gaussian  curvature  surfaces).  Our  perception 
of  a  surface  can  be  heavily  influenced  by  the  percep¬ 
tion  of  the  neighboring  surfaces.  Our  technique  recovers 
shape  of  all  the  visible  surfaces  of  an  object  simulta¬ 
neously.  Figure  10  illustrates  the  type  of  results  for  this 
technique.  Elsewhere  in  these  proceedings  [Ulupinar  and 
Nevatia,  1992],  we  describe  the  details  of  this  technique. 

3.5  Working  with  Imperfect  Data 

In  these  previous  papers,  however,  we  assumed  that  the 
input  to  our  method  consisted  of  clean  and  complete 
line  drawings.  That  is,  we  assumed  that  we  were  given 


complete  object  boundaries  (without  gaps)  and  that  no 
other  boundaries,  such  as  those  that  may  be  caused  by 
surface  markings,  shadows  and  noise  were  present.  This 
is,  of  course,  unrealistic  when  boundaries  are  derived 
from  real  images  not  necessarily  taken  under  highly  con¬ 
trolled  conditions.  Our  current  work  aims  to  address 
these  difficulties.  This  is  work  in  progress,  but  we  have 
already  gotten  some  very  encouraging  results.  We  briefly 
describe  our  approach  and  some  results  here. 

Our approach is based on perceptual grouping. Normally, perceptual grouping uses criteria such as proximity, continuity, symmetry and closure. While the importance of such criteria is widely accepted, the precise properties for implementation are not clear. In past work [Mohan and Nevatia, 1989a], we showed promising results using intuitive definitions. However, for the current task we have a major advantage: the precise properties that allow us to infer a ZGC, SHGC or PRCGC, the only classes of surfaces for which we know how to infer 3-D structure, are known to us. Thus, our grouping can search for the presence of these specific properties. This not only helps us avoid the computational complexity of the general grouping approaches, but also gives groups that we know how to interpret as 3-D objects. In other words, the processes of grouping and shape description are combined.

Our  grouping  consists  of  several  steps  (see  figure  11). 

1. The first step groups curves based on co-curvilinearity, using the method given in [Mohan and Nevatia, 1989a]. This step can bridge short breaks.

2. The next step attempts to bridge large breaks and to detect missing terminations and discontinuities of curves. For this, we use the symmetry properties themselves to aid in the grouping. This process consists of the following components:

(a) Initially, we fit B-splines to the given curves and find symmetries among these using the method of Saint-Marc [Saint-Marc and Medioni, 1990]. This method gives sparse, local correspondences and is unable to distinguish between “desired” symmetries and others, which can only be done using more global properties.

(b)  Local  symmetries  are  grouped  into  more  global 
symmetries.  Our  grouping  criteria  account  for 
different  ways  two  parallel  symmetric  curves 
can  be  broken  due  to  occlusion,  low  contrasts 
or  imperfections  in  the  edge  detection.  An  ex¬ 
ample  is  shown  in  figure  12(a).  Competing  hy¬ 
potheses  yield  alternative  groupings  which  are 
evaluated  at  higher  levels  where  more  global  in¬ 
formation  becomes  available  (see  figure  11). 

(c)  The  above  step  can  give  symmetries  that  are 
not  complete.  Next  we  hypothesize  possible 
terminations,  based  on  continuity  of  one  of  the 
curves  giving  the  unterminated  symmetry  and 
other  contours  in  the  image  to  give  closure.  An 
example  is  shown  in  figure  12(b). 

3. Verification of global symmetries: here we test the grouped symmetries against global regularity criteria that consist of preferring monotonic, linear correspondences with closed boundaries. Note that this step will remove the effects of surface markings, even if they happened to form symmetries accidentally in step 2. Once consistent global symmetries and, consequently, global curves are obtained, large gaps can be filled in a way dictated by the similarity between symmetric curves.

4. In the final step, global symmetries and completed boundaries together with properties of ZGCs and SHGCs are used to hypothesize the existence of such surfaces in the scene. Here, complex cases of terminations (not handled in step 2(c) above) and limb boundary completion can be handled.

Figure 11: Block Diagram of the method

We show the operation of our method with three examples: one is an LSHGC (ZGC surface), the other two are SHGCs.

Figure 13(a) shows the boundaries of an LSHGC with many breaks and extraneous boundaries. Figure 13(b) shows the B-spline fit on the grouped curves and the axes of the local symmetries detected (output of steps 1 and 2(a) above). Figure 13(c) shows completed symmetries and boundaries. This input is now in a form that can be handled by our shape from contour system.

Figure 14(a) shows the boundaries of an SHGC. Notice the large break. Figure 14(b) shows the B-spline fit and the local symmetries. The gap has been filled in Figure 14(c) as a consequence of the global symmetry existing between the top cross-section boundaries and the bottom ones; i.e. the bottom part suggests how to complete the missing top part. Figure 14(d) shows the visible cross-sections and the recovered axis. For this, we use the property that lines of symmetry intersect on the axis.

Figure 12: Connection (a) and termination (b) hypotheses and boundary inference

Figure  15(a)  shows  the  boundaries  obtained  from  an¬ 
other  SHGC.  Note  that  this  time  part  of  the  right  limb 
boundary  is  missing.  This  is  a  more  serious  situation 
than  missing  part  of  the  cross-section  boundary  which 
can  be  bridged  by  using  the  parallel  symmetry  of  the 
cross-sections.  The  limb  completion  is  done  after  the 
cross-sections  in  the  incomplete  part  of  the  SHGC  have 
been  recovered.  We  use  the  property  that  limb  bound¬ 
aries  and  cross-sections  are  tangential.  Limb  points  are 
points  on  these  cross-sections  defining  a  tangential  enve¬ 
lope. 

Figure  15(d)  shows  the  axis,  recovered  cross-sections 
and  completed  boundaries  which  can  now  be  passed  to 
our  shape  from  contour  system. 


Figure 13: Processing of an LSHGC contour, panels (a), (b), (c)

These examples are meant to illustrate how the powerful constraints given by our shape from contour theory can be used to infer shapes even in the presence of significant breaks and markings. This system is still in development and has not been fully tested. In the future, we intend to develop methods that work on scenes with multiple, occluding objects and with complex objects that consist of assemblies of primitives we have studied.

Figure 14: Application on an SHGC contour

Figure 15: Application of the method to a SHGC contour with cross-section curve and limb broken

4 OBJECT RECOGNITION

4.1 Analysis of the TOSS system

We have introduced the TOSS object recognition system which is able to match general three dimensional objects from partial 3D data in an efficient way by using a method called structural indexing [Stein and Medioni, 1991]. Our algorithm uses a combined representation, which captures information about both smooth patches and discontinuity lines. For some objects, such as polyhedra, it is natural to use a representation based on edges. For that reason, we use the 3D curve as the basic feature for the representation of surface and depth discontinuities. We achieve a robust and stable representation by using multiple line fitting tolerances to obtain a set of polygonal approximations. The polygonal approximations are grouped in sets of connected segments. These super segments are encoded based on the angles between consecutive segments, providing invariance with respect to rotation, translation, and scale (even though scale is not needed in three dimensional object recognition).
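The idea of encoding a super segment by its angles and using the code as an index can be sketched in Python as follows. This is a minimal illustration, not the TOSS implementation. The quantization into fixed-width angle bins, the bin width of 20 degrees, and the names `encode_super_segment` and `build_index` are all assumptions.

```python
import math
from collections import defaultdict

def encode_super_segment(points, bin_width_deg=20.0):
    """Encode a 3D polyline by the quantized angles between
    consecutive segments (assumed encoding, for illustration)."""
    code = []
    for p0, p1, p2 in zip(points, points[1:], points[2:]):
        u = [b - a for a, b in zip(p0, p1)]
        v = [b - a for a, b in zip(p1, p2)]
        dot = sum(a * b for a, b in zip(u, v))
        norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
        cosine = max(-1.0, min(1.0, dot / norm))  # guard against rounding
        angle = math.degrees(math.acos(cosine))
        code.append(int(angle // bin_width_deg))
    return tuple(code)

def build_index(models, bin_width_deg=20.0):
    """Map each code to the names of the models that contain it."""
    table = defaultdict(list)
    for name, polylines in models.items():
        for polyline in polylines:
            table[encode_super_segment(polyline, bin_width_deg)].append(name)
    return table
```

Since only angles between consecutive segments enter the code, a rotated, translated, or uniformly scaled copy of a polyline receives the same code, and a scene feature can retrieve its candidate models with a single table lookup.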

For  some  objects,  however,  such  as  objects  bounded 
by  free  form  surfaces,  it  is  difficult  to  use  edges  for  the 
representation.  Therefore,  we  also  use  splashes,  based  on 
small surface patches where we can compute differential
properties  in  a  reliable  way.  A  splash  consists  of  a  radial 
grouping  of  surface  normals.  It  is  a  local  Gaussian  map 
describing  the  distribution  of  surface  orientation  along 
a  geodesic  circle.  A  splash  can  be  represented  by  two 
two-dimensional  periodic  functions,  which  can  also  be 
combined  into  one,  compact,  three  dimensional  curve. 
This  allows  us  to  use  a  unified  representation  scheme  for 
the  splash  and  the  above  mentioned  3D  curve. 

We  illustrate  the  recognition  of  three  models  of  com¬ 
posers  (shown  in  figure  16)  in  a  scene  with  occlusion. 


Figure 16: Three Composers, panels (a), (b), (c)

Figure 17: Projection of Detected Busts in Scene 3


Figure  17  shows  the  reprojection  of  the  recognized  ob¬ 
jects  onto  the  scene. 

Robustness  and  Stability  In  order  for  our  represen¬ 
tation  to  be  useful  in  recognizing  objects  from  real  data, 
it  is  necessary  to  address  the  issues  of  robustness  and 
stability  of  the  splash  representation  in  the  presence  of 
noise.  In  particular,  we  must  examine  the  robustness 
of  a  splash  regarding  location  uncertainty,  study  the  ef¬ 
fects  on  representation  and  matching  of  an  error  in  the 
reference  normal  orientation,  and  ask  how  much  noise 
can  be  added  to  the  surface  patch  and  still  have  a  stable 
representation. 

We have found empirically that our representation is adequate, since our system performs well on real data. We would have liked to model the behavior of our scheme on arbitrary free form surfaces, but they are too general to be of any help. Instead, we present a full analysis on a simple analytic model of a specific surface patch, as shown in Figure 18(a) and (b), for all three issues (location, orientation, and surface stability). Furthermore, for the orientation robustness issue, we study the effects of noise in the general case of a 3D curve (representing either one of our features).

Figure 18: The Corner Surface. (a) Location of the reference splash; (b) qualitative side view.

Robustness  in  the  Location  To  answer  the  question 
of  how  robust  the  representation  of  a  splash  is  with  re¬ 
spect  to  location  accuracy,  we  use  empirical  data  based 
on  an  example  environment  as  described  above. 

For  small  quantizations,  the  patch  is  recognized  only 
if  the  splash  is  very  close  to  the  location  of  the  original 
splash,  whereas  larger  quantizations  allow  a  much  larger 
latitude  in  location  uncertainty.  Of  course,  if  the  quan¬ 
tization  is  too  large,  the  splash  may  be  confused  with 
splashes from other locations. We therefore observe the desired robustness with respect to location uncertainty for this surface patch for some of the quantizations.

Robustness in the Reference Normal Here, we examine the influence of noise in the reference normal on the representation of a splash. Adding an error angle ε in the direction δ to the reference normal, how is the curve $v(\theta) = (\phi(\theta), \psi(\theta))^T$ influenced?

We find that the maximal error for φ is bounded by

$$\|\Delta\phi\| = \|\phi - \phi'\| \le \epsilon$$

and that the maximal error for ψ is bounded by

$$\|\Delta\psi\| = \|\psi - \psi'\| \le \epsilon.$$

We can now address the problem of how the representation might vary. Instead of a single curve v(θ), we now have to deal with an envelope of curves V(δ, ε), which is a disc of radius ε swept along v (see Figure 19).

Figure 19: Envelope of a 3-D curve

What are the polygonal approximations that approximate such a set of curves? How are the angles between consecutive line segments affected? To make a general statement is very difficult. Therefore we make some simplifying assumptions. As a result, we find that, for small quantizations, the patch is recognized only if the reference normal is very similar to the reference normal of the original splash. An uncertainty of up to two degrees is tolerable, and only for a certain δ direction. Larger quantizations allow a much larger error in the reference normal, and more freedom in the δ direction. The shape of the different distributions with respect to δ depends on the underlying surface and the quantization.

Stability with respect to Noise We want to address
the question of how much the underlying surface can be
corrupted before the stability of the splash representation
is reduced. We add zero-mean Gaussian noise
$a \cdot N(0, \sigma)$, with $a$ the amplitude (in pixels) and
$\sigma$ the standard deviation, to the surface $s$ to get a
corrupted surface, and compute the splash at the same
location as the reference splash on the corrupted surface.
For a given quantization we then match the “corrupted”
splash against the reference splash. We find that the
larger the quantization, the more stable the splash
representation is; the smaller the quantization, the less
stability we observe. The matching behavior is not affected
by whether we increase the amplitude of the noise and keep
the standard deviation small, or decrease the amplitude and
use a large standard deviation.
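
As an illustration only (this is not the code used in the experiments), the noise model $a \cdot N(0, \sigma)$ described above can be sketched in Python. The function name `corrupt_surface`, the list-of-rows layout of the range image, and the fixed random seed are assumptions made for this sketch.

```python
import random


def corrupt_surface(surface, amplitude, sigma, seed=0):
    """Return a copy of a range image with amplitude * N(0, sigma) added to each sample.

    `surface` is a list of rows of heights. The layout, the function name and
    the seed are assumptions of this sketch, not taken from the paper.
    """
    rng = random.Random(seed)
    return [[z + amplitude * rng.gauss(0.0, sigma) for z in row] for row in surface]


# A tiny synthetic range image: a tilted plane, 3 rows by 4 columns.
surface = [[float(x + y) for x in range(4)] for y in range(3)]
noisy = corrupt_surface(surface, amplitude=2.0, sigma=0.5)
```

Since $a \cdot N(0, \sigma)$ has standard deviation $a \cdot \sigma$, amplitude and standard deviation trade off against each other in this model, which may be one reason why the two noise regimes described above behave alike.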

Complexity Analysis The whole issue of complexity
focuses on the question: “What is the discriminative
power of the features in the system?” The answer to
this question is made up of two parts. The first part is
the analysis of the retrieval process and the number of
generated hypotheses. The second part is the discussion
of the verification step and the cost of grouping the
hypotheses into consistent clusters. In the following,
“scene features” refers only to those scene features
that lead to the generation of at least one
hypothesis. For the complexity analysis, we ignore the
scene features whose code is disjoint from all the keys
in the hash table (they also do not add any cost in
practice). Before we answer the above question, we
define the parameters involved:

•  n  =  the  number  of  features  in  the  scene, 

•  d  =  the  number  of  features  per  table  entry, 

• $m_s$ = the number of models in the scene,

•  and  m  =  the  number  of  models  in  the  data  base. 

To simplify the discussion, we assume
that every model has the same number of super segments,
and that the entries are equally distributed over
the table (d = const). Furthermore, we assume that
every model consists of f features (f = const).

Hypotheses Generation When we have n features
in the scene, the cost to generate all candidate
hypotheses with indexing is O(n), since each scene feature
requires one table lookup. The important issue for the
cost of the complete recognition is not the cost of
retrieval, but h, the number of hypotheses generated,
because this number has a crucial effect on the
verification step. The number h is proportional to the
number of features per table entry d ($h = d \cdot n$).
The larger h, the slower the clustering into consistent
clusters in the subsequent verification. Therefore we are
interested in a small h, which corresponds to a small d.
This so-called “ability to discriminate” is influenced
by the amount of noise in the data, the amount of
similarity between models, and the grid size for the
interest operator.

To summarize, the crucial value for the retrieval stage
is h, the number of generated hypotheses. In the best
case, this number is very low. The theoretical best case
occurs when every hypothesis votes for a different model
(d = 1 and $n = m_s$). This results in h = n. The worst
case corresponds to a large value of h: there is only
little discriminative power in the features, which means
that most features are encoded with the same code (large
d). Every scene feature generates all possible candidate
hypotheses, and therefore the overall number of retrieved
candidate hypotheses is $h = f \cdot m \cdot n$. This is the
approach taken by many systems of the past, which use
either points or lines as basic features, and must therefore
perform the discrimination task during the verification
step.
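
The relation between d and h can be made concrete with a small Python sketch of table-based hypothesis generation. It is a minimal illustration under assumptions: feature codes are plain integers, and the names `build_table` and `generate_hypotheses` are hypothetical, not the system's own.

```python
from collections import defaultdict


def build_table(models):
    """Map each feature code to the list of (model, feature index) entries storing it."""
    table = defaultdict(list)
    for name, codes in models.items():
        for i, code in enumerate(codes):
            table[code].append((name, i))
    return table


def generate_hypotheses(table, scene_codes):
    """One lookup per scene feature; every entry found under the code is a hypothesis."""
    hypotheses = []
    for s, code in enumerate(scene_codes):
        for name, i in table.get(code, ()):
            hypotheses.append((s, name, i))
    return hypotheses


# Discriminative features: every code occurs once in the table (d = 1), so h = n.
models = {"A": [1, 2, 3], "B": [4, 5, 6]}
h_good = generate_hypotheses(build_table(models), [1, 5, 3])

# No discriminative power: all features share one code (d = f * m), so h = f * m * n.
models_bad = {"A": [0, 0, 0], "B": [0, 0, 0]}
h_bad = generate_hypotheses(build_table(models_bad), [0, 0, 0])
```

With n = 3 scene features, the first case yields 3 hypotheses and the second yields 3 · 2 · 3 = 18, matching h = d · n and h = f · m · n respectively.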

Verification  As  mentioned  above,  the  task  of  the  veri¬ 
fication  is  to  cluster  the  h  hypotheses  into  mutually  con¬ 
sistent  clusters.  This  is  done  for  every  entry  in  the  cor¬ 
respondence  table. 

Best Case: In the best case, there is a lot of
discriminative power in the features. This corresponds to a
low d value. The $h = d \cdot n$ hypotheses are divided in the
correspondence table according to which model the
hypothesis votes for. For $m_s$ models in the scene we have
$m_s$ entries in the correspondence table with $h / m_s$
hypotheses each. For every entry these $h / m_s$ hypotheses
have to be grouped with respect to the geometrical
constraints. We distinguish the clustering process based on
different cases:

• Every model in the scene occurs only once: In this
case, the hypotheses in one entry either vote for the
model instance or they are wrong hypotheses. This
leads to

$O_{best}(n) = m_s \cdot O(h / m_s) = O(d \cdot n) = O(n).$

• Every model in the scene can occur more than once:
In this case the clustering of the hypotheses in one
entry cannot be done in $O(h / m_s)$, but in $O(h^2 / m_s^2)$
instead, leading to a complexity of

$O_{best}(n) = m_s \cdot O(h^2 / m_s^2) = O(d^2 n^2 / m_s) = O(n^2).$


The best case has a noteworthy side effect. Assuming
the number of scene features n fixed, and examining the
complexity $O_m$ with respect to the models in the data
base, we find that it grows as $O_m = O(k \cdot m)$, where m is
the number of stored models and k < 1.

Worst Case: In the worst case, there is little
discriminative power in the features. This corresponds to a
high value for d. The overall number of retrieved candidate
hypotheses is $h = d \cdot n$ with $d = m \cdot f$. These h
hypotheses are divided in the correspondence table according
to which model the hypothesis votes for. For m models in
the data base we get, in the worst case, m entries in the
correspondence table with h hypotheses each. For every
entry, these h hypotheses have to be grouped with respect
to the geometrical constraints. Clustering these h
hypotheses results in a complexity of

$O_{worst}(n, m) = m \cdot O(h^2) = O(f^2 n^2 m^3) = O(n^2 m^3).$

In the worst case, the ratio of good versus bad hypotheses
is very small.

Summary In conclusion, the practical complexity of
our system is

$O(n) < O_{recognition}$

In  the  case  of  well  distinguishable  models,  the  complexity 
comes  close  to  the  above  discussed  best  case.  An  example 
where  the  system  slows  down  is  shown  in  Figure  16,  a 
cluttered  scene  which  consists  of  three  composer  busts. 
The  system  detects  the  correct  models  and  computes 
the  correct  locations,  but  due  to  noisy  data  and  similar 
features,  the  discriminative  power  is  smaller,  and  the 
overall  recognition  process  is  slower. 
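
To give a feel for the gap between the two extremes, the cost expressions of the complexity analysis can be evaluated for sample values. The sketch below only counts abstract clustering operations under the simplifying assumptions stated above (constant d and f); the function names and the sample numbers are assumptions for illustration.

```python
def best_case_cost(n, d, m_s, repeated=False):
    """m_s table entries with h / m_s hypotheses each; quadratic if models may repeat."""
    h = d * n
    per_entry = h / m_s
    return m_s * (per_entry ** 2 if repeated else per_entry)


def worst_case_cost(n, m, f):
    """m table entries with h = f * m * n hypotheses each, clustered in O(h^2)."""
    h = f * m * n
    return m * h ** 2


# Sample values (hypothetical): 100 scene features, 5 models in the scene,
# 10 models in the data base, 20 features per model.
print(best_case_cost(100, 1, 5))                 # linear in n
print(best_case_cost(100, 1, 5, repeated=True))  # quadratic in n
print(worst_case_cost(100, 10, 20))              # grows as n^2 * m^3
```

These counts are proportionalities, not running times; they only show how quickly the worst case departs from the best case as m and f grow.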

4.2  Recognition  of  3D  objects  using  2D 
groupings 

Here  we  are  interested  in  the  recognition  of  a  three- 
dimensional  object  in  a  two-dimensional  scene.  This 
raises  the  question  of  how  we  describe  an  object  and 
which  underlying  primitives  we  use. 

Most approaches are based on low level primitives
(points, line segments, and other local features). In
object recognition, this strategy works only when a perfect
geometric model is available. Matching a model with a
scene then results in a set of correspondence hypotheses,
and further verification using geometric constraints
leads to the correct solution, but the computational
complexity is large.

We  instead  propose  an  alternate  approach  using 
groupings.  Using  perceptual  grouping  in  computer  vi¬ 
sion  is  an  old  idea,  but  very  few  implementations  have 
been  demonstrated  [Lowe,  1985;  Mohan  and  Nevatia, 
1989b;  Huttenlocher  and  Wayner,  1991;  Sha’ashua  and 
Ullman,  1988]. 

Here,  we  develop  a  feature  hierarchy  which  can  be  used 
for  object  recognition  of  three-dimensional  objects  from 
a  two-dimensional  scene.  In  this  hierarchy,  we  propose 
specific  groupings  based  on  proximity,  parallelism,  sym¬ 
metry,  and  closure.  The  detection  of  these  features  is 
performed  in  an  efficient  way  using  proximity  indexing. 
First, we generate features with multiple representations
to overcome unreliability of local algorithms in the
preprocessing, and to handle noise and capture different
levels of detail. Later, we merge perceptually similar features
at higher levels of the feature hierarchy. As models, we
use  a  set  of  non-registered  views  of  a  3-D  object.  While 
most  other  systems  use  spatial  correspondences  to  ver¬ 
ify  matching  hypotheses,  we  use  high  level  features  and 
their  topological  relationships  for  the  recognition  process. 
These  features  are  grouped  based  on  closure  and  prox¬ 
imity  to  generate  so  called  high  level  groupings  which  are 
stored  in  a  table.  Using  indexing,  we  retrieve  matching 
hypotheses,  which  are  verified  against  each  other  with  re¬ 
spect  to  topological  constraints.  Groups  of  consistent  hy¬ 
potheses  represent  detected  model  instances  in  a  scene. 
A  more  detailed  description  can  be  found  elsewhere  in 
these  proceedings  [Stein  and  Medioni,  1992].  An  exam¬ 
ple  of  successful  recognition  is  shown  in  Figure  20. 


(a)  Model 


Figure  20:  Recognition  based  on  High  Level  Groupings 


4.3 Application to the “drop-off” problem

Navigation  using  maps  requires  the  frequent  updating  of 
the  location  of  an  observer  with  respect  to  the  map.  Hu¬ 
mans  performing  this  task  use  a  large  number  of  differ¬ 
ent  problem  solving  techniques,  both  expectation  driven, 
where  a  landmark  is  selected  on  the  map  and  searched 
for,  or  data  driven,  where  image  features  are  extracted 
first  and  matched  against  the  map. 

The  problem  most  often  addressed  by  previous  re¬ 
searchers  is  that  of  updating  the  current  position,  given 
a  good  initial  solution.  This  can  be  solved  by  correct¬ 
ing  (small)  errors  between  the  predicted  and  observed 
aspects  of  the  scene. 

Here instead, we study the “drop-off” problem. An
excellent overview of the whole problem is given in
[Thompson et al., 1990]. The authors describe a preliminary
computational model for the drop-off problem (the
name drop-off comes from the case in which an observer
is “dropped off” into an unfamiliar environment and has
to orient himself). The observer stays at one position
and tries to find his location based on visible landmarks
and salient features. In our approach, the observer tries
to establish his location based on the curve described
by the panoramic horizon (as explained later) which is
visible from his viewpoint.

It should be noted that people experience serious
difficulties in solving such localization problems, leading
many researchers to suggest that traditional object
recognition strategies are unlikely to succeed [Thompson
et al., 1990]. This is based on the observation that
the combinatorics of the problem are very unfavorable,
as the shapes are complex and the number of different
aspects is extremely large.

We challenge this view and propose that a table-based
matching strategy [Stein and Medioni, 1990; Stein and
Medioni, 1991] can overcome the limitations mentioned
previously. We propose to extract from many locations
in the map the panoramic horizon curves, which correspond
to the crest line perceived by the observer as he
completes a full 360° view in place. Such curves are
encoded and stored in a table. To determine an unknown
location, we first extract the panoramic horizon curve of
the unknown location. Then we approximate it by a family
of polygons with different line fitting tolerances. By
using indexing into the table, we retrieve candidate
locations. The correct candidate (the closest one) is found
by applying further geometrical constraints during the
verification step.
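
The step of approximating a curve by a family of polygons with different line fitting tolerances can be illustrated with a short Python sketch. The text does not say which line fitting method is used, so the sketch substitutes the well-known Ramer-Douglas-Peucker recursive split as a stand-in; it also treats the curve as an open polyline, whereas a panoramic horizon is periodic. All names are hypothetical.

```python
import math


def point_line_distance(p, a, b):
    """Perpendicular distance from point p to the line through a and b."""
    (px, py), (ax, ay), (bx, by) = p, a, b
    dx, dy = bx - ax, by - ay
    length = math.hypot(dx, dy)
    if length == 0.0:
        return math.hypot(px - ax, py - ay)
    return abs(dy * (px - ax) - dx * (py - ay)) / length


def approximate(points, tolerance):
    """Ramer-Douglas-Peucker: split at the farthest point until all fit the tolerance."""
    if len(points) < 3:
        return list(points)
    a, b = points[0], points[-1]
    idx, dmax = 0, 0.0
    for i in range(1, len(points) - 1):
        dist = point_line_distance(points[i], a, b)
        if dist > dmax:
            idx, dmax = i, dist
    if dmax <= tolerance:
        return [a, b]
    left = approximate(points[:idx + 1], tolerance)
    right = approximate(points[idx:], tolerance)
    return left[:-1] + right  # drop the duplicated split point


def polygon_family(points, tolerances=(2, 3, 4, 5)):
    """One polygonal approximation per tolerance; coarser tolerances keep fewer vertices."""
    return {t: approximate(points, t) for t in tolerances}


# A flat curve with a single peak of height 3.5 at x = 10.
curve = [(float(i), 3.5 if i == 10 else 0.0) for i in range(21)]
family = polygon_family(curve)
```

For this toy curve, tolerance 2 keeps the peak and its two flanks (five vertices), tolerance 3 keeps only the peak (three vertices), and tolerances 4 and 5 reduce the curve to a single segment. Storing several such approximations gives the table multiple descriptions of the same curve at different levels of detail.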

It  should  be  clear  to  the  reader  that  this  approach  is 
not  guaranteed  to  provide  a  unique  answer,  as  it  is  easy 
to  come  up  with  counter  examples  (planar  or  repetitive 
environment),  but  we  show  that  it  gives  excellent  results 
in  complex,  real  environments,  as  shown  below. 

Definitions 

Map: A map in our definition is a topographic map
which provides us with the three-dimensional coordinates
for each surface point (e.g. range data). The z component
(altitude or height) can be retrieved from the x and
y coordinates: $z = \mathrm{Map}(x, y)$. We assume that the map
is small with respect to the associated planet sphere.
Therefore the map can be considered as a small planar
patch. The z axis is perpendicular to this patch.

Viewer or Observer: The viewer is an observing
platform with coordinates $v = (x_v, y_v, z_v)$, and the
camera is at $z_v = \mathrm{Map}(x_v, y_v) + h$, where h is the height of
the camera.

Horizon: The horizon is the upper bound of the
projection of all landscape points on a cylinder around the
observer as shown in Figure 21. Such a panoramic horizon
H is a periodic curve. Projecting the curve back on
the landscape results in a set of three-dimensional points
$\mathbf{H}$ whose projection on the map is shown in Figure 22.
H and $\mathbf{H}$ are defined in the following way:

$H(v) = \{ \max_d \frac{\Delta z(d)}{d}, \; 0^\circ \le \theta < 360^\circ \}$

$\mathbf{H}(v) = \{ p_v(\theta, d) \mid \max_d \frac{\Delta z(d)}{d}, \; 0^\circ \le \theta < 360^\circ \}$

with

$v = (x_v, y_v, z_v)$, $p_v(\theta, d) = \ldots$, $d = \ldots$, $\Delta z(d) = z - z_v$.

Figure 21: Horizon (figure annotation: height h)

We call $r(d) = \frac{\Delta z(d)}{d}$, the tangent of $\alpha$, the relative height.
Without loss of generality we set 0° = 360° as the North
direction and define the orientation of a horizon as
clockwise (as seen from above). An example can be seen in
Figure 22. We display a horizon as a graph as seen from
the observer. The abscissa represents the direction $\theta$
(0° = north), the ordinate represents the relative height
r. The graph of the horizon of Figure 22 can be seen in
Figure 23.
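
The definition of the panoramic horizon can be sketched in Python as a ray march over the elevation array: for every direction, the relative height (height difference divided by distance) is maximized along the ray. This is a minimal sketch under assumptions that are not in the text: nearest-neighbor sampling with a fixed step, north pointing toward decreasing row index, and rays that simply stop at the map border (the flat surface assumed outside the map is not modeled). The name `horizon` and its parameters are hypothetical.

```python
import math


def horizon(dem, xv, yv, cam_height, n_dirs=360, step=1.0):
    """Return the maximal relative height r for n_dirs directions around the viewer.

    dem[y][x] holds terrain heights. Direction 0 is north, and directions
    increase clockwise as seen from above.
    """
    rows, cols = len(dem), len(dem[0])
    zv = dem[yv][xv] + cam_height            # camera height above the terrain
    result = []
    for k in range(n_dirs):
        theta = 2.0 * math.pi * k / n_dirs
        dx, dy = math.sin(theta), -math.cos(theta)   # north = decreasing row (assumption)
        best = -math.inf                     # stays -inf if the ray leaves the map at once
        d = step
        while True:
            x = int(round(xv + d * dx))
            y = int(round(yv + d * dy))
            if not (0 <= x < cols and 0 <= y < rows):
                break
            best = max(best, (dem[y][x] - zv) / d)   # relative height at distance d
            d += step
        result.append(best)
    return result


# Flat 21 x 21 terrain with a wall of height 10 along column x = 15.
dem = [[10 if x == 15 else 0 for x in range(21)] for y in range(21)]
r = horizon(dem, xv=10, yv=10, cam_height=0.0, n_dirs=4)  # north, east, south, west
```

In this toy terrain the wall is five pixels east of the viewer, so the eastward relative height is 10 / 5 = 2, while the other three directions see only flat ground at relative height 0.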


Representation  We  represent  the  map  by  a  set  of 
horizons  computed  for  the  points  of  a  grid  superimposed 
on  the  map.  The  spacing  of  the  grid  determines  the 
accuracy  with  which  we  can  find  the  correct  location. 
The  horizons  are  stored  in  a  table,  implemented  as  a 
hash  table.  A  table  allows  efficient  storage  (only  point¬ 
ers  are  recorded),  the  indexing  scheme  allows  fast  access, 
and  different  super  segments  with  the  same  keys  can  be 
stored  in  cellar  like  buckets.  The  table  (data  base)  grows 
in  size  with  the  number  of  recorded  horizons.  This  pro¬ 
cess  of  building  the  data  base  from  the  map  is  performed 
off  line. 


Results The localization algorithm is now illustrated
with an example from a real terrain map. The horizon
seen by the observer is simulated. For the presentation
of the range data, we always display the artificially
shaded images.


Figure 22: Map with Superimposed Reprojected Horizon


(a) Panoramic View of Observer (rendered)


(b) Extracted Horizon Curve


Figure 23: Graph of Horizon

As a terrain map we use the DEM (digital elevation
model) covering the Martin Marietta ALV test area. A
digital elevation model is a two-dimensional array of
uniformly spaced terrain elevation measurements. Our
DEM consists of 810 × 702 pixels, with a pixel
corresponding to 5 × 5 m² on the ground. The whole
map corresponds to an area of approximately 4 × 3.5 km².
Outside of the map we assume a flat surface (sometimes
the horizon line is not limited to the area described by
the map). The horizons on the map are sampled on a
grid with a grid spacing of 15 pixels, which corresponds
to 75 m. For better visibility we exaggerate the elevation
data. The lowest elevation is 21 (in pixels), the highest
is 252. The height of the observer above the surface was
always constant (2 pixels). For the linear approximation
we used the line fitting tolerances 2, 3, 4, and 5. To
obtain useful linear approximations, we scale the relative
height by a factor of 50. These values are not critical, as
significant deviations from them do not affect the
results. The localization takes less than 1 minute.
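
The figures quoted for the map can be checked with a few lines of arithmetic. The number of sampled horizons is not given in the text; the count below rests on the assumption that the sampling grid starts at pixel 0 in both directions, and the variable names are chosen for this sketch only.

```python
PIXEL_SIZE_M = 5                     # one pixel covers 5 m x 5 m on the ground
WIDTH_PX, HEIGHT_PX = 810, 702
GRID_SPACING_PX = 15

width_m = WIDTH_PX * PIXEL_SIZE_M            # 4050 m, about 4 km
height_m = HEIGHT_PX * PIXEL_SIZE_M          # 3510 m, about 3.5 km
spacing_m = GRID_SPACING_PX * PIXEL_SIZE_M   # 75 m between stored horizons

# Assumption: grid points at pixels 0, 15, 30, ... along each axis.
grid_cols = len(range(0, WIDTH_PX, GRID_SPACING_PX))
grid_rows = len(range(0, HEIGHT_PX, GRID_SPACING_PX))
n_horizons = grid_cols * grid_rows

print(width_m, height_m, spacing_m, grid_cols, grid_rows, n_horizons)
```

Under that assumption the table would hold on the order of 2,500 horizons, one per grid point, which also bounds the localization accuracy to roughly the 75 m grid spacing.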

The observer is located in a valley between two ridges
(see Figure 22). Figure 23 shows the rendered panoramic
view from the location of the observer (a) and the
extracted horizon curve (b). After the matching, the system
returns a set of location hypotheses. The best hypothesis
can be seen in a closeup in Figure 24(b). Figure 25 shows
the rendered view seen by the observer in comparison with
the rendered panoramic view of the detected result.


Figure 24: Closeup of the Detected Result


(a)  Rendered  Panoramic  View  of  Observer 


(b)  Rendered  Panoramic  View  of  Computed  Location 


Figure 25: Simulated Panoramic Views for Horizon


5  MOTION  ANALYSIS 

We  have  had  a  number  of  projects  in  the  analysis  of  se¬ 
quences  of  images  including  analysis  of  closely  spaced 
images,  feature  based  analysis,  motion  estimation  tech¬ 
niques,  and  navigation  using  recognition  of  visual  fea¬ 
tures.  Autonomous  navigation  provides  the  context  for 
much  of  the  work,  though  these  techniques  have  a  much 
broader  utility. 

Motion  analysis  using  feature  point  analysis  tech¬ 
niques  and  multiple  frames  forms  the  central  focus  of 
our  work.  This  approach  involves  extracting  a  set  of 
consistent  features  from  a  sequence  of  images,  finding 
the  corresponding  features  in  consecutive  frames,  and  fi¬ 
nally  computing  the  three-dimensional  motion  based  on 
the  correspondences,  which  also  provides  an  estimate  of 
the  structure  of  the  moving  objects  or  scene.  These  are 
often  described  separately  or  as  sequential  operations, 
but  integration  into  a  single  system  and  feedback  to  ear¬ 
lier  processing  is  a  major  part  of  the  work. 


Our effort includes several separate and related
projects, including: analysis of closely spaced images
(spatio-temporal analysis) using features such as lines,
corners, and regions to extract three-dimensional
structure information; matching edge-based contours in a
sequence of images; integrating several feature detection
and matching techniques to derive three-dimensional
motion and structure estimates; study of the formulation
of the motion estimation problem; detection of moving
objects in a scene with a moving observer; and the
visual guidance of a mobile robot. Several of these projects
have been completed in the past year, with the resulting
theses being produced. This overview discusses the
current status of the research in these areas. Some of these
areas are covered in more detail in other papers in these
proceedings or in other recent conference papers.

5.1  Spatio-Temporal  Analysis 

The goal of our work in spatio-temporal analysis is to
generate a dense optic flow map from a motion sequence.
Because of the sparseness of 0D features (e.g. corners) or
1D features (e.g. curves), we feel that 2D features (e.g.
regions) are more likely to produce dense motion
estimates. Early work in spatio-temporal analysis includes
that of [Bolles et al., 1987]. Our work began with [Peng
and Medioni, 1988; Peng and Medioni, 1989], with the
extraction of paths in slices taken in the temporal
direction of the spatio-temporal data volume (i.e. paths of an
object point through time and space). This produces an
image velocity estimate only along object contours.

In  order  to  generate  a  dense  displacement  field,  more 
analysis  of  the  slice  data  is  needed.  Strips  that  corre¬ 
spond  to  trapezoidal  regions  found  in  the  slices  through 
the  temporal  dimension  of  the  image  volume  are  con¬ 
structed  for  selected  orientations  throughout  the  image. 
These  extracted  strips  provide  estimates  of  the  veloc¬ 
ity  component  along  the  slice  orientation.  The  velocity 
estimates  of  different  slice  orientations  are  combined  to 
compute  the  velocity  constraint  for  each  pixel.  A  voting 
scheme  is  used  to  extract  the  position  of  the  Focus  of 
Expansion,  which  can  then  be  used  to  compute  the  real 
velocity  of  the  pixels. 
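
The voting idea behind the Focus of Expansion (FOE) estimate can be sketched in Python under a strong simplification: the sketch assumes that a full image velocity vector is already available at a set of pixels and that the motion is a pure translation, so every flow line passes through the FOE. The coarse accumulator, the cell size, the tolerance, and the name `vote_foe` are assumptions, and the slice-based velocity constraints of the actual method are not modeled.

```python
import math


def vote_foe(points, flows, width, height, cell=4, tol=2.0):
    """Each accumulator cell collects a vote from every flow line passing within tol of it."""
    best, best_votes = None, -1
    for cy in range(cell // 2, height, cell):
        for cx in range(cell // 2, width, cell):
            votes = 0
            for (px, py), (u, v) in zip(points, flows):
                norm = math.hypot(u, v)
                if norm == 0.0:
                    continue  # a zero vector defines no line
                # Perpendicular distance from the cell centre to the flow line.
                dist = abs(v * (cx - px) - u * (cy - py)) / norm
                if dist <= tol:
                    votes += 1
            if votes > best_votes:
                best, best_votes = (cx, cy), votes
    return best, best_votes


# Synthetic expanding flow around a known FOE at (18, 10).
foe = (18.0, 10.0)
points = [(2, 2), (30, 4), (10, 20), (35, 20), (18, 2), (2, 10)]
flows = [(0.1 * (px - foe[0]), 0.1 * (py - foe[1])) for px, py in points]
estimate, votes = vote_foe(points, flows, width=40, height=24)
```

On this synthetic input all six flow lines meet at the true FOE, so the cell centred there collects every vote. With real data the peak is broader, and the cell size and tolerance trade localization accuracy against robustness to noisy velocity estimates.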

This  process  is  very  expensive  (requiring  hours  on  se¬ 
rial  machines),  but  most  of  the  computation  is  easily 
performed  on  the  SIMD  architecture  of  the  Connection 
Machine.  This  algorithm  was  transferred  to  a  CM-2  with 
very  good  results  for  computational  speed-up.  Much  of 
this  work  was  reported  on  in  previous  years  and  has  now 
been  completed  with  more  detail  in  [Peng,  1991]. 

5.2  Motion  Estimation 

We  have  continued  our  exploration  of  techniques  for 
computing  structure  from  motion  using  feature  matches 
through multiple frames. The use of multiple (as opposed
to two) frames is desirable for several reasons:

•  to  increase  the  robustness  of  the  solution, 

•  to  allow  recovery  of  structure/motion  with  fewer 
features  being  tracked,  and 

•  to  allow  estimation  of  “higher  order  derivatives”  of 
the motion.

Figure 26: Reconstructed/Extrapolated Trajectories for
the Rocket Field Sequence (Based on Frames 0 through
6): Data for Frames 0 through 19 Overlaid on Image 0

We  have  completed  development  and  implementation 
of  an  algorithm  for  the  shape  from  motion  problem  given 
point  feature  correspondences  and  perspective  projec¬ 
tion.  This  solution  works  for  a  class  of  motions  called 
chronogeneous  motion,  which  includes  uniform  acceler¬ 
ation  and  constant  angular  velocity  rotation  and  trans¬ 
lation  as  special  cases.  The  solution  is  by  an  iterative 
algorithm  that  recovers  the  three-dimensional  motion  of 
the  feature  points  and  the  three-dimensional  location  of 
each  feature  in  each  frame.  An  additional  closed  form 
algorithm  that  recovers  motion  and  structure  for  uni¬ 
form  acceleration  is  used  to  generate  initial  guesses  for 
the  iterative  procedure  [Franzen,  1991a]. 
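
For orientation, the forward model behind the uniform acceleration special case can be written down in a few lines of Python: a point moves with constant acceleration in three dimensions and is observed through perspective projection. This is only the forward model under assumed conventions (camera at the origin looking along z, unit focal length, one time unit per frame); it is not the recovery algorithm of [Franzen, 1991a], and all names are hypothetical.

```python
def position(p0, v, a, t):
    """3-D position at time t under uniform acceleration: p0 + v*t + 0.5*a*t^2."""
    return tuple(p + vi * t + 0.5 * ai * t * t for p, vi, ai in zip(p0, v, a))


def project(point, focal=1.0):
    """Perspective projection onto the image plane (camera looks along +z)."""
    x, y, z = point
    return (focal * x / z, focal * y / z)


def image_trajectory(p0, v, a, frames, focal=1.0):
    """Image-plane trajectory of one feature over the given frame times."""
    return [project(position(p0, v, a, t), focal) for t in frames]


# A feature 10 units in front of the camera, moving sideways and accelerating upward.
traj = image_trajectory((0.0, 0.0, 10.0), (1.0, 0.0, 0.0), (0.0, 2.0, 0.0), range(3))
```

Once the motion parameters have been estimated from early frames, evaluating such a model at later frame times gives extrapolated image trajectories of the kind overlaid in Figure 26.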

These algorithms are discussed further in these
proceedings [Franzen, 1992] with additional results, or in
the thesis [Franzen, 1991b]. Figure 26 shows an example
of reconstructed trajectories of objects in the image
plane based on the computed three-dimensional motion.
Table 1 illustrates the structure accuracy (compared to
measured ground truth) for this problem. These results
show that this algorithm performs well in recovering
structure and motion parameters from feature point
correspondences. We are using this motion estimation
technique in our other motion work [Kim and Price, 1992].
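
The % Error column of Table 1 can be reproduced from the two depth values listed for each point. The Python check below assumes that the error is the relative difference of the second depth value with respect to the first; the source does not state which of the two values is the ground truth, so this reading is an assumption that merely happens to match the printed numbers to within rounding.

```python
# (point, first depth value, second depth value, printed % error) from Table 1.
ROWS = [
    (1, 38.98, 42.71, 9.56),
    (3, 27.07, 28.64, 5.80),
    (5, 31.59, 29.21, -7.51),
    (7, 35.48, 38.81, 9.39),
    (9, 27.29, 25.93, -5.00),
    (11, 66.77, 97.21, 45.60),
    (13, 35.74, 33.40, -6.53),
    (15, 44.20, 37.34, -15.51),
    (17, 56.79, 63.08, 11.09),
]


def percent_error(first, second):
    """Relative difference of the second value with respect to the first, in percent."""
    return 100.0 * (second - first) / first


deviations = [abs(percent_error(a, b) - printed) for _, a, b, printed in ROWS]
for (pnt, a, b, printed), dev in zip(ROWS, deviations):
    print(pnt, round(percent_error(a, b), 2), printed, round(dev, 3))
```

Point 11 stands out with a depth error of about 45 percent, while the remaining points stay within roughly 16 percent.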

5.3  Integrated  System  for  Motion 

We have continued to develop and use an integrated
system for testing and combining each of the subsystems
of the motion analysis system (segmentation, feature
extraction and matching, motion estimation, motion
feedback to matching). In a paper in these proceedings [Kim
and Price, 1992], we present an approach to improve the
results of matching features using a feature-based
motion analysis technique applied to multiple frames.
Automatic correspondence procedures produce noisy results
due to variations in the feature extraction process, errors
in the matching process, or noise in the actual motion.
Tracking the same feature through a number of frames
results in fragmented trajectories due to occlusions or
missing matches.

Pnt        x        y        z    Depth   % Error
  1    11.57   -11.56    37.23    38.98      9.56
       14.19   -10.38    40.28    42.71
  3     4.41    -9.10    26.71    27.07      5.80
        5.60    -8.43    28.09    28.64
  5    -7.33    -9.88    30.72    31.59     -7.51
       -6.28    -7.87    28.53    29.21
  7    16.41   -10.49    31.46    35.48      9.39
       18.69    -9.24    34.02    38.81
  9     0.53    -9.33    27.29    27.29     -5.00
        0.87    -7.70    25.91    25.93
 11    -3.08    -8.83    66.69    66.77     45.60
       -2.36    -5.14    97.18    97.21
 13    -2.43   -11.10    35.65    35.74     -6.53
       -1.71    -8.74    33.36    33.40
 15     0.82   -11.68    44.19    44.20    -15.51
        1.71    -5.88    37.30    37.34
 17    21.32   -15.61    52.63    56.79     11.09
       23.96   -13.22    58.36    63.08

Table 1: Reconstructed Structure Compared to Ground
Truth Values for the Rocket Field Sequence

We developed a technique that gradually refines the
initial noisy correspondence data by using the future
trajectory of matched points based on the estimated motion
parameters and eliminating those points that are not
compatible with the others in the sequence and with
the estimated motion. We also link fragments of a
single feature into a single trajectory, using the
three-dimensional motion estimate to indicate which image
points should correspond to the same real feature.

Results from this system are illustrated in Figure 27,
which  shows  the  initial  correspondence  data  used  as  in¬ 
put  to  the  system  and  the  refined  and  linked  correspon¬ 
dence  data  used  for  the  final  motion  estimation.  The 
trajectories  are  drawn  on  the  first  image  of  the  sequence 
and  correspond  to  sequences  that  may  start  and  end  at 
any  frame  in  the  fifteen  frame  sequence. 

5.4  Mobile  Platform 

We have used our mobile robot as a testbed for studying
simple processing techniques that can be used for
visual guidance. On this platform we have implemented
a simple sonar-based object avoidance procedure that
prevents collisions with objects and allows the vision
system to concentrate on other navigation goals. We
are developing a trinocular (three camera) stereo system
for reliable generation of basic three-dimensional
descriptions of the scene for navigation.

(a) Initial correspondence


(b) Final correspondence results


Figure 27: Correspondence data used in motion
estimation

6 AERIAL IMAGE ANALYSIS

Our work in aerial image analysis consists of two major
components. The first is the transfer of technology funded
by DARPA to the RADIUS program. Specifically, we are
focusing on the transfer of our techniques for building
detection in aerial images [Huertas and Nevatia, 1988b;
Mohan and Nevatia, 1989c]. The second project is the
continuation of our long range effort of analyzing complex
cultural domains. We have chosen large commercial
airports as a test domain. In previous work [Huertas et al.,
1990b; Huertas et al., 1989b] we have shown good results on
the detection of runways and taxiways. In this report,
we describe our recent work on aircraft detection.

6.1  Aircraft  Detection 

We are building a program to detect commercial and
private aircraft in aerial images. There are many types
of aircraft that can be present at a commercial airport
complex, including the Boeing 737 and 747, Lockheed L-1011,
and de Havilland DHC-6. These aircraft are varied in
shape and size. For instance, some have engines on the
rear fuselage, some on the wings, and some are front
propelled. We would like to avoid having a different detailed
model for each aircraft in our system, as is the case in
[Heller and Mundy, 1990] for example. Most aircraft in
the commercial airport setting have a fuselage and two
wings placed symmetrically relative to the fuselage. We
will use this basic model to hypothesize aircraft in a
two-step process.


The  first  step  of  this  process  assumes  a  reasonable  con¬ 
trast  between  the  aircraft  and  the  surface  on  which  the 
aircraft  sits.  The  example  we  use  is  a  2048x2048  portion 
of  Boston  Logan  Airport  28.  First  step  of  this  process  is 
to  detect  linear  segments  29.  Next,  we  find  anti-parallels 
which  are  possible  candidates  for  aircraft  fuselages  30. 
These  anti-parallels  can  correspond  to  the  fuselage  sec¬ 
tion  in  front  of  the  wings  (front  fuselage)  or  the  fuselage 
section  behind  the  wings  (rear  fuselage).  In  order  to 
instantiate  a  hypothesis  for  an  aircraft,  the  fuselage  sec¬ 
tions  must  be  supported  by  a  wing  pair.  In  this  step, 
the  wings  are  sets  of  line  segments  placed  symmetrically 
relative  to  the  candidate  fuselage  within  certain  angular 
and  distance  constraints.  The  angle  of  the  wing  relative 
to  the  fuselage  varies  according  to  the  aircraft  type,  how¬ 
ever  the  variance  is  reasonable  to  constrain  the  search.  If 
a  fuselage  has  more  than  one  wing-  pair  match,  we  group 
the  fuselage  hypothesis  anti-parallels  using  the  co-linear 
grouping  functions  from  the  transportation  network  sys¬ 
tem  to  try  to  establish  one  solid  fuselage  hypothesis.  Af¬ 
ter  fuselage  fragment  grouping,  we  seek  further  support 
for  the  aircraft  hypothesis  by  looking  for  further  image 
evidence  of  wings  being  interrupted  by  aircraft  engines. 
Further,  engine  placement  can  be  a  cue  to  aircraft  type 
identification. 
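
The anti-parallel search in this first step can be illustrated with a minimal sketch. It assumes each linear segment is a directed pair of endpoints (directed so that opposite sides of a ribbon run in opposite directions) and that segments have non-zero length. The function names, the angular tolerance, and the width limits are hypothetical values chosen for illustration; they are not taken from the system described here, and no overlap test or wing-pair check is included.

```python
import math
from itertools import combinations


def direction(seg):
    """Direction angle (radians) of a directed segment ((x1, y1), (x2, y2))."""
    (x1, y1), (x2, y2) = seg
    return math.atan2(y2 - y1, x2 - x1)


def angle_diff(a, b):
    """Smallest absolute difference between two angles, in [0, pi]."""
    d = (a - b) % (2 * math.pi)
    return min(d, 2 * math.pi - d)


def midpoint(seg):
    (x1, y1), (x2, y2) = seg
    return ((x1 + x2) / 2, (y1 + y2) / 2)


def perpendicular_distance(point, seg):
    """Distance from a point to the infinite line through a segment."""
    (x1, y1), (x2, y2) = seg
    px, py = point
    length = math.hypot(x2 - x1, y2 - y1)
    return abs((x2 - x1) * (py - y1) - (y2 - y1) * (px - x1)) / length


def find_antiparallels(segments, max_angle_dev=math.radians(10),
                       min_width=5.0, max_width=60.0):
    """Return (i, j, width) for segment pairs that run in roughly opposite
    directions and are separated by a plausible ribbon width (assumed limits)."""
    pairs = []
    for i, j in combinations(range(len(segments)), 2):
        a, b = segments[i], segments[j]
        # Anti-parallel: directions differ by about 180 degrees.
        if abs(angle_diff(direction(a), direction(b)) - math.pi) > max_angle_dev:
            continue
        width = perpendicular_distance(midpoint(b), a)
        if min_width <= width <= max_width:
            pairs.append((i, j, width))
    return pairs


segments = [
    ((0, 0), (100, 0)),    # left to right
    ((100, 20), (0, 20)),  # right to left, 20 units away
    ((0, 0), (0, 50)),     # perpendicular, never matches
    ((0, 40), (100, 40)),  # left to right again
]
candidates = find_antiparallels(segments)
```

Each returned pair is only a fuselage candidate; in the process described above it would still need support from a symmetric wing pair before an aircraft hypothesis is instantiated.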

However,  often  there  is  very  little  contrast  between 
an  aircraft  and  the  runway/terminal  area  on  which  it 
sits.  In  this  case,  we  rely  on  shadow  information  in  the 
second step of our aircraft identification process. Here,
we detect candidate fuselage sections by looking
for thin dark anti-parallels (Figure 31) corresponding to aircraft
shadow  hypotheses.  The  wing  pair  detection  process  is 
also  the  same,  except  we  concentrate  on  finding  dark 
anti-parallels  corresponding  to  the  wing  shadows.  We  do 
not  go  back  into  the  data  and  look  for  further  evidence 
because  it  is  assumed  that  there  is  insufficient  contrast 
to  find  any  more  details. 

Our final set of hypotheses corresponds to a conjunction
of the aircraft detected from the ribbon description
(Figure 32) and those found using the shadow evidence. We hope
to  improve  this  module  by  integrating  the  two  steps  to 
give  higher  confidence  to  aircraft  detected  explicitly  from 
their  general  shape  which  also  have  shadow  evidence.  We 
plan  to  use  the  aircraft  hypothesis  in  our  system  to  help 
constrain  the  search  for  complex  building  structures,  and 
to  improve  our  airport  transportation  network  detection 
system. 

7  PARALLEL  PROCESSING 

The parallel processing group at USC, in collaboration
with the vision group, has been actively involved in devising
efficient parallel solutions to problems from all levels
of  image  understanding  [Prasanna  Kumar,  1991].  Our 
research  is  focused  on  understanding  the  key  problems 
in  parallelizing  the  techniques  developed  by  the  vision 
community  (in  particular  those  developed  by  the  vision 
group  at  USC),  and  also  implementing  the  solutions  on 
available  parallel  machines.  A  summary  of  our  current 
work is outlined below.

Figure 28: Boston Logan Airport


Figure  29:  Linear  Segments 


Stereo  and  Image  Matching  Stereo  matching  is  one 
of  the  well  known  methods  for  extraction  of  depth  infor¬ 
mation.  Depth  recovery  is  a  crucial  problem  in  image 
understanding  with  applications  in  robotics  and  naviga¬ 
tion.  For  stereo  matching,  we  have  proposed  0{^^) 
time  algorithm  on  a  P  processor  fixed  size  linear  array, 
where  N  is  the  number  of  line  segments  in  one  image,  n 
is  the  number  of  line  segments  in  a  window  determined 
by the object size, and P < n [Khokhar et al., 1991].
This algorithm is a parallel implementation of the stereo
matching algorithm proposed by Medioni and Nevatia in
[Medioni and Nevatia, 1985].

Figure 30: Fuselage Anti-parallels

Figure 31: Shadow Anti-parallels

Discrete relaxation techniques have been widely used
in computer vision and artificial intelligence. For the
image matching problem, the discrete relaxation technique
outlined  in  [Medioni  and  Nevatia,  1984]  leads  to  a  se¬ 
quential  execution  time  of  O(n^m^)  for  labelling  n  ob¬ 
jects  with  m  labels.  In  [Lin  and  Prasanna,  1991]  we  have 
proposed  a  faster  sequential  algorithm  for  image  match¬ 
ing  which  runs  in  O(n^m^)  time,  where  n  is  the  number 
of  line  segments  in  the  image  and  m  is  the  number  of  line 
segments  in  the  model.  Also,  a  partitioned  parallel  im¬ 
plementation  has  been  developed  by  using  the  proposed 
sequential  algorithm.  0{{^+P)nm)  time  performance 
is achieved on a P processor fixed size linear array, where
P < nm.

Figure 32: Aircraft Hypotheses


Sorting on Reconfigurable Mesh The Reconfigurable
Mesh forms the CAAPP level of the Image Understanding
Architecture (IUA) [Weems et al., 1989]. An
optimal sorting algorithm on the Reconfigurable Mesh
is derived in [Jang and Prasanna, 1991]. The algorithm
sorts n numbers in constant time using n x n processors.
The best known previous result uses $O(n \times n \log^2 n)$ processors.
Our algorithm satisfies the $AT^2$ lower bound
of $\Omega(n^2)$ for sorting n numbers in the word model of
VLSI. A modification of the algorithm for area-time tradeoff
is shown to achieve the $AT^2$ lower bound over
$1 \le T \le \sqrt{n}$. Previously, the lower bound was achieved
over $\log n \le T \le \sqrt{n}$. Notice that, using sort as a basic
procedure, a number of low- and intermediate-level Image
Understanding problems can be solved on the IUA.
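
To give some intuition for why an n x n array of processors is a natural size for sorting n numbers, the sketch below simulates enumeration (rank) sort sequentially: conceptual cell (i, j) performs one comparison, and the row sums give each element's final position. This is not the algorithm of [Jang and Prasanna, 1991]; how the counts are combined in constant time on a reconfigurable mesh is the substance of that paper and is not modelled here. The function name is hypothetical.

```python
def enumeration_sort(values):
    """Sort by ranking: one comparison per cell of a conceptual n x n grid."""
    n = len(values)
    # Cell (i, j) is 1 when values[j] must come before values[i].
    # Ties are broken by index so that every element gets a distinct rank.
    grid = [
        [1 if (values[j] < values[i] or (values[j] == values[i] and j < i)) else 0
         for j in range(n)]
        for i in range(n)
    ]
    ranks = [sum(row) for row in grid]  # row i sums to the rank of values[i]
    result = [None] * n
    for i, rank in enumerate(ranks):
        result[rank] = values[i]
    return result


sorted_values = enumeration_sort([3, 1, 2, 3, 0])
```

All n x n comparisons are independent of one another, which is what makes this formulation attractive for a processor array; the sequential simulation above takes time proportional to n squared.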

Graph  Algorithms  Many  of  the  intermediate-level 
computer  vision  tasks  can  be  posed  as  graph  problems. 
Particularly,  digitized  picture  graphs  (DPGs)  of  two  and 
three  dimensions  are  of  primary  importance  due  to  their 
natural correspondence with black/white images. We introduce
a notion of partitionability of graphs and show
that DPGs (of any fixed dimension) are partitionable
[Rao and Prasanna, 1991]. This partitionability property
helps in constructing efficient parallel algorithms for
many problems on digitized picture graphs. We show
that our techniques can be efficiently simulated on a
P x P fixed-size mesh-connected computer, 1 < P < n.
Unlike other approaches [Miller and Stout, 1985], our
algorithms, because of the partitionability idea, easily
extend  to  problems  in  higher  dimensions. 

VLSI  Architectures  for  Image  Transforms  and 
Vector  Quantization  We  have  studied  VLSI  archi¬ 
tectures  for  various  image  transforms  and  vector  quan¬ 
tization  techniques.  Two  linear  array  architectures  have 
been  proposed  for  computing  the  arithmetic  Fourier 
transform and image compression using vector quantization
[Park and Prasanna, 1991a; Park and Prasanna,
1991b]. These architectures have modular PEs and can
support real-time processing. The designs can operate
with fewer PEs than the input size. The proposed
designs require fixed I/O bandwidth with the host.

Parallelization  of  Symbolic  Techniques  in  Vision 
There is relatively little work done in parallelizing high-level
vision algorithms. Such algorithms are usually
symbolic  in  nature  and  the  processing  is  not  entirely 
local.  We  believe  that  when  dealing  with  such  com¬ 
plex  algorithms,  the  parallel  implementation  must  be 
concerned  with  the  following  four  characteristics:  algo¬ 
rithm  speedup,  processor  efficiency,  system  complexity, 
and  programmer  burden. 

Most  research  in  parallel  processing  has  been  con¬ 
cerned  solely  with  the  first  two  characteristics.  We 
have  been  pursuing  an  alternative  that  achieves  a  bet¬ 
ter  balance  between  the  desired  characteristics.  In  our 
approach  we  classify  algorithms,  in  terms  of  opera¬ 
tions,  data  dependencies,  data  movements,  and  algo¬ 
rithm  characteristics,  and  then  specify  a  parallel  pro¬ 
cessor  architecture  that  is  well  suited  to  those  character¬ 
istics  [Reinhart,  1991]. 

We  have  applied  this  methodology  to  a  number  of 
mid  and  high  level  vision  algorithms.  Our  first  expe¬ 
rience  was  with  an  algorithm  for  image  matching  via  re¬ 
laxation  labelling  with  symbolic  objects  and  geometric 
constraints [Medioni and Nevatia, 1984]. Our analysis indicated
the use of an MIMD architecture that comprises
powerful processing elements programmed with
the loosely synchronous protocol. A suitable interconnect
topology is one of logarithmic diameter. Two implementations
were developed, one using binary tree and
the  other  using  hypercube  connections.  This  scheme  ex¬ 
ploits  the  coarse  grain  parallelism  within  the  algorithm. 
Further  analysis  shows  that  equipping  each  PE  with  a 
tightly  coupled  vector  processor  would  exploit  the  fine 
grain  parallelism  within  the  algorithm.  This  architecture 
achieves  high  degrees  of  speedup  and  efficiency  while  us¬ 
ing  software  that  is  nearly  identical  to  that  of  the  serial 
implementation,  thus  system  complexity  and  program¬ 
mer  burden  are  minimized.  Details  of  this  study  were 
previously  reported  in  [Reinhart  and  Nevatia,  1990]. 

In  more  recent  work,  we  have  studied  an  algorithm 
for  object  recognition  that  uses  graph  matching.  Our 
analysis  again  indicates  that  an  MIMD  architecture  with 
powerful  processing  elements  is  suited  to  this  problem. 
The  PEs  are  connected  by  a  hypercube  topology.  Again, 
we achieve significant algorithm speedup and processor
efficiency  while  using  software  that  is  nearly  identical  to 
that  of  the  serial  implementation. 

Lastly,  we  have  studied  the  mid-level  operations  of 
linear  feature  extraction  and  perceptual  organization. 
These  operations  may  appear  to  be  simple  and  repet¬ 
itive,  and  thus  well  suited  to  SIMD  implementations. 
However,  this  is  not  the  case.  Typically,  parallel  im¬ 
plementations  only  focus  on  the  study  of  a  specific  al¬ 
gorithm.  They  assume  that  the  input  is  given  in  the 
desired  form  and  the  output  is  produced  in  some  form. 
In a system (or sub-system) that comprises a number
of processing steps, conversion of the output of one stage
to the form needed by another can itself be a major step, possibly requiring
serial implementation. A simple example is that of linear
feature extraction, where finding edges and then their
neighbors that would form curves is an iconic process
that is easily implemented on a SIMD machine. However,
this is different from actually producing a list of
curves, each curve given by a list of points forming it,
in order, and possibly a linear approximation to it as
well. This is the structure needed for subsequent use of
the linear feature processing. Our proposed implementation
is described in detail in another paper in these
proceedings  [Reinhart  and  Nevatia,  1992]. 

In conclusion, our research has shown that the complex
operations and data movements required by mid
and  high  level  vision  can  be  performed  efficiently  if  care 
is  taken  in  specifying  the  parallel  processor  architecture. 
Implementations that achieve high degrees of algorithm
speedup and processor efficiency can be attained without
increasing system complexity or programmer burden.
Such architectures can be realized utilizing heterogeneous
designs or via reconfigurable architectures given
an  efficient  reconfiguration  procedure.  We  have  certainly 
not  studied  all  the  algorithms  used  in  vision,  but  believe 
that  our  choice  of  selected  algorithms  covers  enough  of 
a  span  to  indicate  that  it  is  fruitful  to  further  pursue 
this  approach.  We  also  believe  that  the  next  step  is  to 
investigate  the  parallel  implementation  of  complete,  het¬ 
erogeneous  vision  systems  that  comprise  low,  mid,  and 
high-level algorithms.

Abstract

The image understanding program at SRI International
is a broad effort spanning the entire range of machine
vision research. In this report we describe our
progress in two domains: the first is concerned with
modeling the earth’s surface from aerial imaging sensors;
the second is concerned with ground-level vision
and vision-based land navigation. In particular, we describe
progress in stereo compilation and automated terrain
modeling from aerial imagery; in interactive scene
modeling  and  scene  generation;  in  automatic  image  seg¬ 
mentation  and  delineation  of  man-made  objects;  in  de¬ 
tecting  and  tracking  moving  objects;  and  in  using  knowl¬ 
edge  beyond  shape  and  immediate  appearance  to  rec¬ 
ognize  objects  in  natural  scenes  and  other  complex  do¬ 
mains. 

1  Introduction 

The  overall  goal  of  Image  Understanding  research  at 
SRI  International  is  to  obtain  solutions  to  fundamen¬ 
tal  problems  in  computer  vision  that  are  essential  in  al¬ 
lowing  machines  to  model,  manipulate,  and  understand 
their  environment  from  sensor-acquired  data  and  stored 
knowledge. 

In this report we describe progress in two domains,
aerial and ground-based vision.* The first is concerned
with modeling the earth’s surface from photographs
taken  from  aircraft  and  satellites;  the  second  is  con¬ 
cerned  with  modeling  a  natural  environment  in  real  time 
from  data  taken  by  a  robotic  device  moving  through,  and 
interacting  with,  this  environment. 

In  the  discussion  of  the  first  domain  we  describe  our 
progress  in  developing  stereo  techniques  for  building  ter¬ 
rain  models  from  aerial  imagery;  interactive  techniques 
for  building  three-dimensional  models  of  man-made  and 
cultural  objects,  and  a  new  automatic  technique  for  seg¬ 
menting  aerial  images  into  coherent  regions  and  for  de¬ 
tecting  and  delineating  man-made  objects. 

*Supported by various Defense Advanced Research Projects
Agency  contracts. 


In the discussion of ground-based vision we describe
progress in developing techniques for building object descriptions
that evolve gradually over time as more data
are obtained, motion analysis techniques for detecting
and tracking moving objects in data taken by moving
sensors, and a new method for using contextual information
to recognize natural objects, such as trees and
bushes, in outdoor scenes.

An important theme in much of our current work
is an emphasis on computational performance, especially
through the development of algorithms capable of
exploiting the new parallel machine architectures now
available (e.g., the Connection Machine).*

2  Stereo  Research  and  Stereo 
Based  Modeling 

Stereo  reconstruction  is  a  critical  task  in  machine 
vision,  with  applications  to  robotics  and  cartogra¬ 
phy,  that  has  received  a  great  deal  of  attention 
in the image understanding community ([Barnard90],
[Barnard&Fischler90]).  Its  importance  goes  beyond  the 
obvious  application  to  constructing  geometric  models: 
understanding  scene  geometry  is  necessary  for  effec¬ 
tive  feature  extraction  and  other  scene  analysis  tasks. 
While considerable success has been achieved in important
parts of the problem, there is no complete stereo-modeling
system that can perform reliably in a wide variety
of scene domains.

Historically, the computational modeling of stereo vision
has been driven by a number of diverse motivations.
The  practical  applications  of  automated  stereo  are  so 
important,  especially  in  cartography  and  robotics,  that 
many  engineering-oriented  approaches  have  been  tried. 
These  often  use  “correlation”  techniques:  patches  of  in¬ 
tensities  in  one  image  are  searched  for  in  the  other  image 
by  maximizing  a  measure  of  correlation  or  minimizing  a 
measure  of  intensity  mismatch.  The  other  motivations, 
such  as  the  desire  to  model  biological  stereo,  involve  a 
variety of techniques. Some are feature-based: discrete
local features (usually edges) are matched across images;
others use an approach in which a dense disparity map
is the state variable of a system; stereo matching is then
formulated as an optimization problem: find the best
disparity map by maximizing an objective function that
measures the “quality” of the map.

*Use of a Connection Machine was provided by DARPA.
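
The "correlation" style of matching mentioned above can be sketched for a single pair of scanlines. The example assumes rectified images (matches lie on the same row), uses a sum of squared differences as the measure of intensity mismatch, and adopts the convention that a pixel at column x in the left row appears at column x - d in the right row. The function names, window size, and disparity range are hypothetical; this is a toy illustration, not STEREOSYS or CYCLOPS.

```python
def ssd(a, b):
    """Sum of squared intensity differences between two equal-length patches."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def match_scanline(left, right, half_window=2, max_disparity=8):
    """For each left pixel, pick the disparity whose right-image patch
    minimizes the SSD mismatch. Border pixels are left as None."""
    n = len(left)
    disparities = [None] * n
    for x in range(half_window, n - half_window):
        patch = left[x - half_window: x + half_window + 1]
        best_d, best_cost = None, float("inf")
        for d in range(max_disparity + 1):
            xr = x - d
            if xr - half_window < 0:
                break  # candidate window would fall off the image
            candidate = right[xr - half_window: xr + half_window + 1]
            cost = ssd(patch, candidate)
            if cost < best_cost:
                best_d, best_cost = d, cost
        disparities[x] = best_d
    return disparities


right_row = [0, 0, 5, 9, 7, 3, 0, 0, 0, 0, 0, 0]
left_row = [0, 0, 0, 0, 0, 5, 9, 7, 3, 0, 0, 0]  # same profile shifted by 3
disparity_row = match_scanline(left_row, right_row)
```

In flat, textureless stretches of the row every candidate has the same cost, so the result there is arbitrary; this is the kind of ambiguity that the optimization-based formulations address by scoring the whole disparity map at once.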

Our  research  strategy  in  this  task  is  to  develop  new 
techniques  for  the  key  steps  in  the  stereo  process,  such 
as matching and interpolation, and, in parallel, to integrate
these new ideas with existing techniques in the
context of an operational system. As part of this process
SRI has implemented [Hannah85] and evaluated [Hannah88,
Hannah89] a complete high-performance stereo
system, STEREOSYS, that uses a combination of the
correlation and feature-matching approaches. In a test
of existing stereo systems on 12 pairs of digital images,
conducted by the International Society of Photogrammetry,
STEREOSYS was able to successfully process more
of the images than any other system (11 out of the 12
pairs); while no formal ranking of the test results will
be published, it appears that this system placed first (or
very near the top) in the competition. While STEREOSYS
was originally targeted for application to the domain
of aerial imagery, it has also performed successfully
in  extensive  tests  involving  ground-level  outdoor  imagery 
obtained  from  a  camera  mounted  on  a  moving  platform. 

Another  system  we  have  developed  for  stereo  match¬ 
ing is CYCLOPS [Barnard90]. This work began as
a  stochastic-optimization  approach  to  stereo  matching, 
but  has  evolved  into  a  more  complete  system  for  car¬ 
tographic  terrain  modeling,  including  software  modules 
for  camera  modeling,  epipolar  resampling,  the  genera¬ 
tion  of  regular-grid  elevation  maps,  ortho-images,  con¬ 
tour  plots,  and  synthetic  perspective  views,  in  addition 
to  the  central  task  of  image  matching.  One  of  the  goals  of 
this  work  has  been  to  develop  efficient  stereo-processing 
methods  for  massively  parallel  SIMD  architectures.  The 
CYCLOPS  system  is  implemented  on  the  Connection 
Machine.  The  current  implementation  is  capable  of  pro¬ 
ducing  a  dense  terrain  model  (depth  for  every  pixel)  for 
a typical pair of 1024x1024 aerial stereo images in about
eight  minutes,  using  a  Connection  Machine  with  4096 
processors.  First,  camera  model  information  is  used  to 
produce  corrected  images  with  only  horizontal  parallax. 
The  corrected  images  are  then  matched  with  a  multi¬ 
grid  optimization  algorithm.  Essentially,  the  matching 
algorithm  is  a  stochastic  regularization  method  that  tries 
to  find  the  flattest  dense  disparity  map  that  matches  the 
photometry with least error. It does so by iterating a microcanonical
version of simulated annealing across several
levels of a resolution pyramid, using the results from
the coarser levels to initialize the optimization search at
the finer levels. After the corrected images are matched,
the disparity measurements are converted into a dense
but irregular mesh of depth measurements, which is then
resampled into a grid of elevations with respect to regularly
spaced ground coordinates.

In some of our most recent work, we have integrated
information produced by shading and stereo in the context
of a shape-from-shading algorithm that can use
stereo depth maps to provide initial and boundary conditions
[Leclerc&Bobick91, these proceedings]. We note
that shading and stereo are complementary techniques
for two reasons. First, relative depth discrimination from
stereo decreases with absolute depth, whereas orientation
discrimination, as determined by shading, does not.
Thus, for example, the geometric details of tree canopies
in aerial images may be lost to stereo, but can be recovered
from the shading information. Second, regions in
the image where stereo fails because of lack of interesting
visual events, such as the rolling hills of desert areas, are
good candidate regions for shading analysis. Our implemented
system was applied to a number of synthetic and
real images. The algorithm was able to recover surface
scratches and dents as little as 1mm in depth at a distance
of 2m, which is well below the competence of stereo
analysis for the given image resolution. We are currently
attempting to use confidence measures provided by the
stereo  system  to  control  the  application  of  the  shading 
analysis  and  to  adjust  the  balance  between  the  shading 
and  stereo  solutions. 
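
The first of the two reasons given above, that depth discrimination from stereo degrades with distance, can be made concrete with the standard relation for a parallel-axis stereo pair. This relation is assumed here for illustration and is not stated in the report; the symbols f (focal length), B (baseline), d (disparity), and Z (depth) are introduced only for this sketch.

```latex
Z = \frac{fB}{d}
\quad\Longrightarrow\quad
\left|\delta Z\right| \approx \frac{Z^{2}}{fB}\,\left|\delta d\right|,
\qquad
\frac{\left|\delta Z\right|}{Z} \approx \frac{Z}{fB}\,\left|\delta d\right| .
```

Under these assumptions, a fixed disparity error produces a depth error that grows with the square of the depth, so small relief on distant surfaces falls below what disparity measurements can resolve.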

Given  a  set  of  relatively  dense  depth  maps  of  a  given 
area (as might be provided from different views by stereo,
shading, or a laser range-finder), one must still integrate
the information into a consistent 3-D representation of
the imaged surfaces. Fua and Sander [these proceedings]
describe an integration approach based on fitting local
quadric surfaces to the data. Their algorithm consists of
four sequential steps: smoothing of the points by iterative
local surface fitting; resampling the smoothed points
onto  a  regular  grid;  computation  of  an  adjacency  graph 
of  the  points  with  clustering  of  the  connected  compo¬ 
nents;  and  triangulation  of  the  clusters.  This  results  in 
a  representation  of  the  depth  data  in  terms  of  locally  con¬ 
sistent  quadric  surfaces.  The  algorithm  was  applied  to  a 
number  of  complex  3-D  scenes,  integrating  as  many  as 
12  depth  maps  into  a  single  global  representation,  with 
excellent  results. 

While the stereo problem remains a key focus of our research
program, additional effort is now being devoted to
developing  an  understanding  of  how  knowledge  of  scene 
depth  can  be  effectively  used  in  the  scene-partitioning 
and object-recognition tasks.

3 Interactive Techniques for
Scene  Modeling:  A 
Cartographic  Modeling 
Environment 

Manual  photointerpretation  is  a  difficult  and  time- 
consuming  step  in  the  compilation  of  cartographic  in¬ 
formation.  However,  fully  automated  techniques  for  this 
purpose  are  currently  incapable  of  matching  the  human’s 
ability  to  employ  background  knowledge,  common  sense, 
and reasoning in the image-interpretation task. Near-term
solutions to computer-based cartography must include
both interactive extraction techniques and new
ways of using computer technology to provide the end-user
with useful information in the form of both image
and map-like interactive computer displays.

In order to support research in semiautomated and
automated  computer-based  cartography,  we  have  de¬ 
veloped  the  SRI  Cartographic  Modeling  Environment 
(CME).  In  the  context  of  an  interactive  workstation- 
based  system,  the  user  can  manipulate  multiple  images; 
camera  models;  digital  terrain  elevation  data;  point,  line, 
and area cartographic features; and a wide assortment
of  three-dimensional  objects.  Interactive  capabilities  in¬ 
clude  free-hand  feature  entry,  feature  editing  in  the  con¬ 
text  of  task-based  constraints,  and  adjustment  of  the 
scene  viewpoint.  Synthetic  views  of  a  scene  from  arbi¬ 
trary  viewpoints  may  be  constructed  using  terrain  and 
feature  models  in  combination  with  texture  maps  ac¬ 
quired  from  aerial  imagery.  This  ability  to  provide  an 
end-user  with  an  interactively  controlled  scene-viewing 
capability  could  eliminate  the  need  to  produce  hard¬ 
copy  maps  in  many  application  contexts.  Additional  ap¬ 
plications  include  high-resolution  cartographic  compila¬ 
tion,  direct  utilization  of  cartographic  products  in  digital 
form,  and  generation  of  mission-planning  and  training 
scenarios. 

Recent  work  has  focused  on  porting  the  CME  to  a 
UNIX/C  platform  (from  its  current  LISP-machine  im¬ 
plementation)  in  order  to  support  technology  transfer 
goals.  In  particular,  the  CME  is  being  reimplemented 
on  a  Sun  Microsystems  SparcStation  II  using  X  Win¬ 
dows.  The  software  base  is  a  combination  of  Common 
Lisp with CLOS and C, where C is used to increase performance
of selected time-critical components. In addition,
an interface is being developed to allow C application
programmers to access the CME data structures and
functions  that  are  implemented  in  LISP.  The  intent  is  to 
provide  a  single  environment  that  provides  the  full  range 
of  CME  functionality  to  both  LISP  and  C  programmers. 
Other work involves developing more flexible object representations,
irregular terrain grids, and improved interfaces
to other systems such as the SRI-developed Core
Knowledge  System.  One  especially  important  technical 
improvement  involves  sensor  geometry  extensions. 


The SRI Cartographic Modeling Environment uses
sensor geometry models in two principal ways: 1) projecting
the 3D world coordinates into 2D sensor (pixel)
coordinates, and 2) computing the intersection of a 3D
ray (corresponding to a sensor pixel) with a terrain
model. The basic CME system currently supports only
central (perspective) projection and orthographic projection.
In central projection, each point in 3-space is projected
onto the camera sensor plane along a ray passing
through a common point, the projection center. We are
currently implementing a generic capability for dealing
with non-central-projection sensor geometries. When accomplished,
essential operations now supported for central
projection imagery would also be supported for other
types of (orbiting) sensors. These operations include:

1. Display of three-dimensional feature models that
are cartographically registered to non-central-projection
imagery.

2. Terrain rendering, using data acquired with any sensor
geometry, to a format simulating any other sensor
geometry. An example would be mapping non-central-projection
imagery onto a terrain model and
generating  a  simulated  image  showing  the  result 
viewed  with  a  central-projection  sensor. 
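
The two uses of a sensor geometry model described above, projecting a 3D point to sensor coordinates and intersecting a pixel ray with a terrain model, can be sketched for the central-projection case. This is a minimal illustration under stated assumptions, not the CME implementation: the projection function works in a camera frame whose Z axis is the optical axis and whose origin is the projection center, the terrain is an arbitrary height function over a world frame with Z up, and the intersection is found by fixed-step ray marching followed by bisection. All names and parameters are hypothetical.

```python
def project_central(point_cam, focal_length):
    """Central projection of a point given in camera coordinates
    (origin at the projection center, Z along the optical axis)."""
    x, y, z = point_cam
    if z <= 0:
        raise ValueError("point is behind the projection center")
    return (focal_length * x / z, focal_length * y / z)


def intersect_ray_with_terrain(origin, direction, height_at,
                               step=1.0, max_range=10000.0):
    """March along origin + t * direction (world frame, Z up) until the ray
    drops to or below the terrain surface, then refine by bisection.
    Returns the hit point, or None if no crossing is found within max_range."""
    def clearance(t):
        x = origin[0] + t * direction[0]
        y = origin[1] + t * direction[1]
        z = origin[2] + t * direction[2]
        return z - height_at(x, y)

    if clearance(0.0) < 0:
        return None  # ray starts below the terrain
    t_prev, t = 0.0, step
    while t <= max_range:
        if clearance(t) <= 0:
            lo, hi = t_prev, t
            for _ in range(40):  # bisection between last-above and first-below
                mid = (lo + hi) / 2
                if clearance(mid) > 0:
                    lo = mid
                else:
                    hi = mid
            t_hit = (lo + hi) / 2
            return tuple(o + t_hit * d for o, d in zip(origin, direction))
        t_prev, t = t, t + step
    return None


pixel = project_central((2.0, 4.0, 10.0), focal_length=5.0)
hit = intersect_ray_with_terrain((0.0, 0.0, 100.0), (1.0, 0.0, -1.0),
                                 height_at=lambda x, y: 10.0)
```

A non-central-projection sensor would replace `project_central` and the way rays are generated, while the terrain intersection routine could stay the same; that separation is one plausible reading of the "generic capability" described above, offered here only as an assumption.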

Our  earlier  work  in  this  overall  task  area  was  presented 
in  two  papers,  one  describing  basic  design  issues  for  this 
type  of  system  [Hanson, Pentland,&Quam87],  and  the 
other  providing  an  overview  of  our  original  plans  for  the 
implementation [Hanson&Quam88]. A current description
appears as a paper in these proceedings [Mundy et al.].

4  Detection  and  Delineation  of 
Objects  in  Aerial  Imagery 

The  detection,  delineation,  and  recognition  of  any  sig¬ 
nificantly  broad  class  of  objects  (e.g.,  buildings,  airports, 
cultivated  land)  in  aerial  imagery  has  proven  to  be  an  ex¬ 
tremely  difficult  problem.  In  fact,  a  nominal  component 
in  the  solution  of  this  problem,  image  partitioning,  is 
considered  to  be  one  of  the  most  refractory  problems  in 
machine  vision. 

We have formulated an optimization-based approach,
applicable  both  to  image  partitioning  and  to  subsequent 
steps  in  the  scene  analysis  process,  that  involves  finding 
the  “best”  description  of  the  image  in  terms  of  some 
specified  descriptive  language. 

In  the  case  of  image  partitioning  [Leclerc88, 
Leclerc89a,  Leclerc89b,  Leclerc89c,  Leclerc89d],  we  em¬ 
ploy  a  language  that  describes  the  image  in  terms  of  re¬ 
gions  having  a  low-order  polynomial  intensity  variation 
plus  white  noise;  region  boundaries  are  described  by  a 
differential  chain  code.  The  best  description  is  defined  as 
the simplest one (in the sense of least encoding length)
that is also stable (i.e., minor perturbations in the viewing
conditions should not alter the description). This
best  description  is  found  using  a  spatially  local  and  par¬ 
allel  optimization  algorithm  that  has  been  implemented 
on  the  Connection  Machine. 
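
To illustrate why a differential chain code suits a least-encoding-length criterion, the sketch below encodes a pixel boundary with the common 8-direction Freeman convention and then takes successive differences modulo 8. The Freeman numbering and the function names are assumptions made for this example; the report does not specify the exact encoding used in the cited work.

```python
# Freeman 8-direction codes for unit steps (dx, dy); assumed convention.
DIRECTIONS = {
    (1, 0): 0, (1, 1): 1, (0, 1): 2, (-1, 1): 3,
    (-1, 0): 4, (-1, -1): 5, (0, -1): 6, (1, -1): 7,
}


def chain_code(points):
    """Absolute direction code for each step along an 8-connected path."""
    return [DIRECTIONS[(x1 - x0, y1 - y0)]
            for (x0, y0), (x1, y1) in zip(points, points[1:])]


def differential_chain_code(codes):
    """Change of direction between consecutive steps, modulo 8."""
    return [(b - a) % 8 for a, b in zip(codes, codes[1:])]


path = [(0, 0), (1, 0), (2, 0), (2, 1), (2, 2), (1, 2)]
absolute = chain_code(path)
differential = differential_chain_code(absolute)
```

Straight runs of boundary produce zeros in the differential code regardless of their orientation, so a boundary made of long straight pieces is dominated by one symbol and can be encoded in few bits, which is consistent with preferring the simplest description.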

A  desirable  processing  step  after  image  segmenta¬ 
tion  is  to  further  simplify  the  resulting  chain-code  and 
polynomial description by: (1) describing the boundaries
using straight lines and other more global models
[Leclerc89c], and (2) grouping nonadjacent regions
whose intensity variation can be more simply described
by a single polynomial [Leclerc90a, Leclerc90b (these proceedings)].

In  situations  where  the  required  image  description 
must  extend  beyond  that  of  a  delineation  of  coherent 
regions,  we  require  an  extended  vocabulary  relevant  to 
the  semantics  of  the  given  task.  Fua  and  Leclerc  deal 
with the problem of boundary/shape detection given a
rough estimate of where the boundary is located and
a set of photometric (intensity-gradient) and geometric
(shape-constraint) models for a given class of objects
[Fua&Leclerc88, Fua&Leclerc90]. They define an energy
(objective) function that assumes a minimal value
when  the  models  are  exactly  satisfied.  An  initial  es¬ 
timate  of  the  shape  and  location  of  the  curve  is  used 
as  the  starting  point  for  finding  a  local  minimum  of 
the  energy  function  by  embedding  this  curve  in  a  vis¬ 
cous  medium  and  solving  the  dynamic  equations.  This 
energy-minimization technique, which evolved from a
less-efficient  gradient-descent  approach  [Leclerc&Fua87], 
has  been  implemented  on  the  Connection  Machine.  It 
has  been  applied  to  straight-line  boundary  models  and 
to  more  complex  models  that  include  constraints  on 
smoothness,  parallelism,  and  rectilinearity.  In  an  inter¬ 
active  mode,  the  user  supplies  an  initial  estimate  of  the 
boundary  of  some  object  (which  may  be  quite  complex, 
like  the  outline  of  an  airplane)  and  then,  if  need  be,  cor¬ 
rects  the  optimized  curve  by  applying  forces  to  the  curve 
or by changing one of a few optimization/model parameters.

Automatic  recognition  and  delineation  of  important 
cartographic  objects,  such  as  man-made  structures,  from 
aerial imagery has been addressed [Fua&Hanson89a,
Fua&Hanson89b]. The basis for the approach is a theoretical
formulation of object delineation as an optimization
problem; practical objective measures are introduced
that discriminate among a multitude of object
candidates  using  a  model  language  and  the  minimal- 
encoding  principle.  This  approach  is  then  applied  in 
two  distinct  ways  to  the  extraction  of  buildings  from 
aerial  imagery:  the  first  is  an  operator-guided  procedure 
that  uses  a  massively  parallel  Connection  Machine  im¬ 
plementation  of  the  objective  measure  [Fiia89]  to  dis¬ 
cover  a  building  in  real  time  given  only  a  crude  sketch. 
The  second  is  an  automated  hypothesis  generator  that 
employs  the  objective  measure  during  various  steps  in 


the  hypothesis-generation  procedure,  as  well  as  in  the  fi¬ 
nal  stages  of  candidate  selection;  both  serial  and  i)ai  allel 
(Connection  Machine)  approaches  are  implemented. 

We believe that the above techniques represent significant
advances in the state of the art in their respective
areas of image partitioning and delineation. The
implemented systems based on these techniques have been
able to produce excellent results in complex situations
where existing (typically local) approaches fail. Future
work will emphasize the incorporation of more complex
models, three-dimensional contextual information, and
efficient parallel implementations.

An important issue in applications involving interactive
feature extraction is the language available to permit
effective communication between the human operator
and the machine. Some of the techniques discussed
above are useful not only in the analysis of imagery,
but also as components in the interactive exchange of
information. For example, the partitioning techniques
we described above are needed to permit the human to
point at something in an image and have the machine
understand which image region is being discussed. A
second needed capability is an effective way of allowing
the machine to properly interpret the 3-D structure of
an object depicted in a 2-D sketch provided by the human
operator. In a paper included in these proceedings
[Fischler&Leclerc], we describe recent work that can
provide some of this desired functionality (the method
discussed employs a simple optimization technique to recover
the 3-D wire frame corresponding to a 2-D line
drawing).

5  Object  Recognition  in  the 
Natural  Outdoor  World 

The  natural  outdoor  environment  poses  significant  ob¬ 
stacles  to  the  design  and  successful  integration  of  the 
interpretation,  planning,  navigational,  and  control  func¬ 
tions  of  a  robotic  device  supported  by  a  general-purpose 
vision  system.  Many  of  these  functions  cannot  yet  be 
performed  at  a  level  of  competence  and  reliability  neces¬ 
sary  to  satisfy  the  needs  of  an  autonomous  robot.  Part  of 
the  problem  lies  in  the  inability  of  available  techniques, 
especially those involved in sensory interpretation, to use
contextual  information  and  stored  knowledge  in  recog¬ 
nizing  objects  and  environmental  features.  One  of  our 
goals  in  this  effort  has  been  to  design  a  core  knowledge 
structure  (CKS)  that  can  support  a  new  generation  of 
knowledge-based  generic  vision  systems.  A  second  goal 
is  to  actually  construct  a  vision  system,  which  employs 
the  CKS,  and  has  the  competence  to  recognize  objects 
appearing  in  ground  level  imagery  of  natural  outdoor 
scenes.

5.1 Condor: A Contextual Vision
System

Much  of  the  progress  that  has  been  made  to  date 
in  machine  vision  has  been  based,  almost  exclusively, 
on  shape  comparison  and  classification  employing  lo¬ 
cally  measurable  attributes  of  the  imaged  objects  (e.g., 
color  and  texture).  Natural  objects  viewed  under  real¬ 
istic  conditions  do  not  have  uniform  shapes  that  can  be 
matched  against  stored  prototypes,  and  their  local  sur¬ 
face  properties  are  too  variable  to  be  unique  determin¬ 
ers  of  identity.  The  standard  machine  vision  recognition 
paradigms  fail  to  provide  a  means  for  reliably  recognizing 
any  of  the  object  classes  common  to  the  natural  outdoor 
world (e.g., trees, bushes, rocks, and rivers). In this effort
[Strat&Fischler90], we have devised a new paradigm
which  explicitly  invokes  context  and  stored  knowledge  to 
control  the  complexity  of  the  decision-making  processes 
involved  in  correctly  identifying  natural  objects  and  de¬ 
scribing  natural  scenes. 

The conceptual architecture of the system we describe,
called Condor (for context-driven object recognition), is
much like that of a production system; there are many
computational  processes  interacting  through  a  shared 
data  structure.  Interpretation  of  an  image  involves  the 
following  four  process  types. 

•  Candidate  generation  (hypothesis  generation) 

•  Candidate  comparison  (hypothesis  evaluation) 

•  Clique  formation  (grouping  mutually  consistent  hy¬ 
potheses) 

•  Clique  selection  (selection  of  a  “best”  description) 

Each  process  acts  like  a  daemon,  watching  over  the 
knowledge  base  and  invoking  itself  when  its  contextual 
requirements  are  satisfied.  The  input  to  the  system  is  an 
image  or  set  of  images  that  may  include  intensity,  range, 
color,  or  other  data  modalities.  The  primary  output  of 
the  system  is  a  labeled  3D  model  of  the  scene.  The 
labels  included  in  the  output  description  denote  object 
classes  that  the  system  has  been  tasked  to  recognize, 
plus  others  from  the  recognition  vocabulary  that  happen 
to  be  found  useful  during  the  recognition  process.  An 
object  class  is  a  category  of  scene  features  such  as  sky, 
ground,  geometric-horizon,  etc. 

A  central  component  of  the  architecture  is  a  special- 
purpose  knowledge/database  used  for  storing  and  pro¬ 
viding  access  to  knowledge  about  the  visual  world,  as 
well  as  tentative  conclusions  derived  during  operation  of 
the  system.  In  Condor,  these  capabilities  are  provided 
by the Core Knowledge Structure [Strat&Smith88].

Visual  interpretation  knowledge  is  encoded  in  context 
sets,  which  serve  as  the  uniform  knowledge  representa¬ 
tion  scheme  used  throughout  the  system.  The  invoca¬ 
tion  of  all  processing  operations  in  Condor  is  governed 
by context through the use of various types of context
sets: an action is initiated only when one or more of its
controlling  context  sets  is  satisfied.  Thus,  the  actual  se¬ 
quence  of  computations,  and  the  labeling  decisions  that 
are  made,  are  dictated  by  contextual  information  (stored 
in  the  Core  Knowledge  Structure),  by  the  computational 
state of the system, and by the image data available for
interpretation. 
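
The rule that an action runs only when at least one of its controlling context sets is satisfied can be sketched as follows. This is a minimal illustration, not Condor's implementation: representing a context set as a list of predicates that must all hold, the dictionary-based scene state, and every name below are assumptions made for the example.

```python
def context_set_satisfied(context_set, state):
    """Assumed semantics: a context set holds when all of its predicates hold."""
    return all(predicate(state) for predicate in context_set)


def triggered_actions(rules, state):
    """Return the actions that have at least one satisfied controlling context set.

    `rules` maps an action name to a list of context sets.
    """
    return [
        action
        for action, context_sets in rules.items()
        if any(context_set_satisfied(cs, state) for cs in context_sets)
    ]


# Hypothetical predicates over a hypothetical scene state.
sky_is_clear = lambda s: s.get("sky") == "clear"
camera_is_level = lambda s: s.get("camera") == "level"
has_color = lambda s: s.get("color_image", False)

# Hypothetical operators, each guarded by one or more context sets.
RULES = {
    "find-sky-by-brightness": [[sky_is_clear, camera_is_level]],
    "find-foliage-by-hue": [[has_color], [sky_is_clear, has_color]],
}
```

With the state {"sky": "clear", "camera": "level", "color_image": False}, only the brightness-based sky operator is triggered. Adding a new operator means adding a new entry with its own context sets rather than extending a monolithic procedure.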

The customary approach to recognition in machine vision
is to design an analysis technique that is competent
in as many contexts as possible. In contrast to this tendency
toward large, monolithic procedures, the strategy
embodied in Condor is to make use of a large number
of relatively simple procedures. Each procedure is competent
only in some restricted context, but collectively,
these procedures offer the potential to recognize a feature
in a wide range of contexts. The key to making this
strategy work is to use contextual information to predict
which procedures are likely to yield desirable results, and
which are not.

Condor operates as follows: For each label in the active
recognition vocabulary, all candidate generation context
sets are evaluated. The operators associated with
those  that  are  satisfied  are  executed,  producing  candi¬ 
dates  for  each  class.  Candidate  comparison  context  sets 
that  are  satisfied  are  then  used  to  evaluate  each  candi¬ 
date  for  a  given  class,  and  if  all  such  evaluators  prefer 
one  candidate  over  another,  a  preference  ordering  is  es¬ 
tablished  between  them.  These  preference  relations  are 
assembled  to  form  partial  orders  over  the  candidates,  one 
partial  order  for  each  class.  Next,  a  search  for  mutually 
coherent  sets  of  candidates  is  conducted  by  incrementally 
building  cliques  of  consistent  candidates,  beginning  with 
empty  cliques.  A  candidate  is  nominated  for  inclusion 
into  a  clique  by  choosing  one  of  the  candidates  at  the 
top  of  one  of  the  partial  orders.  Consistency  determina¬ 
tion  context  sets  that  are  satisfied  are  used  to  test  the 
consistency  of  a  nominee  with  candidates  already  in  the 
clique.  A  consistent  nominee  is  added  to  the  clique;  an 
inconsistent  one  is  removed  from  further  consideration 
with  that  clique.  Further  candidates  are  added  to  the 
cliques until none remain. Additional cliques are generated
in a similar fashion as computational resources
permit.  Ultimately,  one  clique  is  selected  as  the  best  se¬ 
mantic  labeling  of  the  image  on  the  basis  of  the  portion 
of  the  image  that  is  explained  and  the  reliability  of  the 
operators  that  contributed  to  the  clique. 
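
The clique-building step described above can be sketched in a few lines. This is a simplified illustration under stated assumptions, not Condor's code: each class's partial order is replaced by a ranked list (a total order), candidates are (label, pixel set) pairs, consistency means that two regions share no pixels, and the selection score counts explained pixels only, ignoring operator reliability. All function names are hypothetical.

```python
def build_clique(ranked, consistent):
    """Greedily grow one clique of mutually consistent candidates.

    `ranked` maps a class label to its candidates, best first (a total
    order standing in for the partial order described in the text).
    """
    queues = {label: list(cands) for label, cands in ranked.items()}
    clique = []
    while any(queues.values()):
        for label in sorted(queues):
            if not queues[label]:
                continue
            nominee = queues[label].pop(0)  # top of this class's order
            if all(consistent(nominee, member) for member in clique):
                clique.append(nominee)
            # An inconsistent nominee is simply dropped for this clique.
    return clique


def regions_disjoint(a, b):
    """Assumed consistency test: two labeled regions may not share pixels."""
    return not (a[1] & b[1])


def explained_area(clique):
    """Assumed selection score: number of pixels the clique explains."""
    return len(set().union(*(region for _, region in clique))) if clique else 0
```

To generate additional cliques as resources permit, one could rerun build_clique with a different nomination order and keep the clique with the highest score.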

Condor has already successfully analyzed a large number
of photographs taken at an experimental site in the
foothills behind Stanford University. Additional images
have been acquired and will be processed to more fully
evaluate our approach. Based on our initial experiments,
and the unique architecture of our system, we are
highly optimistic about the ability of Condor to overcome
many of the limitations (with respect to object
recognition) inherent in the traditional machine vision
paradigms.

6 Object Modeling from
Multiple Images

Our  goal  in  this  research  effort  is  to  develop  automated 
methods  for  producing  a  labeled  three-dimensional  scene 
model from image sequences. We view the image-sequence
approach as an important way to avoid many
of the problems that hamper conventional stereo techniques
because it provides the machine with both “redundant”
information and new information about the
scene. The redundant information can be used to increase
the precision of the data and filter out artifacts;
the  new  information  can  be  used  for  such  things  as  fill¬ 
ing  in  model  information  along  occlusion  boundaries  and 
disambiguating  matches  in  the  midst  of  periodic  struc¬ 
tures. 

We  have  developed  three  techniques  for  building  three- 
dimensional  descriptions  from  multiple  images.  One  is 
a  range-based  technique  that  builds  scene  models  from 
a sequence of range images. Another is a motion analysis
technique that analyzes long sequences of intensity
images. And the third technique detects moving objects
from  a  moving  platform.  These  techniques  are  briefly 
described  below. 

6.1  Building  Descriptions  from 
Sequences  of  Range  Images 

Our  approach  for  analyzing  sequences  of  range  images 
is  to  provide  the  system  with  a  wide  variety  of  generic 
object and terrain representations and an ability to judge
the  appropriateness  of  these  representations  for  partic¬ 
ular  sets  of  data.  The  variety  of  representations  is  re¬ 
quired  for  two  reasons.  First,  it  is  needed  to  cover  the 
range  of  object  types  typically  found  in  outdoor  envi¬ 
ronments.  And  second,  it  is  needed  to  cover  the  range 
of  data  resolutions  obtained  by  a  robot  vehicle  exploring 
the  environment. 

In this approach to object modeling, an object’s description
typically goes through a sequence of distinct
representations as new data are gathered and processed.
One of these sequences might start with a crude blob
description of an initially detected object, include a
detailed structural model derived from a set of
high-resolution images, and end with a semantic label based
on the object’s description and the sensor system’s task.
This evolution in representations is guided by a structure
we refer to as “representation space”: a lattice of representations
that is traversed as new information about
an object becomes available. One of these representations
is associated with an object only after it has been
judged to be valid. We evaluate the validity of an object’s
description in terms of its temporal stability. We
define stability in a statistical sense augmented with a
set of explanations offering reasons for missing an object
or having parameters change. These explanations can
invoke many types of knowledge, including the physics
of the sensor, the performance of the segmentation procedure,
and the reliability of the matching technique. To
illustrate the power of these ideas we have implemented
a system, which we call TraX, that constructs and refines
models of outdoor objects detected in sequences of range
data gathered by an unmanned ground vehicle driving
cross-country [Bobick&Bolles89].

We  are  continuing  to  explore  the  idea  of  using  stability 
to  evaluate  the  reliability  of  representations.  In  addition, 
we  plan  to  develop  new  explanations  based  on  support 
and gravity and to explore ways to combine other types
of reliability criteria with that of stability.

6.2  Building  3-D  Descriptions  from 
Monocular  Intensity  Image 
Sequences 

We  have  developed  a  motion  analysis  technique, 
which  we  call  Epipolar-Plane  Image  (EPI)  Analysis 
[Bolles,Baker,&Marimont87]. It is based on considering
a dense sequence of images as forming a solid block
of data. Slices through this solid at appropriately chosen
angles intermix time and spatial data in such a way
as to simplify the partitioning problem. These slices
have more explicit structure than the conventional images
from which they were obtained. In the referenced
paper  we  demonstrated  the  feasibility  of  this  novel  tech¬ 
nique  for  building  structured,  three-dimensional  descrip¬ 
tions  of  the  world. 
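
To make the idea of slicing the spatiotemporal block concrete, the sketch below extracts one epipolar-plane image from a stack of frames and reads depth off the slope of a feature's track. It assumes the simplest configuration: a camera translating sideways at constant speed, so that each image row is an epipolar line and a static point traces a straight line in the slice. The depth relation (focal length times per-frame baseline divided by the slope) is the standard geometry for that configuration, not a formula taken from this text, and the function names and synthetic data are hypothetical.

```python
def epipolar_plane_image(frames, row):
    """Stack one scanline from every frame: result[t][x] = frames[t][row][x]."""
    return [list(frame[row]) for frame in frames]


def brightest_track(epi):
    """Column of the brightest pixel in each time slice of the EPI."""
    return [max(range(len(line)), key=line.__getitem__) for line in epi]


def track_slope(track):
    """Least-squares slope of the track in pixels per frame."""
    n = len(track)
    t_mean = (n - 1) / 2
    x_mean = sum(track) / n
    num = sum((t - t_mean) * (x - x_mean) for t, x in enumerate(track))
    den = sum((t - t_mean) ** 2 for t in range(n))
    return num / den


def depth_from_slope(slope, focal_px, baseline_per_frame):
    """Depth of a static point seen from a laterally translating camera."""
    return focal_px * baseline_per_frame / abs(slope)


def synthetic_sequence(n_frames=5, height=3, width=12, start=1, shift=2):
    """Hypothetical test data: one bright point drifting `shift` px per frame."""
    frames = []
    for t in range(n_frames):
        image = [[0] * width for _ in range(height)]
        image[1][start + shift * t] = 255
        frames.append(image)
    return frames
```

In the slice, nearby points produce steep tracks and distant points produce shallow ones, which is why straight-line structure in the slice is easier to partition than the original images.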

In  later  work  we  extended  this  technique  to  locate  sur¬ 
faces  in  the  spatiotemporal  solid  of  data,  instead  of  ana¬ 
lyzing  slices,  in  order  to  maintain  the  spatial  continuity 
of edges from one slice to the next [Baker&Bolles88].
This  surface-building  process  is  the  three-dimensional 
analogue  of  two-dimensional  contour  analysis.  We  have 
applied it to a wide range of data types and tasks, including
medical images such as computed axial tomography
(CAT) and magnetic resonance imaging (MRI)
data, visualization of higher dimensional (i.e., greater
than three-dimensional) functions, modeling of objects
over scale, and assessment in fracture mechanics.

We  have  also  implemented  a  version  of  EPI  analysis 
that  works  incrementally,  applying  a  Kalman  filter  to 
update  the  three-dimensional  description  of  the  world 
each time a new image is received [Baker&Bolles88].
As  a  result  of  these  changes  the  program  produces  ex¬ 
tended  three-dimensional  contours  instead  of  sets  of  iso¬ 
lated  points.  These  contours  evolve  over  time.  When  a 
contour  is  initially  detected,  its  location  is  only  coarsely 
estimated.  However,  as  it  is  tracked  through  several  im¬ 
ages,  its  shape  typically  changes  into  a  smooth  three- 
dimensional  curve  that  accurately  describes  the  corre¬ 
sponding  feature  in  the  world. 
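
The incremental refinement described above can be illustrated with the smallest possible Kalman filter: a single static scalar (for example, the depth of one tracked feature) updated once per new image. This is a sketch under assumed values, not the filter used in the cited work; the initial variance, the measurement variance, and the function names are all hypothetical.

```python
def kalman_update(estimate, variance, measurement, measurement_variance):
    """One measurement update for a static scalar state."""
    gain = variance / (variance + measurement_variance)
    new_estimate = estimate + gain * (measurement - estimate)
    new_variance = (1.0 - gain) * variance
    return new_estimate, new_variance


def track_feature(measurements, initial_estimate=0.0, initial_variance=100.0,
                  measurement_variance=1.0):
    """Fold in one measurement per new image; return (estimate, variance) history."""
    estimate, variance = initial_estimate, initial_variance
    history = []
    for z in measurements:
        estimate, variance = kalman_update(estimate, variance, z, measurement_variance)
        history.append((estimate, variance))
    return history
```

The large initial variance plays the role of the coarse initial location estimate; each additional image shrinks the variance, which mirrors how a contour settles into a smooth, accurate curve as it is tracked.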

Recently we have extended the EPI analysis technique
in two directions. The first is the modeling of
biological structures from tomographic data [Baker90].
The descriptive formalism we are developing models tissue
as two-dimensional manifolds in three-space. We
have used this type of model to demonstrate simple versions
of surgical simulation, kinematic modeling, and
kinematic analysis. In the second extension we are using
the temporal tracking mechanism in EPI analysis to
detect and track moving objects from moving sensors.
We have added evaluation routines that select key features
to be tracked on moving objects. We are currently
exploring techniques for constructing three-dimensional
descriptions of the tracked objects.

6.3  Detecting  Moving  Objects  from 
Moving  Sensors 

Building  upon  our  work  in  motion  vision  and  terrain 
modeling,  we  have  recently  begun  development  of  tech¬ 
niques  for  detecting  and  tracking  moving  objects  from  a 
moving  platform.  This  work  is  being  performed  jointly 
with  the  Machine  Vision  Group  at  the  David  Sarnoff 
Research  Center. 

Motion  (in  a  sequence  of  images)  provides  one  of  the 
strongest cues available about the presence of a possible
target  in  a  scene.  However,  when  a  sensor  is  moving,  ev¬ 
erything  in  the  image  is  moving.  Therefore,  detection  of 
possible  targets  requires  separating  the  motion  induced 
by  the  movement  of  the  sensor  from  the  motion  caused 
by  the  movement  of  the  target.  One  approach  to  this 
problem, which has been developed at Sarnoff and other
places,  is  to  model  the  “background”  image  flow  as  a 
simple  parametric  flow  field,  and  then  use  this  model  to 
eliminate  image  motion  consistent  with  that  flow.  Any 
motion  not  consistent  with  the  background  movement  is 
labelled  as  a  possible  moving  object.  Of  course,  such  an 
approach  fails  dramatically  when  the  simple  background 
assumption  is  violated  (e.g.,  when  the  terrain  contains 
many  ridges  and  valleys,  which  induce  a  wide  variety  of 
background image motion).
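
The background-model approach can be sketched with the simplest parametric flow field, a single constant translation. The choice of a constant model, the use of the per-component median as the estimator, the one-pixel threshold, and the function names are all assumptions made for this illustration; the text does not say which parametric model the cited systems use.

```python
from statistics import median


def fit_background_flow(flow):
    """Estimate a constant background flow as the per-component median."""
    return median(u for u, _ in flow), median(v for _, v in flow)


def flag_moving(flow, threshold=1.0):
    """Mark vectors whose residual from the background model exceeds `threshold` pixels."""
    bu, bv = fit_background_flow(flow)
    return [((u - bu) ** 2 + (v - bv) ** 2) ** 0.5 > threshold for u, v in flow]
```

When the scene contains ridges and valleys, no single translation fits the background, so many static points would exceed the threshold. That is the failure mode noted in the text.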

The approach we have taken is based on techniques
developed at SRI [Bolles,Baker,&Marimont87] and elsewhere
for constructing a three-dimensional model of a
scene from the motion field computed from an extended
sequence of monocular images. The basic assumption
underlying  these  techniques  is  that  the  world  is  static 
and  that  computations  can  be  integrated  over  multiple 
frames  to  generate  a  stable  and  accurate  model  of  the 
geometry  of  the  scene.  Our  strategy  is  to  apply  these 
techniques  and  to  examine  areas  in  the  image  where  the 
computed  geometry  is  not  stable.  The  idea  is  that  the 
instabilities  are  caused  by  objects  that  are  not  static  in 
the  environment. 

Of course, the computed scene geometry can be unstable
if the underlying motion field computation is not
stable. Such instabilities arise at occlusion boundaries,
such as the top of a ridge, where accretion and deletion
boundaries make a motion field computation undefined.
Thus, our method for detecting moving objects also detects
these occlusion boundaries. We are currently investigating
the characteristics of these different events
in the space-time history in order to be able to distinguish
between them.

As part of our research strategy we have tested our
algorithms on both simulated and real data. (We used
the Cartographic Modeling Environment to provide extensive
simulation data.) The advantage of simulated
data is that we know “ground truth” and therefore are
in a better position to judge the competence of the algorithms
than when we analyze real data. This strategy
has already paid off. Our initial experiments with simulated
data pointed out a serious weakness in displaying
warped images to demonstrate the results of optic flow
computations. At occlusion boundaries optic flow techniques
locate matches (and compute flow vectors) for
points that have similar greyscale values. This procedure
leads to stabilized intensity images when the flow
vectors are used to warp one image into another, but
the flow vectors are incorrect. We are now careful to use
other techniques for judging the correctness of computed
flow vectors.


IMAGE UNDERSTANDING: INTELLIGENT SYSTEMS


Thomas  O.  Binford 


Abstract 

We  have  achieved  significant  results  toward 
recognition  of  a  very  large  class  of  complex  ob¬ 
jects  in  monocular  images  using  methods  that 
are  rigorous  and  generalizable  with  practical 
impact.  Inverse  generalized  transformational 
invariance  (GTI)  uses  the  original  definition 
of  generalized  cylinders  to  infer  the  shape  of 
parts  of  surfaces  by  an  inverse  process.  In¬ 
verse  GTI  generates  very  wide  applicability 
for  methods  based  on  invariants  and  quasi¬ 
invariants. 


Introduction 

We  have  achieved  significant  steps  toward  structural 
recognition  of  a  very  large  class  of  complex  objects  in 
monocular  images  using  methods  that  are  rigorous  and 
generalizable.  We  believe  that  these  results  are  im¬ 
portant  for  practical  applications  in  surveillance,  ATR, 
and  cartography. 

This  program  of  research  focuses  on  structural  in¬ 
terpretation  that  builds  up  part/whole  descriptions  of 
observed  objects  in  images  as  joints  between  observed 
GC parts. Structural interpretation is modular; we describe
significant results toward each module: 1. image
segmentation; estimation of extended curve discontinuities
in 2D; 2. figure/ground discrimination of generalized
cylinder (GC) primitive parts from background
for about 10 objects, several with multiple parts; 3. estimation
of 3D shape of GC primitive parts from 2D
image data for four GC parts; 4. estimation of shape
of compound objects from shapes of GC parts. These
methods are relevant to multi-sensor integration.

Recognition from among a moderately large set of
object models is close at hand. Structured 3D object
descriptions as joints between GC parts are the bases
for indexing and recognition using 3D models. Nevatia
and Binford demonstrated indexing and matching in
ways that are significant for monocular data [Nevatia
and Binford 73; Nevatia 74]. They used depth data to
generate coarse descriptions of 3D observed objects as
search keys in a 3D object model database and demonstrated
maximum likelihood matching of the 3D observed
model with a small subset of 3D object models obtained
by indexing.

*This research was supported in part by a subcontract to
Advanced Decision Systems, “Image Understanding Environments,”
from a contract to the Defense Advanced Research
Projects Agency, and by a contract from NASA
Ames, “Nap of the Earth Navigation by Helicopter.”

We  expect  rapid  progress  in  research  in  development 
of  these  modules  over  the  next  two  to  three  years.  The 
goals of this research over five years are: 1. segmentation
that is effective in complex scenes, especially outdoor
scenes; 2. interpretation of complex objects in
complex scenes with multi-sensor data; 3. integration
of segmentation and structured interpretation in Image
Understanding Environments; 4. low-complexity algorithms
that demonstrate feasibility and effective performance
in about 10 minutes on 20-MIPS workstations;
algorithm analysis and architecture studies that establish
feasibility of 0.1-second response.

This research integrates mathematical methods: 1.
invariant  and  quasi-invariant  observables  derived  from 
geometric  representation  (GCs);  2.  geometric  and 
physical  representation  as  systems  of  constraints;  3. 
Bayesian  networks  for  resolving  uncertain  constraints 
from  estimation  and  evidential  reasoning  with  multi¬ 
sensor  data. 

Segmentation 

Segmentation serves goals of the entire vision system,
especially  of  interpretation.  We  believe  that  we  are  at 
the  beginning  of  effective  and  practical  segmentation 
for  moderately  complex  images,  and  that  we  have  a 
road  map  for  rapid  progress  toward  moderate  success 
with  complex  images.  Here  are  basics  for  our  approach 
to  building  practical  and  theoretically  sound  segmen¬ 
tation.  What  seems  like  a  broad  spectrum  of  imaging 
problems breaks down into a small set of segmentation
components based on careful analysis of physics and
geometry  of  observation.  There  is  much  in  common  be¬ 
tween  segmenting  depth  data  and  segmenting  intensity 
data.  Multi-sensor  segmentation  is  achievable.  Com¬ 
plete  segmentation  is  feasible,  in  the  sense  of  a  basis 
set  of  segmentation  modules.  Segmentation  problems 
can  be  made  well-defined.  Segmentation  is  important 
to  functioning  of  practical  systems.  Segmentation  can 
be  made  effective  and  repeatable.  These  basics  must 
be  defended  by  quantitative  scientific  evidence. 

Binford  and  Wang  [forthcoming]  have  succeeded  in 
improving  the  performance  of  a  gradient-based  oper¬ 
ator,  starting  from  a  modified  Canny  operator.  They 
have  improved  sensitivity  by  a  factor  of  4,  i.e.  decreased 
the  threshold  for  gradient  magnitude  by  a  factor  of 
4. The only parameters used are measured parameters:
sensor noise variance and impulse response function.
Those parameters are measured from the sensor
or measured from image content. Analysis has eliminated
biases in transverse position, which improved the
standard deviation of transverse position by a factor
of 2 or more. The orientation estimate had a bias dependent
on curvature, typically 35 degrees. That bias was
reduced  and  then  eliminated  by  a  sequence  of  improve¬ 
ments.  They  defined  a  new  form  of  gradient  operator 
with  three  or  four  parameters:  orientation,  transverse 
position,  transverse  gradient  and  longitudinal  gradi¬ 
ent.  In  some  cases,  discrete  pixel  sampling  introduces 
biases.  They  analyzed  these  biases  of  discrete  sam¬ 
pling.  Analysis  in  all  cases  included  theoretical  char¬ 
acterization,  measurement  on  synthesized  images  and 
measurement  on  truth  in  real  images  which  humans 
had  characterized.  They  built  a  portable  environment 
for  experimentation  and  characterization  in  segmenta¬ 
tion.  In  the  end,  the  system  demonstrated  an  accurate 
2nd  order  statistical  model  proved  on  truth  in  real  data 
for  segmentation  by  this  new  operator.  The  operator 
made  an  impressive  segmentation  of  a  complex  scene 
for  a  real  application.  They  have  begun  to  apply  the 
methods  to  3D  images  to  address  multi-sensor  images. 

An  important  part  of  this  work  is  building  an  accu¬ 
rate  statistical  model  of  behavior  of  segmentation  that 
is  valuable  in  the  Bayesian  network  for  combining  evi¬ 
dence,  especially  in  multi-sensor  problems. 

This  operator  is  effective  for  step  edges  in  intensity 
images.  It  is  not  complete  in  the  sense  that  the  opera¬ 
tor  is  not  effective  for  many  image  features,  e.g.  lines 
or  spots.  The  author  has  promoted  development  of 
a  complete  set  of  operators  [Herskovits  and  Binford 
69].  Binford  and  Wang  have  begun  analysis  and  im¬ 
plementation  of  a  complete  basis.  When  the  complete 
set  exists,  it  will  still  be  ineffective  with  texture.  Over 
the  next  several  years,  the  author  plans  to  implement 
a  simple  effective  segmentation  for  outdoor  images  in¬ 
cluding  texture.  This  research  has  been  supported  in 
part by NASA Ames as part of a program for Nap of the
Earth  navigation  by  helicopters  (NOE). 


Figure/Ground 

A  structural  description  of  complex  objects  is  a 
part/whole  graph  with  GC  primitive  parts.  The  origi¬ 
nal  definition  of  GCs  by  generalized  translational  in¬ 
variance  [Binford  71],  more  generally  by  generalized 
transformational  invariance  (GTI),  provided  an  inverse 
method  for  detecting  GC  primitive  parts.  GTI  gen¬ 
erates  a  GC  by  a  transform  of  the  cross  section  as  a 
function  of  a  sweep  along  a  space  curve.  Inverse  GTI 
observes  GCs  (inverse  GCs)  by  estimating  the  inverse 
transform  among  cross  sections  and  swept  sides  of  GCs, 
i.e.  by  discovering  cross  sections  and  swept  sides  that 
are  related  by  a  simple  transform.  The  power  of  this 
method  comes  from  the  fact  that  complex  cross  sections 
with  simple  transforms,  together  with  composition  gen¬ 
erate  a  very  large  class  of  complex  objects  (including 
blending  and  articulation).  Complex  cross  sections  can 
themselves  be  simplified  as  ribbons,  i.e.  GCs  in  a  lower 
dimension  that  are  generated  by  sweeping  a  curve  along 
a  space  curve. 
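
The forward construction (sweep a cross section along a space curve while transforming it) can be sketched as below. The sketch makes strong simplifying assumptions that are not part of the definition above: the only transform is a uniform scaling, and the cross section is kept parallel to the xy-plane instead of being oriented along the axis. All names are hypothetical.

```python
import math


def circle(n, radius=1.0):
    """A circular cross section sampled at n points, as (x, y) pairs."""
    return [(radius * math.cos(2 * math.pi * k / n),
             radius * math.sin(2 * math.pi * k / n)) for k in range(n)]


def sweep(cross_section, axis, scale, steps):
    """Sample a generalized cylinder as a list of rings of (x, y, z) points.

    `axis(t)` gives the space-curve point and `scale(t)` the cross-section
    scaling for sweep parameter t in [0, 1].
    """
    surface = []
    for i in range(steps + 1):
        t = i / steps
        ox, oy, oz = axis(t)
        s = scale(t)
        surface.append([(ox + s * x, oy + s * y, oz) for x, y in cross_section])
    return surface
```

For example, sweep(circle(8), lambda t: (0.0, 0.0, 10.0 * t), lambda t: 1.0 - t, 4) samples a cone of height 10. The inverse problem discussed in the text is to recover the cross section and the scaling from observed rings and sides.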

Inverse  GTI  and  quasi-invariants  provide  powerful 
visual evidence. Bayesian networks implement a comprehensive
mechanism for consistent and effective combination
of evidence, visual and non-visual [Binford 87].

Inverse GTI correspondence relations utilize available
data from surfaces and surface boundaries, including
depth from stereo and motion, range measurement,
orientation from shading or photometric stereo, and
boundary discontinuity curves from monocular images.
In single monocular images, discontinuity curves and
shading are available.

GCs  are  generated  by  transforming  a  cross  section 
surface  by  a  congruence  relation  while  sweeping  the 
cross  section  along  a  space  curve.  Curves  on  cross  sec¬ 
tions  are  called  parallels;  curves  along  the  sweep  direc¬ 
tion  are  called  meridians,  including  limbs  and  edges. 
Most attention has been paid to observing and analyzing
limbs of GCs [Nevatia and Binford 73, Sumaneweera
et al 87, Ponce, Chelberg, Mann 87, Mohan and Nevatia
89]. For long, thin GCs in single, monocular images,
limbs are observable with low computational complexity,
while cross sections at ends are small and distant;
computational complexity is high for ends to make correspondence
by the methods described here. Parallels
and  meridians  are  complementary:  observing  cross  sec¬ 
tions  by  the  methods  used  here  has  low  complexity  for 
short,  fat  GCs,  e.g.  a  coin,  for  which  limbs  are  difficult 
to  make  correspondence  directly. 
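The generation rule above can be made concrete with a minimal sketch. It assumes the simplest subclass, a straight axis along z with a cross section that is only scaled during the sweep; the function name `sweep_cross_section`, the sampled circle, and the linear taper are illustrative choices, not taken from the paper.

```python
import math

def sweep_cross_section(cross_section, scale, n_steps=5, length=1.0):
    """Sweep a planar cross section along the z-axis, scaling it about the axis.

    cross_section: list of (x, y) points; scale: function z -> scale factor.
    Returns one scaled copy of the cross section per sweep position.
    """
    sections = []
    for k in range(n_steps):
        z = length * k / (n_steps - 1)
        s = scale(z)
        sections.append([(s * x, s * y, z) for (x, y) in cross_section])
    return sections

# Hypothetical cross section: a unit circle sampled at 8 points.
circle = [(math.cos(2 * math.pi * i / 8), math.sin(2 * math.pi * i / 8))
          for i in range(8)]

# Hypothetical sweeping rule: a linear taper, giving a truncated cone.
sections = sweep_cross_section(circle, scale=lambda z: 1.0 - 0.5 * z)

# Parallels: one curve per sweep position (the cross sections themselves).
parallels = sections
# Meridians: one curve per cross-section point, running along the sweep.
meridians = [list(column) for column in zip(*sections)]
```

In this toy surface the parallels and meridians form the two families of a grid on the surface, which is the sense in which the two kinds of curve carry complementary information.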

The community has neglected information about
cross sections, although the author intended from the
beginning to incorporate analysis of cross sections,
and requested a student to begin implementation in
1980. In this work we integrate information from cross
sections and limbs [Sato and Binford 92a]. For typical
objects, information about both cross sections and
limbs is available. Relations between cross sections
are much stronger than relations between limbs
(meridians). In fact, there are very strong relations
between cross sections that are the basis for rigorous
correspondence.

The experimental results make the method intriguing,
and the breadth of scope of analysis is novel.
Analytic results and the effectiveness of estimation
methods were realized by the author. Experimental
results were needed to be convincing. Experimental
results were presented at various stages: at INRIA
following ECCV 90 in May 1990, at the Encounter with
Mathematics and Computer Vision in May 1990, at the
DARPA IU Workshop 1990, in a paper rejected by CVPR
91, in a paper for ISIR, January 1991, and in the
DARPA-ESPRIT Workshop for Invariants in Computer
Vision, March 1991. Nevatia and colleagues have worked
with limbs [Rao and Nevatia 87; Mohan and Nevatia 89].
Recently, they have made observations about "parallel
symmetry" from symbolic data [Ulupinar and Nevatia 90]
without experimental data. [Gross and Boult 90] show
results using shading to infer properties of SHGCs.

Consider some GC subclasses. Cross section curves
(parallels) are parallel in space for the class of
cylinders with arbitrary cross sections, truncated by
parallel planes. Parallelism of curves is an affine
invariant and a perspective quasi-invariant [Binford
81]. Cross section curves for cones with arbitrary
cross sections truncated by planes are scaled in space
from a point. The scaling, i.e. the ratio of two
intervals on a line, is an affine invariant and a
perspective quasi-invariant. Cross section curves of
SHGCs are also scaled about a single point. Limbs and
edges along the swept sides of GCs (meridians) are
also scaled about the axis. We have potential low
complexity methods for a substantial superclass of
SHGCs.

Parallels and meridians of SHGCs are self-similar
under projection. For stereo and motion, curves are
self-similar. From our point of view, this is the
dominant use of invariants in vision: self-similarity
in generic vision, not pose estimation for objects
from memory. In this theme, each object is its own
model. This representation and analysis of
quasi-invariance guarantees that invariance is almost
universal in images [Binford 91]. True invariants are
few. Binford introduced quasi-invariants to extend the
set of quantitative relations. Quasi-invariants are
many. Our manifesto: The important properties of
objects are quasi-invariant and generically observable
with low computational and observational complexity
(i.e. with moderate image resolution). The
implications of this statement are extremely strong.

Figure 2a shows an edge image of an elbow. Figure
2b shows cross section curves of the female part of
the elbow. Figure 2c shows meridians of the female
part of the elbow. Correspondences were found by a
Hough transform method that finds curves related by a
constant scaling [Sato and Binford 92]. Figure 2d
shows a set of curves found on the male part of the
elbow. They are threads, a helix, outside the class of
SHGCs. Note that cross sections of an SHGC truncated
by planes appear non-straight except from viewpoints
parallel to the truncating planes, i.e. almost
everywhere. Figures 4-7 show figure/ground
discrimination with other objects.

A Hough transform method is used to discover
inverse GTI correspondence [Sato and Binford 92a]. For
cylinders, the transform is a constant translation,
parameterized by an angle and radius bounded by the
image. For a scaling transform, there are four
parameters: an origin with two parameters (possibly at
infinity) and two scale factors for two curves. The
Hough transform to discover the scaling transform is
rather computation-intensive and is somewhat
insensitive. Many who use Hough transforms have the
same complaints. The author planned initially to
implement a different algorithm with low complexity.
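The cylinder case, a constant translation between two curves, can be sketched as a voting scheme. This is only an illustration of the general Hough idea, not the implementation of [Sato and Binford 92a]: it assumes integer pixel coordinates, votes directly on the displacement (dx, dy), and converts the winning bin to an angle and radius afterwards. The function name `hough_translation` and the sample curves are hypothetical.

```python
import math
from collections import Counter

def hough_translation(curve_a, curve_b):
    """Vote for the displacement that maps points of curve_a onto curve_b.

    Every pair (point of a, point of b) casts one vote for its displacement;
    the true translation collects one vote per matching point.
    """
    votes = Counter()
    for (xa, ya) in curve_a:
        for (xb, yb) in curve_b:
            votes[(xb - xa, yb - ya)] += 1
    (dx, dy), count = votes.most_common(1)[0]
    return (dx, dy), count

# Hypothetical edge points on one cross section curve.
curve_a = [(0, 0), (1, 2), (2, 3), (4, 3), (5, 1)]
# The same curve translated by (7, -2), plus two clutter points.
curve_b = [(x + 7, y - 2) for (x, y) in curve_a] + [(20, 20), (3, 9)]

best, count = hough_translation(curve_a, curve_b)
angle = math.atan2(best[1], best[0])   # direction of the translation
radius = math.hypot(best[0], best[1])  # magnitude of the translation
```

The cost grows with the product of the two point counts, which is one way to see why the four-parameter scaling case is described above as computation-intensive.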

Together  cross  sections  and  meridians  define  the  sur¬ 
face  of  the  female  part.  The  areas  covered  by  parallels 
and  meridians  correspond.  The  meridians  are  ribs,  in 
fact,  not  limbs  of  the  GC.  The  limbs  of  the  cylinder 
are  not  visible,  but  in  fact  the  limbs  and  the  limit  of 
the  surface  defined  by  the  cross  sections  are  defined  rel¬ 
atively  accurately  by  the  correspondence  between  the 
two  curves.  Typically,  relations  between  meridians  are 
found  from  relations  between  cross  sections. 

Defining surfaces from discontinuities is an exercise
in inferring 3D from 2D image curves [Binford 81]. We
define  connected  components  in  space  from  connected 
components  in  the  image,  but  with  components  in  the 
image  defined  by  inverse  GTI,  not  by  closed  curves  as 
in  [Sumaneweera  et  al  88].  Note  that  noise  and  clut¬ 
ter  have  little  effect  on  figure/ground  discrimination. 
A  large  number  of  spurious  edges  from  background  or 
surface  markings  like  text,  or  missing  edges  from  failure 
of  segmentation  have  a  slight  effect  on  computational 
complexity  and  a  slight  effect  on  detection  of  inverse  re¬ 
lations.  Thus,  this  method  is  promising  for  images  with 
moderate  complexity,  but  not  for  images  with  texture. 

Estimating  3D  shape 

[Sato and Binford 92b, Mann and Binford 92]
describe programs that build a model of parts from
generic models, almost bottom-up from the image.
[Mann and Binford 92] also describe building the
structured object model from the joint of the two part
models. The resulting model is in the right form to
perform indexing from the stick figure methods of
[Nevatia and Binford 73].

We regard figure/ground discrimination by inverse
GTI and subsequent description of primitive parts and
compound objects as model-based in the sense that
SHGCs or other GC subclasses are generic models
[Binford 82]. As in [Binford 82], generic models occur
at all levels of the vision system, not just at the
top level. Figure 1 shows levels in the object
representation of the ultimate Bayesian net.

Cross  sections  of  SHGCs  are  scaled  through  a  point 
in  space,  hence  their  images  are  scaled  through  a  point 
in  the  image  (possibly  infinite).  Inverse  GTI  discovers 
a  set  of  cross  sections  related  by  a  scaling  for  SHGCs. 
For  meridians,  the  scaling  varies  along  the  axis.  A 
stronger  condition  holds  than  the  well-known  tangent 
condition  [Ponce,  Chelberg,  Mann  87;  Sato  and  Bin- 
ford  92b].  Meridians  contribute  to  estimating  the  3D 
sweeping  rule  between  cross  sections. 

[Sato  and  Binford  92a]  analyzed  about  10  examples 
thus  far.  Inverse  GTI  defines  inverse  GCs,  a  superset 
of  SHGCs  that  are  primitive  parts.  Inverse  GCs  are 
surfaces  that  must  satisfy  inference  rules  for  surface 
interpretation  that  are  valuable  here  [Binford  81].  Note 
that  several  examples  are  compound  parts  formed  of 
several  primitive  parts  joined. 

Quasi-invariants enable estimation of the sweeping
rule (scaling along the axis) and of the cross section
viewed along its normal. [Sato and Binford 92b]
estimated part shape using symmetry, as one method.
Results are shown in figures 4 and 5. [Mann and
Binford 92] hypothesize circular cylinders and helices
from generic models and fit least squares solutions,
shown in figures 3a and 3b.

Quasi-invariants  and  Probabilities 

An  observable  is  a  measurement  repeatable  by  dif¬ 
ferent  observers,  i.e.  invariant  under  isometries  of  the 
measurement  space,  or  an  invariant  functional  of  ob¬ 
servables,  e.g.  the  distance  between  two  identifiable 
points.  An  observable  may  be  invariant  under  trans¬ 
forms  other  than  isometries,  e.g.  perspective.  There 
are  a  few  perspective  invariants  that  are  useful  in  ma¬ 
chine  vision. 

Coincidence  is  a  true  invariant  but  it  is  never  ob¬ 
servable.  Incidence  is  an  implicit  assumption  in  inter¬ 
pretation  of  line  drawings;  incidence  is  central  in  in¬ 
terpreting  surfaces  from  curves  in  monocular  images. 
Non-coincidence  is  observable  and  quasi-invariant  un¬ 
der  rotation.  Non-coincidence  is  stable  for  points  until 
they  are  far  enough  away  from  the  observer  that  their 
apparent  distance  is  below  the  limiting  resolution.  If 
surfaces  are  non-coincident  or  non-smooth,  these  are 
quasi-invariant  observables  that  are  important  for  seg¬ 
mentation.  In  [Binford  87]  we  present  a  generic  ob¬ 
servability  model  for  matte  reflecting  surfaces.  A  quick 
summary  is  that  discontinuities  are  generic  observables, 
i.e.  observable  almost  everywhere. 

An observable is quasi-invariant with respect to
viewpoint under orthographic projection if it is
constant to second order under transforms
parameterized by viewing angle on the viewing sphere
[Binford 81, Binford, Levitt, Mann 87]. Observables
are quasi-invariant under perspective projection if
they are orthographic quasi-invariants and constant to
second order in the ratio of object dimension to
viewing distance on the infinite viewing ball, e.g. if
they are invariant under orthographic projection.
Observables are quasi-invariant with respect to source
orientation or location if they are constant to second
order in lighting angle on the unit lighting sphere,
i.e. an orthographic transform, or constant to second
order in the ratio of object dimension to source
distance on the infinite lighting ball, i.e.
quasi-invariant under lighting perspective
transforms.

Those  quasi-invariants  discussed  thus  far  have  been 
quasi-invariant  under  orthographic  or  perspective  pro¬ 
jection.  These  relations  are  much  stronger  than  stable 
under  general  viewpoint,  i.e.  relations  which  hold  on  an 
open  set  about  a  viewpoint.  Arguments  about  stability 
under  general  viewpoint  have  been  used  widely  in  anal¬ 
ysis  of  polyhedra  from  vertices.  Quasi-invariants  are 
stronger  in  the  sense  that  the  domain  over  which  they 
hold  is  large,  while  an  open  set  may  be  infinitesimal. 
There  is  a  natural  structural  hierarchy  for  generic  vi¬ 
sion  based  on  inclusion,  locality  in  the  parameter  space: 
properties  stable  under  general  viewpoint  hold  over  a 
small  support  in  parameter  space;  quasi-invariant  prop¬ 
erties  hold  over  a  large  support  in  measure  on  the  pa¬ 
rameter  space;  generic  properties  hold  except  on  a  com¬ 
pact  set  of  measure  zero,  i.e.  almost  everywhere.  In 
careless  notation:  the  domain  of  stable  under  general 
viewpoint  is  much  smaller  than  the  domain  of  quasi¬ 
invariant  which  is  the  same  order  of  magnitude  as  the 
domain  of  generic. 

Another formulation of quasi-invariants reasons
about inverses under projection or observation. Here
is an example of an inverse with a well-known
invariant. The cross ratio is an index set for
collinear 4-tuples of points in 3-space and their
projections in 2D. Sets of collinear 4-tuples in 2D or
3D that have the same cross ratio form equivalence
classes. The cross ratio is invariant under
perspective projection from 3-space to 2-space. Thus,
the cross ratio induces an inverse that is an identity
of the index sets, an identity between the cross ratio
in 2D and the cross ratio in 3D, and induces an
identity between equivalence classes in 2D and
equivalence classes in 3D, i.e. between images of
collinear 4-tuples and their inverse images.
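The invariance of the cross ratio can be checked numerically. The sketch below assumes a pinhole camera at the origin looking along z with focal length 1, and uses the cross ratio (AC * BD) / (BC * AD) for four collinear points taken in order; the specific line and point spacing are made up for illustration.

```python
import math

def cross_ratio(a, b, c, d):
    """Cross ratio (AC * BD) / (BC * AD) of four collinear points given in order."""
    return (math.dist(a, c) * math.dist(b, d)) / (math.dist(b, c) * math.dist(a, d))

def project(p, f=1.0):
    """Perspective projection of a 3D point onto the image plane z = f."""
    x, y, z = p
    return (f * x / z, f * y / z)

# Four collinear 3D points P0 + t * direction, all in front of the camera.
p0, direction = (1.0, 2.0, 5.0), (1.0, 0.5, 2.0)
points_3d = [tuple(p + t * d for p, d in zip(p0, direction)) for t in (0, 1, 3, 7)]
points_2d = [project(p) for p in points_3d]

cr_3d = cross_ratio(*points_3d)  # equals (3 * 6) / (2 * 7) = 9/7 from the spacing
cr_2d = cross_ratio(*points_2d)  # same value measured in the image
```

The image distances between the points differ from the 3D ones, but the index value computed from them is the same, which is what lets it label the equivalence class in both 2D and 3D.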

A quasi-invariant provides a correspondence between
index sets for structures in 3D and 2D. The
correspondence is a function of projection parameters.
For invariants, the inverse is unique from 2D index
sets and their equivalence classes to 3D index sets
and their equivalence classes, i.e. a delta function.
For quasi-invariants, the inverse is a function of
projection parameters (usually unknown), approximately
a delta function. A probability distribution over
viewing parameters induces a distribution over the
inverse of index sets in 2D to 3D and from
corresponding equivalence classes in 2D to equivalence
classes in 3D.
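One way to picture a distribution over viewing parameters inducing a distribution over an observable is a small Monte Carlo experiment. The sketch below is an assumption-laden illustration, not a method from the paper: it takes the angle between two 3D directions as a hypothetical observable, samples viewing directions uniformly on the sphere, and records the angle seen under orthographic projection. It does not check whether this observable meets the second-order definition given earlier.

```python
import math
import random

def project_out(v, d):
    """Orthographic projection of vector v along the unit viewing direction d."""
    dot = sum(a * b for a, b in zip(v, d))
    return tuple(a - dot * b for a, b in zip(v, d))

def angle_deg(u, v):
    """Angle in degrees between two non-zero vectors."""
    cos_angle = sum(a * b for a, b in zip(u, v)) / (math.hypot(*u) * math.hypot(*v))
    return math.degrees(math.acos(max(-1.0, min(1.0, cos_angle))))

def random_direction(rng):
    """Uniform random unit vector on the viewing sphere."""
    z = rng.uniform(-1.0, 1.0)
    phi = rng.uniform(0.0, 2.0 * math.pi)
    r = math.sqrt(1.0 - z * z)
    return (r * math.cos(phi), r * math.sin(phi), z)

def observed_angles(a, b, n, seed=0):
    """Projected angle between a and b for n random viewing directions."""
    rng = random.Random(seed)
    return [angle_deg(project_out(a, d), project_out(b, d))
            for d in (random_direction(rng) for _ in range(n))]

# Two perpendicular 3D directions (hypothetical scene structure).
a, b = (1.0, 0.0, 0.0), (0.0, 1.0, 0.0)
angles = observed_angles(a, b, n=2000)

# Fraction of sampled viewpoints whose observed angle is within 15 degrees
# of the true 90 degrees; a histogram of `angles` shows the full distribution.
fraction_near = sum(abs(x - 90.0) <= 15.0 for x in angles) / len(angles)
```

Reading the histogram in the other direction, from an observed 2D value back to likely 3D values, is the inverse distribution the paragraph above describes.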

Bayesian  Networks 

In a hierarchical vision system representation,
objects are composed of parts. Figure 1 shows a current
perceptual hierarchy. Uncertainty is inherently conditional
on  context,  i.e.  conditional  on  hypothesis  including 
viewing  conditions.  Physics  implemented  in  geometry 
provides  a  basis  for  determining  prior  and  conditional 
probabilities  for  relations  between  visual  elements.  All 
measurements  have  uncertainty.  Almost  all  our  con¬ 
straints  are  uncertain  constraints  involving  measure¬ 
ment.  Physical  constraints  as  uncertain  geometric  con¬ 
straints  provide  conditional  probabilities,  the  structure 
for  an  implicit  Bayes  net.  Solution  of  an  implicit  Bayes 
net  is  inherently  an  inverse  process.  Quasi-invariants 
provide  an  implicit  subnet  that  can  be  used  for  hypoth¬ 
esis  generation  to  control  evidential  accrual  [Levitt  88]. 
Quasi-invariants  provide  a  way  of  inverting  a  local  sub¬ 
net,  defined  by  an  object  primitive  part. 

Inference  from  images  in  machine  vision  is  massively 
ambiguous  and  errorful  because  the  evidence  provided 
in  an  image  relates  to  the  appearance  of  an  object, 
not  the  object  itself.  Evidence  in  support  or  denial 
of  a  given  object  is  partial  and  sometimes  incorrect 
due  to  noise  in  segmentation  and/or  errors  in  interpre¬ 
tation  algorithms.  On  the  other  hand,  there  is  typ¬ 
ically  an  abundance  of  evidence.  Bayesian  inference 
provides  a  framework  that  is  mathematically  coherent, 
and  thus  provides  a  sound  basis  for  accruing  belief  in 
support  or  denial  of  hypotheses  interpreting  observa¬ 
tions  as  physical  objects  and  their  relationships.  Evi¬ 
dence  accrual  by  Bayesian  networks  is  a  natural,  coher¬ 
ent  mechanism  for  globally  consistent  interpretation  of 
visual  evidence  from  multiple  sensors  and  sources  with 
non-visual  knowledge. 
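The accrual of belief from several pieces of evidence can be sketched for the simplest possible case: a single binary hypothesis with evidence assumed conditionally independent given the hypothesis. This is far smaller than the networks discussed here, and the prior and likelihood values are invented for illustration.

```python
def accrue(prior, likelihoods):
    """Posterior P(H | evidence) after sequential Bayesian updates.

    prior: P(H).
    likelihoods: list of (P(e | H), P(e | not H)) pairs, one per piece of
    evidence, assumed conditionally independent given H.
    """
    p_h, p_not = prior, 1.0 - prior
    for given_h, given_not in likelihoods:
        p_h *= given_h
        p_not *= given_not
        total = p_h + p_not
        p_h, p_not = p_h / total, p_not / total  # renormalize after each update
    return p_h

# Hypothetical hypothesis "this region is an elbow", with a low prior.
evidence = [
    (0.8, 0.2),  # supporting evidence, e.g. a cross-section match
    (0.7, 0.3),  # weaker supporting evidence
    (0.3, 0.6),  # conflicting evidence, e.g. a segmentation error
]
posterior = accrue(0.1, evidence)  # odds 1/9 * 4 * 7/3 * 1/2 = 14/27, i.e. 14/41
```

Conflicting evidence lowers the belief without discarding what was accrued earlier, which is the behavior wanted when individual observations are partial and sometimes wrong.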

Probability is used to accrue belief; [Levitt,
Binford, Mann 87] described initial experiments with
Bayesian nets. Utility theory is used to control
evidential accrual [Levitt, Binford and Ettinger 88].
Visual interpretation has a very large search space
that the system cannot search exhaustively. The
utility theory method has been used in practice and,
together with Bayesian nets, provides a comprehensive
method to integrate evidence in computer vision.
[Chelberg 89] discriminated objects (valve, elbow and
other) from these experiments with range data using
Bayesian networks based on these works.

Small  parts  of  the  entire  Bayesian  network  were  im¬ 
plemented  here  [Mann  and  Binford  92]. 

IUE: Image Understanding Environments

[Mann and Binford 92] describe a mathematically
based representation for constraints. The entire
SUCCESSOR effort has been oriented toward a geometric
and physical basis for its type system. Typical
systems have been image level with 3D range images,
terrain data. Simple modeling has been added. An aim
of SUCCESSOR has been to explore the level of
estimation and object interpretation: to implement,
incorporate, and refine evidential reasoning in the
form of Bayesian networks, symbolic constraint
expressions, segmentation and representation of image
features and structures, geometric models (generic,
abstract and specific), and quasi-invariant reasoning
about inverses from images. Considerable progress has
been made along several of those aspects of high level
vision.
Figure 1: Perceptual Hierarchy DAG for Object Interpretation including monocular images.

Figure 2: Segmentation into GC primitive parts of elbow joint image. (2b: Corresponding edges from cross sections of female cylinder from elbow.)

Figure 4: Figure/Ground discrimination and 3D shape estimation for Heart Shaped Cup.

Figure 5: Figure/Ground discrimination and 3D shape estimation for Rice Bowl. (5e: cross section viewed along normal.)

Abstract

Image  Understanding  research  at  CMU  ad¬ 
dresses  a  broad  spectrum  of  issues,  from  appli¬ 
cations  of  machine  vision  to  the  development 
of  new  sensors  and  basic  vision  science.  The 
focal  areas  of  our  research  include: 

•  Vision  for  Site  Modeling 

•  Physics-Based  Vision 

•  Parallel  Vision 

•  Sensor  Development 

•  Vision  for  Object  Recognition  and  Manip¬ 
ulation 

• Vision for Robot Vehicles

1  Vision  Methods  for  Site  Modeling 

Rapid site modeling - the generation of a detailed
three-dimensional model of a surveyed site - is a
critical problem for both military and civilian
applications, including geographic database generation
for cartography, reconnaissance, damage assessment,
combat simulation, and autonomous air/ground vehicle
navigation.

Central to the site modeling problem is the
development of efficient and reliable Image
Understanding techniques to analyze and extract
precise three-dimensional shape information from
multiple visual or range images taken from a moving
platform, such as a scouting ground/air vehicle, or a
stereoscopic camera. We have recently developed three
new techniques: 1) the factorization method for shape
and motion recovery from an image sequence; 2) the
multi-baseline stereo method for reliable and dense
depth mapping; and 3) the landmark object modeling
method from a range image sequence. All of these
methods have been tested with images taken in a
controlled laboratory environment (to provide
ground-truth data for quantitative evaluation) and
taken outdoors under real lighting and geometry
conditions. Their performance has been demonstrated to
exceed that of previous methods.


1.1  The  Factorization  Method 

The structure from motion problem - recovering scene
geometry and camera motion from a sequence of images -
has attracted much of the attention of the vision
community over the last decade, and yet it is common
knowledge that existing solutions work well for
perfect images, but are very sensitive to noise. There
are two fundamental reasons for this. First, when
camera motion is small, effects of camera rotation and
translation are conjugate: for example, rotation about
the z-axis and translation along the x-axis both
generate a very similar change in an image. Any
attempt to recover or differentiate between these two
motions is naturally noise sensitive. Second,
computation of shape as relative depth, for example,
the height of a building as the difference of depths
between the top and the bottom, is very sensitive to
noise, since it is a small difference between large
values. These difficulties are especially magnified
when the objects are distant from the camera relative
to their sizes, which is usually the case for
interesting applications such as site modeling.

Recently we (Tomasi and Kanade) [1][2] observed that
both difficulties disappear when the problem is
reformulated in world-centered coordinates, unlike the
conventional camera-centered formulation. This new
formulation links object-centered shape to image
motion directly, without using retinotopic depth as an
intermediate quantity, and leads to a simple and
well-behaved solution. Furthermore, the mutual
independence of shape and motion in world-centered
coordinates makes it possible to cast
structure-from-motion as a factorization problem, in
which a matrix representing image measurements is
decomposed directly into camera motion and object
shape.

More specifically, an image sequence can be
represented as a 2F x P measurement matrix W, which is
made up of the horizontal and vertical coordinates of
P points tracked through F frames. If image
coordinates are measured with respect to their
centroid, we prove the following rank theorem: under
orthography, the measurement matrix is of rank 3. As a
consequence of this theorem, we show that the
measurement matrix can be factored into the product of
two matrices M and S. That is, W = MS, where M is of
size 2F x 3 and encodes camera motion, and S is a
3 x P matrix which encodes shape in the coordinate
system attached to the object centroid.

The rank theorem precisely captures the nature of
the redundancy in an image sequence, and permits a
large number of points and frames to be processed in a
conceptually simple and computationally efficient way
to reduce the effects of noise. The resulting
algorithm is based on Singular Value Decomposition,
which is numerically well-behaved and stable. The
robustness of the recovery algorithm in turn enables
us to use an image sequence with a very short interval
between frames, which makes feature tracking
relatively easy.
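The rank theorem and the SVD-based factorization can be illustrated on synthetic, noise-free data. The sketch below assumes NumPy, invents a random shape and a simple rotating orthographic camera, and stops at a rank-3 factorization W = MS. A factorization of this kind is only determined up to an invertible 3 x 3 transform; resolving that ambiguity into true rotations and shape is a further step not shown here.

```python
import numpy as np

rng = np.random.default_rng(0)
F, P = 6, 10                        # frames and tracked points (arbitrary)
shape = rng.normal(size=(3, P))     # hypothetical 3D points, one per column

rows_u, rows_v = [], []
for f in range(F):
    a, b = 0.1 * f, 0.05 * f        # made-up camera rotation angles per frame
    Ry = np.array([[np.cos(a), 0.0, np.sin(a)],
                   [0.0, 1.0, 0.0],
                   [-np.sin(a), 0.0, np.cos(a)]])
    Rx = np.array([[1.0, 0.0, 0.0],
                   [0.0, np.cos(b), -np.sin(b)],
                   [0.0, np.sin(b), np.cos(b)]])
    R = Rx @ Ry
    t = rng.normal(size=2)          # arbitrary image translation per frame
    rows_u.append(R[0] @ shape + t[0])   # horizontal image coordinates
    rows_v.append(R[1] @ shape + t[1])   # vertical image coordinates

W = np.vstack(rows_u + rows_v)                # 2F x P measurement matrix
W_reg = W - W.mean(axis=1, keepdims=True)     # measure from the centroid

U, s, Vt = np.linalg.svd(W_reg, full_matrices=False)
M = U[:, :3] * np.sqrt(s[:3])                 # 2F x 3 motion factor
S = np.sqrt(s[:3])[:, None] * Vt[:3]          # 3 x P shape factor
```

With noisy measurements the fourth and later singular values are no longer zero, and keeping only the three largest is what suppresses the noise.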

We have demonstrated the accuracy and robustness of
the method in a series of experiments on laboratory
and outdoor sequences, with and without occlusions. In
a laboratory experiment where a camera is moved by
means of a high-precision positioning platform for
collecting the ground-truth data, it was shown that
the computed motion has errors consistently less than
0.1 degrees. The computed motion preserves all
discontinuities in the rotational velocities. The
errors in shape recovery were found to be less than
1%.

The method was also tested with image sequences of
outdoor scenes taken by a hand-held camcorder. Outdoor
images are harder to process than lab images because
of unpredictable lighting changes and camera "jaggle".
Also, in real applications features can appear and
disappear from image to image due to occlusions, which
produces a W with incomplete columns. However, we can
still apply the method successfully. The paper by
Tomasi and Kanade in these proceedings presents the
theory and experimental results of the factorization
method.

1.2 Multi-Baseline Stereo

Much  progress  has  been  made  in  methods  for  stereo  vi¬ 
sion  to  reconstruct  the  three-dimensional  world  from  im¬ 
ages  taken  from  slightly  different  points  of  view.  How¬ 
ever,  a  fast,  reliable  stereo  vision  system  for  general  vi¬ 
sion  use  remains  an  unrealized  goal  of  computer  vision. 
The  main  difficulty  is  still  the  correspondence  problem  - 
to  get  high  precision,  the  cameras  should  be  far  apart,  but 
then  it  is  difficult  to  match  corresponding  points. 

One of the most common methods used to deal with
the problem is a coarse-to-fine control strategy, in
which coarse resolution matching removes false matches
and high resolution gives a precise depth value. Using
the coarse-to-fine strategy, however, does not always
remove false matches, especially when there is
inherent ambiguity in matching such as a repeated
pattern over a large region (e.g., a scene of a picket
fence).

We (Kanade and Okutomi) [3] developed a new
technique called Multi-Baseline Stereo which uses
multiple images obtained by cameras that are laterally
displaced (horizontally, vertically, or both) to
produce different baselines. The technique shares
features with the trinocular stereo, the EPI method,
and Kalman-filter based methods in that it utilizes
multiple images to increase the accuracy and precision
of matching, but the computational approach is very
different. Previous methods obtain intermediate
candidate matchings over multiple stereo images and
check the consistency among these to find the correct
combinations with increased precision. Since the
intermediate decisions on correspondences are
inherently noisy and ambiguous, finding the correct
combinations requires sophisticated consistency checks
and search or filtering.

In contrast, the Multi-Baseline Stereo uses the simple
fact that for any baseline, disparity divided by the
baseline distance is constant for each point.
Therefore if we represent evidences of matching from
individual stereo image pairs with respect to this
constant, rather than the disparity d as is usually
done, then they should consistently indicate the
correct matching position. We may then add, integrate,
or fuse such evidences of matching so that the
resultant integrated evidence uniquely determines the
correct match.
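The constancy of disparity over baseline follows from the usual stereo geometry. The relation below is a sketch under the standard assumption of parallel pinhole cameras; the symbols B (baseline), F (focal length) and z (distance to the scene point) are introduced here for illustration and are not all defined in the text above.

```latex
\[
  d = \frac{B\,F}{z}
  \quad\Longrightarrow\quad
  \frac{d}{B} = \frac{F}{z}
\]
```

The right-hand side depends only on the point's distance and the focal length, not on B, so every stereo pair should agree on it. It is proportional to the inverse distance 1/z.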

We have implemented this idea by using the SSD (sum
of squared differences) over a small window as the
simplest and most effective measure of matching. We
represent the SSD values from individual stereo pairs
with respect to the inverse distance 1/z. The
resulting SSD functions from all stereo pairs are
added together to produce the sum of SSDs, which we
call SSSD-in-inverse-distance. We have proven that the
SSSD-in-inverse-distance function exhibits a unique
and clear minimum at the correct matching position
even when the underlying intensity patterns of the
scene include ambiguities or repetitive patterns.

How this method works can be best illustrated by the
example in figure 1, which shows two of the ten images
of a scene. The top part of the image (grid pattern)
is completely repetitive, so the matching is
inherently ambiguous for a point in that region.
Figure 2 shows the SSSD-in-inverse-distance for such a
point. The bottom curve, which is obtained by a single
baseline, shows multiple minima. We can observe,
however, that as the number of baselines increases to
two, four and eight, the SSSD-in-inverse-distance has
a better-defined minimum. The computation is
completely local, and does not involve any search,
optimization, or smoothing.
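The effect of summing SSD curves over baselines can be reproduced in a one-dimensional toy setting. The sketch below is not the authors' implementation: it assumes a perfectly periodic integer "scene", integer baselines, and an integer index k standing in for the scaled inverse distance, so that the disparity for a baseline b is simply b * k. All names and values are hypothetical.

```python
PERIOD = 12
PATTERN = [0, 3, 1, 4, 1, 5, 9, 2, 6, 5, 3, 8]  # one period of a repetitive texture

def scene(x):
    """Intensity of the reference view at pixel x (periodic in x)."""
    return PATTERN[x % PERIOD]

def image(baseline, k_true):
    """View from a camera displaced by `baseline`; disparity = baseline * k_true."""
    return lambda x: scene(x - baseline * k_true)

def ssd(ref, img, baseline, k, window):
    """SSD between the reference window and the window shifted by baseline * k."""
    return sum((ref(x) - img(x + baseline * k)) ** 2 for x in window)

k_true = 7                   # true scaled inverse distance of the point
baselines = [3, 4]
candidates = range(12)       # candidate values of k
window = range(100, 112)     # one full period around the point of interest
ref = image(0, k_true)

# One SSD curve per baseline, each expressed over k rather than over disparity.
curves = {b: [ssd(ref, image(b, k_true), b, k, window) for k in candidates]
          for b in baselines}
# Sum of SSDs over baselines, evaluated at each candidate k.
sssd = [sum(curves[b][k] for b in baselines) for k in candidates]
best_k = sssd.index(min(sssd))
```

Each single-baseline curve has several zero-valued minima, at different places for different baselines because the repetition period maps to a different interval in k. Only the true k is a minimum of every curve, so the sum has a single minimum there.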

The method has been tested with both indoor and
outdoor scenes. The paper by Kanade, Okutomi, and
Nakahara in these proceedings presents the theory and
experiments that we have produced so far.

Figure 2: Combining multiple baseline stereo pairs


1.3  Landmark  Modeling  from  Range  Image 
Sequence 

We (Hebert) have been developing techniques [4] to map
the environment of a mobile robot using a laser range
finder. We have extended the map building techniques
to build explicit three-dimensional models of the
landmarks and to use those models for navigation by
matching the models with observed object models [5].
Including explicit shape models allows for selective
landmark identification and for a more detailed map of
the environment.

The object models are built by gathering range data
from several closely-spaced images. The features and
data collected on each object are used to fit a
discrete surface, represented by a mesh of points,
which constitutes the object model stored in the map.
This approach does not involve an explicit
segmentation of the observed scene. Instead, features
extracted from individual range and reflectance images
are grouped into clusters corresponding to objects in
the scene. The features are range and reflectance
edges and near-vertical regions. Each cluster is
assumed to correspond to one object. Clusters are
tracked from image to image using a previously
developed matching algorithm. For each object, the
surface fitting process is iterated using data and
features from the clusters in the new and previous
images.

We  have  implemented  and  tested  these  algorithms  us¬ 
ing  the  Navlab  and  Navlab  II  as  testbed  vehicles  with 
the  Erim  laser  range  finder.  The  Perception  range  finder, 
which  has  better  spatial  resolution,  was  also  used  for 
off-line  experimentation.  Models  of  natural  and  man¬ 
made  objects  were  successfully  built  and  matched  with 
observations  during  vehicle  travel,  yielding  correct  ve¬ 
hicle  position.  In  our  experiments  so  far,  the  perception 
system  was  tested  in  isolation,  gathering  image  and  posi¬ 
tion  data,  building  and  matching  models  without  actually 
sending  driving  commands  or  position  corrections  to  the 
vehicle.  Our  goal  is  now  to  incorporate  the  algorithms 
in  the  existing  Navlab  navigation  environment  to  demon¬ 
strate  improved  mission  capabilities. 


2  Physics-Based  Vision 

It  has  long  been  recognized  that  traditional  feature  detec¬ 
tion  such  as  edge-finding  acts  in  ignorance  of  the  optical 
physics of vision, and thus provides an unreliable
foundation for machine vision. To address this problem, the
CMU Image Understanding group has been spearheading
physics-based vision that performs an explicit analysis
of the constraints of optics, physics, geometry, and
uncertainty  involved  in  vision  [6].  We  continue  to  make 
progress  in  this  area  for  the  analysis  of  appearance  includ¬ 
ing  color  and  interreflection,  surface  roughness,  texture, 
photometric  stereo,  and  image  acquisition. 

2.1 Visual Analysis of Appearance

To  date,  physics-based  vision  research  has  primarily  been 
limited to trivialized scenes with a single surface of
relatively simple properties, analyzed under fairly ideal
conditions. This is a far cry from the analysis of the complex
scenes found in real-life situations, with mixtures of
materials, shapes, colors, and textures, uneven illumination,
and  many  objects  present  in  juxtaposition. 

Clearly, such complexity is beyond our ability to
describe and analyze at present. However, this analysis
of the appearance of objects in arrangements is a
necessary evolution from the study of physics-based vision in
idealized  scenes. 

Understanding  Interreflection  Between  Objects 

The computer graphics community has modeled
interreflection for some time, but it presents very formidable
problems for machine vision. This is because
interreflection can take so many forms. In one extreme case,
reflection from a mirror can take the form of the
appearance of nearby objects. Recent work in machine vision
has addressed the analysis of interreflection on a single
concave surface, and the detection (but not analysis) of
interreflection among multiple colored surfaces.

In our work, we (Novak and Shafer) address the
understanding, modeling, and analysis of interreflection among
multiple objects. Even if we restrict ourselves to
dielectric materials that obey the Dichromatic Reflection Model
[7]  [8],  the  analysis  is  very  difficult;  to  provide  sufficient 
information  for  analysis,  we  are  studying  color  images. 
According to the Dichromatic model, the color histogram
for a single surface will form a "skewed T" shape within
a plane, with a baseline corresponding to the diffuse color
of the surface and a branch in the direction of the color
of the highlight (i.e. the illumination color). We have
shown that interreflection between two surfaces causes
this histogram to have a more complex shape, which will
not generally be planar. In the area of the interreflection,
there  are  several  components  due  to  body/body,  surface/ 
body,  body/surface,  and  surface/surface  interreflection. 
These  components  interact  in  a  highly  structured  manner 
to  modify  the  distribution  in  the  color  histogram. 
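
The planarity claim can be checked numerically. The following Python sketch assumes NumPy, invented body and illumination colors, and a deliberately crude stand-in for body/body interreflection (white light tinted by a second surface before being reflected by the first). It illustrates the geometry implied by the Dichromatic Reflection Model and is not the analysis procedure of Novak and Shafer.

```python
import numpy as np

def dichromatic_pixels(body_color, illum_color, m_body, m_surf):
    """RGB pixels as m_body * body_color + m_surf * illum_color (one row per pixel)."""
    body_color = np.asarray(body_color, dtype=float)
    illum_color = np.asarray(illum_color, dtype=float)
    return np.outer(m_body, body_color) + np.outer(m_surf, illum_color)

def planarity_residual(pixels):
    """Ratio of smallest to largest singular value; near zero when the
    color cluster lies in a plane through the origin of RGB space."""
    s = np.linalg.svd(np.asarray(pixels, dtype=float), compute_uv=False)
    return s[-1] / s[0]

rng = np.random.default_rng(0)
n = 500
body_a = np.array([0.8, 0.2, 0.1])   # hypothetical diffuse color of surface A
body_b = np.array([0.1, 0.3, 0.9])   # hypothetical diffuse color of surface B
illum = np.array([1.0, 1.0, 1.0])    # hypothetical white illumination

m_body = rng.uniform(0.1, 1.0, n)
# Only about 10% of the pixels are given a highlight term.
m_surf = np.where(rng.uniform(size=n) < 0.1, rng.uniform(0.0, 1.0, n), 0.0)
single = dichromatic_pixels(body_a, illum, m_body, m_surf)

# Crude body/body interreflection: white light tinted by B, then by A.
m_inter = rng.uniform(0.0, 0.5, n)
mixed = single + np.outer(m_inter, illum * body_b * body_a)

print("single surface residual:", planarity_residual(single))
print("with interreflection:   ", planarity_residual(mixed))
```

With a single surface the residual is at the level of floating-point noise, since every pixel is a combination of two color vectors. The added interreflection term introduces a third, linearly independent color direction, so the cluster leaves the plane.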

The  regularity  of  these  structures  in  the  histogram, 
however,  forms  a  basis  for  analysis  of  surface  properties. 
For example, the main highlight on the object corresponds
to the main branch of the color histogram. The length
of  this  branch  is  inversely  related  to  the  roughness  of 
the  surface,  because  a  smooth  surface  concentrates  the 
highlight  into  a  narrow  and  hence  very  bright  spot,  while 
a  rough  surface  diffuses  the  highlight  and  thus  produces 
a  shorter  branch  in  the  histogram.  Also,  the  intersection 
of  this  branch  with  the  branch  for  the  (diffuse)  body 
reflection color will be broad when the surface is rough
and  the  highlight  is  spread  over  a  broad  patch  of  the 
surface.  If  the  surface  is  smoother,  the  intersection  of 
these  branches  will  be  narrow  because  the  highlight  is 
concentrated  over  a  small  spot  on  the  object  surface.  The 
point  at  which  these  branches  meet  corresponds  to  some 
amount  of  body  reflection,  which  in  turn  tells  the  angle 
of illumination on the surface.

In the case of interreflection, there are many more
structures in the histogram. For example, the
surface/surface reflection causes a "secondary highlight" on
the surface, which corresponds to a new branch present
in  the  histogram.  This  has  no  counterpart  when  looking 
at  a  single  surface  without  interreflection.  The  secondary 
highlight  is  very  revealing,  because  its  length,  angle,  and 
position  all  are  related  to  the  roughness  of  the  two  sur¬ 
faces  involved. 

Other  structures  in  the  color  histogram  under  inter¬ 
reflection  reveal  information  about  the  geometric  rela¬ 
tionship  of  the  surfaces  and  the  light  source,  and  even 
about  the  relationship  of  the  reflectance  spectra  of  the 
surfaces’  body  reflection.  Our  work  in  this  area  is  now 
aimed  at  demonstrating  each  of  these  analyses  of  the  color 
histogram  and  determining  the  reliability  and  usefulness 
of  each.  To  the  extent  that  we  succeed,  we  may  be  able 
in  the  future  to  analyze  and  understand  interreflection  of 
colored  objects  and  overcome  the  confusion  suffered  by 
current  methods  for  color  image  segmentation. 


Determining  Surface  Roughness  from  Reflection 

When  a  person  looks  at  an  object,  or  an  image  of  an 
object,  it  is  easy  to  determine  in  qualitative  terms  whether 
the  object’s  surface  is  smooth  or  rough.  The  primary 
source  of  this  information  is  the  sharpness  or  blurriness  of 
the  edges  of  reflections  seen  on  that  surface.  If  the  surface 
is  smooth,  reflections  seen  there  will  be  very  sharp-edged 
and clear, as in the commercial where the woman holds up
a dinner plate fresh out of the dishwasher and exclaims,
"I  can  see  myself!"  On  the  other  hand,  if  the  surface  is 
rough  or  dirty,  then  reflections  seen  on  that  surface  will 
have fuzzy edges and be indistinct. This is the aspect
of appearance sometimes called "distinctness-of-image
gloss".

We  (Stone  and  Shafer)  are  studying  how  to  analyze  this 
property  in  real  images  to  estimate  surface  roughness. 
We  begin  by  noting  that  the  most  revealing  aspect  of 
surface  reflections  is  the  blurriness  of  a  reflection  of  a  step 
edge.  For  example,  if  there  is  a  light  source  in  the  scene, 
its  reflection  is  apparent  as  a  highlight  on  the  surface. 
Frequently,  other  objects  or  secondary  light  sources  (such 
as  windows)  cast  light  onto  a  surface,  whose  reflection 
similarly reveals the roughness of the reflecting surface.

In  terms  of  mathematical  physics,  these  situations  are 
all  the  same  -  a  source  with  a  step  edge  casting  light 
onto  a  reflecting  surface  that  reflects  the  light  into  the 
camera.  The  roughness  of  the  reflecting  surface  in  ef¬ 
fect  convolves  the  incident  step  edge  with  some  blurring 
function; this gives rise to the appearance of a blurred
reflection  edge  in  the  image.  We  have  developed  a  math¬ 
ematical  characterization  of  this  convolution  based  on  the 
Torrance-Sparrow  reflection  model,  which  is  very  widely 
used  and  accepted.  By  identifying  the  edge  of  the  reflec¬ 
tion  and  analyzing  intensity  profiles  across  that  edge,  we 
can  estimate  the  probability  distribution  function  of  the 
surface microfacet normals in the surface microstructure.
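
The idea of recovering the blurring function from an edge profile can be sketched in a few lines of Python. This is a one-dimensional illustration only: it assumes NumPy, and it assumes a Gaussian blur kernel as a stand-in for the facet-normal distribution. The mapping from image blur to microfacet slopes under the Torrance-Sparrow model depends on the viewing geometry and is not modeled here.

```python
import math
import numpy as np

def blurred_step(x, sigma):
    """Unit step at x = 0 convolved with a Gaussian of standard deviation sigma."""
    return np.array([0.5 * (1.0 + math.erf(v / (sigma * math.sqrt(2.0)))) for v in x])

def estimate_blur_sigma(x, profile):
    """Differentiate the edge profile to recover the blur kernel, then
    return the kernel's standard deviation (x must be uniformly spaced)."""
    dx = x[1] - x[0]
    kernel = np.clip(np.gradient(profile, dx), 0.0, None)
    kernel /= kernel.sum() * dx                      # normalize to unit area
    mean = (x * kernel).sum() * dx
    return math.sqrt((((x - mean) ** 2) * kernel).sum() * dx)

x = np.linspace(-10.0, 10.0, 2001)
smooth = estimate_blur_sigma(x, blurred_step(x, 0.5))   # sharper reflected edge
rough = estimate_blur_sigma(x, blurred_step(x, 1.5))    # blurrier reflected edge
print(smooth, rough)
```

Because a blurred step is the integral of the blur kernel, its derivative across the edge returns the kernel itself; the kernel width then serves as a roughness measure.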

To  date,  we  have  analyzed  only  the  simplest  arrange¬ 
ment  of  a  very  large  source  with  a  single  clearly  defined 
step  edge,  reflecting  from  a  planar  surface  into  the  cam¬ 
era.  We  have  found  diffraction  gratings,  under  particular 
conditions,  to  be  useful  as  a  test  object,  since  the  surface 
facet  normals  distribution  is  well-known.  Our  ongoing 
work  will  attempt  to  generalize  all  of  these  assumptions. 
Our  goal  is  to  be  able  to  look  into  a  complex  scene,  iden¬ 
tify  a  reflected  pattern  on  a  surface,  and  by  analyzing  its 
blurriness or sharpness, compute the roughness of that
surface. 

Analyzing  Surface  and  Image  Texture 

3D  texture  analysis  has  been  a  difficult  area  for  ma¬ 
chine  vision,  because  a  texture  by  definition  has  a  com¬ 
plex  geometry  in  the  image.  A  good  way  to  characterize 
texture is by the Fourier transform, but unfortunately this
destroys  inherent  geometric  information.  Instead,  we 
use a form of local Fourier transform called the image
spectrogram, which is a kind of dense wavelet
decomposition of the image. In the spectrogram, a local Fourier
transform is computed in a neighborhood around each
pixel  in  the  image.  Thus,  the  spectrogram  is  S(x,y,u,v) 
where  x  and  y  are  image  coordinates  and  u  and  v  are 
spatial  frequencies.  The  spectrogram  allows  a  unified 
treatment of geometry, Fourier transforms, wavelets, and
image resolution and sampling, which makes it
particularly useful for analyzing 3D image textures. We have
investigated other similar representations including
Gabor functions, orthonormal wavelets, steerable filters, and
the Wigner distribution. We find the image spectrogram
to be superior for 3D machine vision because it gives a
dense representation of space/frequency information and
allows all the mathematics of Fourier transforms to be
brought  to  bear  for  geometric  image  analysis. 
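
A minimal Python sketch of a spectrogram S(x,y,u,v) follows. It assumes NumPy; the window size, the Hann window, and the sampling step are illustrative choices and are not taken from the authors' implementation.

```python
import numpy as np

def image_spectrogram(image, half=8, step=4):
    """Magnitude of a windowed local FFT on a grid of pixels.
    Returns an array of shape (ny, nx, 2*half, 2*half), indexed (y, x, v, u),
    with zero frequency shifted to index `half` on both frequency axes."""
    image = np.asarray(image, dtype=float)
    w = 2 * half
    window = np.outer(np.hanning(w), np.hanning(w))
    ys = range(half, image.shape[0] - half + 1, step)
    xs = range(half, image.shape[1] - half + 1, step)
    S = np.empty((len(ys), len(xs), w, w))
    for i, y in enumerate(ys):
        for j, x in enumerate(xs):
            patch = image[y - half:y + half, x - half:x + half]
            patch = patch - patch.mean()             # suppress the DC term
            S[i, j] = np.abs(np.fft.fftshift(np.fft.fft2(patch * window)))
    return S

# Test pattern: 4 cycles per 16-pixel window along x, constant along y.
cols = np.arange(64)
image = np.tile(np.cos(2 * np.pi * 4 * cols / 16), (64, 1))
S = image_spectrogram(image)
v_idx, u_idx = np.unravel_index(np.argmax(S[5, 5]), S[5, 5].shape)
print(S.shape, v_idx, u_idx)
```

For this pattern the peak of each local spectrum sits at zero vertical frequency and at plus or minus 4 cycles per window horizontally. On a slanted textured plane the peak location would drift from window to window, which is the cue exploited for shape from texture.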

Our  primary  study  (Krumm  and  Shafer)  has  been  on 
the  analysis  of  texture  gradients  on  3D  surfaces  by  using 
the spectrogram. We published in the 1990 DARPA IU
Workshop proceedings our theory that unified many
important image phenomena including texture gradients,
aliasing, and lens parameters [Krumm91]. Now, we
have expanded our work from one-dimensional to
two-dimensional signals (images) and developed a working
algorithm for computing shape from texture.

Our  new  shape-from-texture  algorithm  exploits  the 
systematic changes in apparent frequency in the image
of a textured shape. On a plane viewed under
perspective, the spatial frequencies of the texture increase as
the plane recedes from the viewer. We compute the
local Fourier transform at various points on such a plane.
These transforms are related by an approximately affine
transformation of the u-v frequency axes; the parameters
of that affine transform are determined by the orientation
of  the  plane.  By  computing  the  affine  transform  to  match 
the  local  Fourier  transforms,  we  can  calculate  the  slope 
of  the  plane.  This  method  requires  no  feature  detection; 
it works on relatively small neighborhoods, which should
allow  us  to  extend  it  for  curved  surfaces. 

2.2 Specular Lobe Objects - Four Light
Photometric  Stereo 

The brightness of a pixel in an image results from the
reflection of light. The amount of light reflected depends on
characteristics of the reflecting surface, and imaging
geometry. Two types of reflectance models have been used
to predict image brightness: geometrical optics models
and physical optics models. Geometrical optics models
are approximate models. They are appropriate when the
wavelength of light is much smaller than the roughness of
the reflecting surface, and are much simpler than physical
optics  models.  A  unified  reflectance  model  was  proposed 
by  Nayar,  Ikeuchi,  and  Kanade  [9],  which  combines  the 
geometrical and physical models. This model is able to
predict the brightness of surfaces that exhibit a diffuse
lobe, specular lobe, and specular spike.

By  using  the  unified  reflectance  model,  we  developed 
a  photometric  sampler  device  [10]  that  determines  the 
surface  orientation  and  reflection  parameters  of  surfaces 
exhibiting a linear combination of Lambertian and
specular spike components. A photometric sampler device
uses  a  series  of  extended  light  sources  to  sequentially 
illuminate  the  object  under  inspection. 

Another  interesting  class  of  surfaces  is  those  which 
exhibit a linear combination of Lambertian and specular
lobe components. We (Solomon and Ikeuchi) have
developed an algorithm that extracts the surface orientation
and reflectance parameters for this class of objects. The
method, which we call the four light photometric stereo,
uses  four  lights  to  sequentially  illuminate  the  object  under 
inspection.  The  lights  are  positioned  so  that  the  specular 
lobes of each light source do not intersect. The four lights
produce  three  types  of  regions  on  an  object:  regions  il¬ 
luminated  by  all  four  lights,  regions  illuminated  by  three 
lights,  and  regions  illuminated  by  only  two  lights.  Each 
of  these  regions  provides  different  information. 
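
To make the role of the fourth light concrete, the Python sketch below handles only the simplest case, a point lit by all four sources with at most one measurement inside a specular lobe. It uses the classical heuristic of discarding the brightest measurement and solving the Lambertian equations with the remaining three. This is a simplified stand-in, not the algorithm of Solomon and Ikeuchi; NumPy, the light directions, and the specular offset are all assumptions, and the roughness estimation is not shown.

```python
import numpy as np

def lambertian_normal(lights, intensities):
    """Least-squares solution of lights @ (albedo * normal) = intensities."""
    g, *_ = np.linalg.lstsq(lights, intensities, rcond=None)
    albedo = np.linalg.norm(g)
    return g / albedo, albedo

def drop_brightest_normal(lights, intensities):
    """Discard the brightest of four measurements, treating it as the one
    most likely to contain a specular contribution, and solve with the rest."""
    keep = np.argsort(intensities)[:3]
    return lambertian_normal(lights[keep], intensities[keep])

# Four hypothetical light directions, normalized to unit length.
lights = np.array([[1, 0, 2], [-1, 0, 2], [0, 1, 2], [0, -1, 2]], dtype=float)
lights /= np.linalg.norm(lights, axis=1, keepdims=True)

true_normal = np.array([0.1, 0.2, 1.0])
true_normal /= np.linalg.norm(true_normal)
true_albedo = 0.7

intensities = true_albedo * lights @ true_normal
intensities[2] += 0.5   # hypothetical specular lobe contribution under light 2

naive_normal, _ = lambertian_normal(lights, intensities)
normal, albedo = drop_brightest_normal(lights, intensities)
print(naive_normal, normal, albedo)
```

Using all four measurements biases the estimate, because the specular term violates the Lambertian assumption; the three remaining measurements recover the normal and albedo exactly in this noise-free example.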

For  each  region  an  algorithm  has  been  developed  to 
extract  the  shape  and  roughness  of  objects  that  exhibit 
a specular lobe. The work extends photometric stereo
techniques for specular lobe objects, from the regions
illuminated by all four light sources to the entire Gaussian
sphere. We have tested our algorithm with real specular
lobe objects, including a specular painted sphere and a
plastic helmet.

2.3 Research in Image Acquisition

To  carry  out  studies  in  physics-based  analysis  of  images, 
it is important to have high-quality image data.
Acquiring such data demands high-precision imaging equipment
and careful calibration, so that errors in experimental
results can be properly attributed to limitations of
equipment, calibration, physical modeling, or computation.

We (Willson and Shafer) have just completed the
construction of a second automated camera system for the
Calibrated  Imaging  Lab.  The  new  system  consists  of  a 
motorized  13x  zoom  lens  and  a  cooled  slow-scan  sci¬ 
entific  camera.  The  new  lens  features  11,100  steps  of 
resolution  for  focal  length  (zoom),  4,000  steps  for  focus 
distance,  and  2,700  steps  for  aperture.  The  new  camera 
features  12  bits  of  dynamic  range,  direct  digitization  of 
each cell of the sensor element, a 400:1 signal-to-noise
ratio, and fully variable exposure times. We are presently
doing the radiometric and geometric calibration of the
camera  system. 

We are now studying range-from-focus, and have
made  progress  in  Constant-Magnification  Focusing  and 
understanding  spatial  aliasing  in  cameras.  Constant- 
Magnification  Focusing  deals  with  the  problem  of  focus 
magnification,  the  slight  change  in  image  magnification 
that occurs when a lens is focused [Willson91]. This
can disturb the calculations of traditional focus-quality
measures,  resulting  in  errors  in  range-from-focus  compu¬ 
tations.  With  Constant-Magnification  Focusing,  we  per¬ 
form  a  very  small  zoom  adjustment  as  the  lens  is  focused, 
in  order  to  maintain  a  constant  image  magnification.  This 
eliminates  the  bias  caused  by  focus  magnification. 

3  Parallel  Vision 

The  big  news  this  year  is  the  arrival  of  the  first  iWarp 
computers  from  Intel.  We  have  received  four  iWarp  com¬ 
puters,  three  8x8  arrays  and  one  8  cell  array  (which  will 
be expanded to 16 cells). The arrival of these machines is
the  culmination  of  years  of  joint  work  between  Carnegie 
Mellon  and  Intel  in  the  design  of  the  iWarp  chip  and  its 
associated  software. 

iWarp is explicitly designed for use in signal and
image processing; it incorporates a very high per-chip I/O
rate  (320  MB/s  total  over  eight  physical  pathways)  with 
high performance (20 MFLOPS and a comparable
number of MIPS), as well as a fast memory interface (160
MB/s).  The  high  I/O  rate  is  essential  for  signal  and  im¬ 
age  processing  operations,  in  which  the  ratio  between 
computation  and  I/O  is  much  lower  than  for  scientific 
computing.  The  iWarp  chips  can  communicate  either 
through  conventional  message-passing,  or  through  sys¬ 
tolic  I/O  in  which  words  of  data  are  fed  to  the  pathway 
directly  from  the  computational  unit,  with  extremely  low 
latency. Non-adjacent chips can communicate through
long-distance connections mediated by the hardware of
intervening  cells. 

Our  work  on  parallel  computer  vision  has  led  to  the 
commercial  support  of  the  Apply  programming  language 
[11] by Intel for iWarp. Apply allows the
programmer to easily write any local image processing operation
(such as point operations, edge detection, and
smoothing) and have it automatically mapped efficiently onto
iWarp, without programmer knowledge of details such
as  data  distribution  and  interprocessor  communication. 
Apply  comes  with  a  large  library  of  image  processing 
programs  called  the  WEB  library,  originally  developed 
on  the  Carnegie  Mellon  Warp  machine.  Apply  and  iWarp 
are  already  being  used  in  a  number  of  labs,  including 
General  Electric  Aerospace. 
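
The division of labor described above, in which the programmer writes only the per-window operation while the system handles data distribution, can be imitated in Python. The sketch below is not Apply syntax and says nothing about how the Apply compiler maps code onto iWarp; the function names, the row-strip decomposition, and the use of NumPy are all assumptions made for illustration.

```python
import numpy as np

def _run_on_padded(padded, window_fn, radius):
    """Apply window_fn to every full (2r+1) x (2r+1) window of a padded block."""
    size = 2 * radius + 1
    rows = padded.shape[0] - 2 * radius
    cols = padded.shape[1] - 2 * radius
    out = np.empty((rows, cols), dtype=float)
    for y in range(rows):
        for x in range(cols):
            out[y, x] = window_fn(padded[y:y + size, x:x + size])
    return out

def apply_local(image, window_fn, radius=1):
    """Whole-image version: the caller supplies only the per-window function."""
    return _run_on_padded(np.pad(image, radius, mode="edge"), window_fn, radius)

def apply_local_striped(image, window_fn, radius=1, n_strips=4):
    """Same operation split into row strips, each carrying `radius` halo rows,
    as a stand-in for distributing the image over processor cells."""
    padded = np.pad(image, radius, mode="edge")
    bounds = np.linspace(0, image.shape[0], n_strips + 1).astype(int)
    strips = [
        _run_on_padded(padded[lo:hi + 2 * radius], window_fn, radius)
        for lo, hi in zip(bounds[:-1], bounds[1:])
    ]
    return np.vstack(strips)

rng = np.random.default_rng(1)
image = rng.uniform(size=(10, 7))
whole = apply_local(image, np.mean)            # 3x3 smoothing
striped = apply_local_striped(image, np.mean)  # same result, computed per strip
print(np.allclose(whole, striped))
```

The halo rows are what a real system would exchange between neighboring processors; the user-supplied window function never sees them as anything other than ordinary neighbors.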

A  successor  to  Apply  has  been  developed  that  allows 
both local and global image processing operations to be
written [12][13]. This new language, Adapt, supports all
Apply  operations  as  well  as  such  operations  as  histogram, 
image  warping,  two-dimensional  fast  Fourier  transform, 
and  connected  components.  The  language  has  been  im¬ 
plemented  on  iWarp,  and  is  currently  in  use  at  Carnegie 
Mellon  for  research  in  feature  extraction  for  a  vision 
algorithm compiler, spectrogram generation, and
multi-baseline stereo. In addition, an implementation of the
second  DARPA  Image  Understanding  benchmark  is  un¬ 
der  way. 


One  of  the  more  exciting  applications  of  Adapt  is  to 
overcome  the  image  I/O  bottleneck  in  the  current  con¬ 
figuration  of  iWarp.  At  present,  iWarp  is  attached  to  the 
VME  bus  of  a  Sun,  which  limits  the  I/O  that  can  be  done 
to  iWarp  to  a  few  megabytes  per  second  -  too  slow  for 
real  time  image  processing.  However,  recently  image 
compression  chips  and  boards  based  on  the  JPEG  image 
compression standard have come on the market, which
can  compress  images  at  video  rate  by  as  much  as  28:1 
without  loss  of  image  quality.  At  that  compression  rate, 
even  full  512x512  color  images  could  be  fed  across  the 
VME bus at 30 frames/second. We are presently
implementing the JPEG algorithm on iWarp, which will allow
images to be uncompressed and processed there, then
recompressed for display by another compressing frame
buffer. This is the first application of image
compression to overcoming the traditional parallel computer I/O
bottleneck  that  we  are  aware  of. 
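
The bandwidth argument can be verified with a short calculation. The Python sketch below assumes 24-bit color (3 bytes per pixel), which the text does not state, and counts a megabyte as 10**6 bytes.

```python
def video_rate_mb_per_s(width, height, bytes_per_pixel, frames_per_second):
    """Uncompressed video bandwidth in megabytes (10**6 bytes) per second."""
    return width * height * bytes_per_pixel * frames_per_second / 1e6

raw = video_rate_mb_per_s(512, 512, 3, 30)   # 24-bit color is an assumption
compressed = raw / 28                        # 28:1 compression ratio from the text
print(f"raw: {raw:.2f} MB/s, compressed: {compressed:.2f} MB/s")
# prints "raw: 23.59 MB/s, compressed: 0.84 MB/s"
```

Under these assumptions the uncompressed stream needs roughly 24 MB/s, well beyond a few megabytes per second, while the compressed stream needs less than 1 MB/s.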

The  development  of  the  JPEG  standard  has  created 
great  opportunities  for  the  application  of  imaging.  A 
new  emerging  image  processing  library  standard  may 
well have similar impact. For several years, a group
of image processing experts (primarily from companies
with  an  interest  in  imaging  such  as  Datacube,  Sun,  Mitre, 
Hewlett-Packard,  and  so  on)  has  been  defining  a  standard 
imaging  model  and  program  library  for  image  processing. 
This work, now called the Programmer’s Imaging
Kernel System, has recently been proposed as an ISO
standard. Carnegie Mellon and Intel are jointly developing
an  Adapt  implementation  of  this  standard.  It  is  believed 
that this standard will serve as more than just a program
library for iWarp; as iWarp can easily be implemented on
a variety of MIMD parallel computers (an
implementation on the Touchstone is underway) it should serve as a
basis  for  developing  architecture-independent  image  pro¬ 
cessing  systems  and  for  comparing  and  evaluating  new 
high performance computers for image processing.

4 VLSI Smart Sensor-based Range Finder

We  (Gross,  Carley  and  Kanade)  have  been  developing 
a  high-speed  light-stripe  range-imaging  system  based  on 
an  analog  VLSI  smart  sensor,  which  is  capable  of  ac¬ 
quiring  range  frames  at  rates  two  orders  of  magnitude 
better than currently available ranging methods [14][15].
At  the  heart  of  this  system  is  an  intelligent  VLSI  sensor 
which  uses  a  novel  cell-parallel  approach  to  stripe-based 
range imaging. The cell-parallel algorithm we employ is
practical  only  because  one  can  integrate  computation  and 
memory  functions  in  a  single  photoreceptive  cell. 

In  a  conventional  light-stripe  system,  a  complete  range 
map  is  obtained  via  the  step-and-repeat  process  of  pro¬ 
jecting  a  stripe,  grabbing  a  video  image,  extracting  the 
stripe  position  from  the  image,  and  stepping  the  stripe 
until  an  entire  scene  has  been  scanned.  Though  practical, 
the speed of sampling range data with this conventional
technique is severely limited, on the order of a second
per range map. Also, the time required to build a range
map in this fashion is proportional to the desired
horizontal resolution.

The  new  sensor  is  based  on  the  same  optical  and  geo¬ 
metrical  principle,  but  it  operates  somewhat  differently. 
In  contrast,  our  range-finder  acquires  range  data  while  a 
stripe  is  swept  continuously  over  the  scene.  A  special¬ 
ized  VLSI  sensor  consists  of  a  two-dimensional  array  of 
smart  photosensitive  cells,  each  of  which  has  circuitry 
that  detects  and  remembers  the  time  at  which  it  observes 
the  peak  incident  light  intensity  during  a  sweep.  A  given 
cell  predefines  a  unique  line  of  sight,  and  the  “timestamp” 
it  records  determines  a  particular  orientation  of  the  stripe. 
Thus, sensing elements, working in parallel, gather
information sufficient to extract a complete range map from
a  single  pass  of  the  light  stripe.  The  time  required  to 
build  the  range  map  is  independent  of  the  map’s  spatial 
resolution. 
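
The conversion from per-cell timestamps to range can be sketched as follows. The geometry is an assumption made for illustration: a pinhole camera at the origin looking along z, a stripe projector displaced by `baseline` along x, and a light plane whose angle from the optical axis grows linearly in time. The sweep rate, baseline, and field of view are invented values, and NumPy is assumed; the actual sensor calibration is not described in the text.

```python
import numpy as np

def simulate_timestamps(depth, u, baseline, theta0, omega):
    """Time at which the sweeping light plane reaches the point seen by each cell.
    depth: range along the optical axis; u: x/f for each cell's line of sight."""
    tan_theta = baseline / depth - u
    return (np.arctan(tan_theta) - theta0) / omega

def range_from_timestamps(t, u, baseline, theta0, omega):
    """Convert each cell's timestamp to a stripe angle, then triangulate:
    the line of sight x = u*z meets the light plane x = baseline - z*tan(theta)."""
    theta = theta0 + omega * t
    return baseline / (u + np.tan(theta))

rows, cols = 28, 32                              # array size of the prototype chip
rng = np.random.default_rng(2)
depth = rng.uniform(0.5, 1.0, size=(rows, cols)) # hypothetical scene, in meters
u = np.linspace(-0.2, 0.2, cols)[None, :]        # hypothetical field of view
baseline, theta0, omega = 0.3, 0.0, 700.0        # meters, radians, radians/second

t = simulate_timestamps(depth, u, baseline, theta0, omega)
recovered = range_from_timestamps(t, u, baseline, theta0, omega)
print(t.max(), np.allclose(recovered, depth))
```

Every cell is converted independently from its own timestamp, so a single sweep yields the whole range map, and adding columns to the array does not lengthen the acquisition.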

Analog signal processing is well suited for performing
the kinds of computations necessary to achieve
intelligent sensors. Integrated circuitry which performs
computations on analog values is small in area. The density
with which the sensing and processing task can be
performed within a given die area is thus enhanced. Analog
processing  circuits  do  not  generate  the  switching  noise 
associated  with  digital  implementations.  Finally,  sensed 
data  is  by  its  nature  analog.  Analog  processing  avoids 
the need to convert sensed data before computations can
take  place. 

Currently, a prototype cell-parallel range-imaging
system, based on a smart VLSI sensor, is operational. The
sensor in this prototype system is a full-custom integrated
circuit implemented in a 2 μm CMOS process. The chip
contains  896  smart  cells  arranged  in  a  28  x  32  array 
and  measures  7.8  mm  x  9.2  mm.  Within  each  smart  cell 
is  a  photoreceptor,  light-stripe  detection  circuitry,  and 
the means to record and read out range data samples.
The  prototype  range  imaging  sensor  and  system  acquires 
range images at rates up to 1,000 frames per second;
see figure 3. Repeatability of the range data has been
measured to be on the order of a percent.

A second generation light-stripe sensor
implementation is currently being tested. This new chip incorporates
several advantages over the first design. The die area of
smart  cells  in  the  new  chip  is  36%  smaller  than  that  of 
the  cells  of  the  first  generation  sensor.  Stripe  detection 
is  done  in  a  more  robust  manner  and  range  data  read¬ 
out  circuitry  has  been  simplified.  In  addition,  the  new 
cell provides a means to record the intensity of the
light-stripe measured when a range data sample is acquired. The
cell’s intensity output can be used to aid in the light-stripe
system  range  calibration  process.  Scene  reflectance  in¬ 
formation  can  also  be  derived  from  this  intensity  data. 

Figure 3: A 28x32 range image of a cup. This image can
be captured within 1 to 2 msec.

One area of sensor design methodology being pursued
lately involves moving circuitry out of the smart cells in
order to pack the photoreceptive areas more tightly. This
approach is motivated by the need to provide good optical
efficiency in a practical sensor. Imaging optics must be
designed to accurately illuminate the entire surface of the
device. However, the gaps between photosites containing
interleaved circuitry waste the light from the stripe that
falls there. Moving circuitry from the photosensitive
areas allows these gaps to be closed. The light collected by
a given optical system is therefore used more effectively.

Integrating sensing and processing on integrated
circuit substrates holds great promise for the development of
sophisticated data acquisition systems. Computation at the
point of sensing allows raw sensor data to be tailored to
the needs of higher level system requirements in a
parallel fashion. The ability to acquire data intelligently also
means  that  new  sensing  methodologies  can  be  developed. 
One  of  the  most  distinguishing  features  of  this  research  is 
that  it  is  not  just  parallel  implementation  of  known  algo¬ 
rithms  by  VLSI  technology  to  achieve  increased  speed, 
such  as  VLSI  chips  for  convolution.  Rather,  we  are 
demonstrating  that  integration  of  sensing  and  process¬ 
ing  can  make  possible  modifications  of  the  operational 
principles  of  information  acquisition  (in  our  case,  range 
imaging) which results in a qualitative improvement in
performance. 

5  Vision  for  Object  Recognition  and 
Manipulation 

One of the most important applications for machine
vision is robotic manipulation of objects. The CMU Image
Understanding group has been developing a method for
modeling shapes, sensors, and tasks, and an automated
method of constructing vision-guided object recognition
and manipulation systems by means of precompilation and
learning. 

5.1 Vision-Guided Sampling Using Deformable
Surfaces 

We  (Hebert,  Ikeuchi,  Delingette)  have  developed  a 
vision-guided manipulation system to collect small rock
samples in natural terrain. Our initial system used
superquadrics for shape representation and a simple
clamshell gripper for manipulation. Recently, we have
extended the system by using more accurate representations
and finer control of the manipulation [16] so as to deal
with complex scenes. We use free-form deformable
surfaces as the base representation for the object models,
implemented as a triangular mesh of typically five hundred
nodes. The models are constructed from surface features,
range,  surface  normals,  and  curvature  discontinuities. 

For more dexterous manipulation, we constructed a
three-finger  gripper  that  allows  for  finer  control  of  the 
grasping operations. Optimal grasping positions are
computed by analyzing the local shape of the object at the
three points of contact for every possible position of the
gripper.  The  evaluation  criterion  is  based  on  the  sum  of 
the  areas  of  contact  between  finger  and  object  at  the  three 
contact  points.  We  have  demonstrated  the  sampling  task 
in  complex  scenes. 

5.2 Vision Algorithm Compiler

The  Vision  Algorithm  Compiler  (VAC)  [17]  is  a  method¬ 
ology  that  we  have  been  developing  over  the  years  to 
provide an alternative to the traditional, expensive
hand-coding of model-based vision systems. Given a vision
task, a VAC starts with models of objects, sensors, and
processing techniques. Working off-line, a VAC
analyzes the models and automatically compiles a
recognition strategy, which is compiled into an executable
program  which  performs  the  task  on-line. 

During the past several years, work on the VAC has
progressed in several areas. We have worked on
modeling sensors [18], a frame-based geometric modeling
system  [19],  generation  of  optimal  classification  strate¬ 
gies  [20],  accurate  determination  of  object  position  and 
orientation  [21],  and  an  example  system  for  bin-picking 
tasks  [22]. 

Recently we (Ikeuchi, Kanade, and Sato) applied the
VAC methodology to the task of recognizing a specular
object in an optical image and an object in SAR images,
as well as determining its position and orientation [23].
Figure 4 shows the overall system. The figure consists
of sensor-independent modules (Aspect Generator,
Template Generator, Aspect Classifier, and Fine Matcher),
and a sensor-dependent module (Sensor Simulator). It
should be noted that most of the developed techniques
are common between specular objects and SAR images
except the sensor models.

Figure 4: Precompilation method for object recognition

In the off-line precompilation phase, images of the
object are synthesized from a representative set of
viewpoints that cover the entire viewing sphere. On the basis
of the visible features, the sample images are grouped
into aspects. Figure 5 shows the synthesized SAR
images for an airplane. Each aspect is characterized by a
matching template that models the appearance and
distribution of the specular features for that aspect. For each
aspect, a procedure is created and attached for
calculating the precise pose of the object once
an image has been classified as belonging to that aspect. In the
on-line recognition phase, an unknown image is input to the
system. According to the compiled strategy, the image
goes through two stages of recognition: aspect
classification and fine matching. Figure 6 shows an example
of the recognition result.

The system proved itself capable of localizing very
specular objects in optical images. The only change
required to deal with a different image type
was to replace the sensor simulator model.

5.3 Assembly Plan from Observation

Another self-programming capability that we (Ikeuchi,
Kang, Paul, Wheeler, Suehiro) have been working on is
a class of systems which "learn" or program themselves by
observing a human performing the same task; we call this an
assembly-plan-from-observation method. A human
performs assembly operations in front of the system's TV
camera. The system recognizes these assembly operations
and generates an assembly plan that lets it repeat the
same assembly operations using its robot arm.

Figure 5: Simulated SAR images of an airplane

Figure 6: Recognition result of an airplane SAR image by
a program compiled by the Vision Algorithm Compiler. The
five pictures indicate the five candidate aspects (rotation
angles) and the bottom figure shows the overlay of the
final recognized airplane rotation. Note that the dark
region corresponds to the shadow of the airplane.

This year, we have focused on a class of assembly
operations, such as put-on or insert-into. One of the
important purposes of such assembly operations is to achieve
face contacts between objects. We have examined
possible face contact relations among polyhedra and classified
them into nine equivalent groups. We further
examined the transitions which occur among these equivalent
groups and the assembly operations which cause such
transitions. The results are summarized in the
procedure tree, which associates transitions and necessary
assembly operations among the groups.

The system works as follows. It recognizes object
configurations at each step of the assembly operations and
then extracts face contact relations from these
configurations. By consulting the procedure tree, the system can
associate an extracted relation transition with the
operations necessary to achieve the transition. Once a class of
assembly operations is recovered from observation, the
system fills in the necessary motion parameters, such as where to
put or from which direction to insert, and so
completes the learning of the assembly plan. By repeating
this sequence, the system successfully generates the
sequence of assembly operations necessary to repeat the
assembly performed by the human [24]. Techniques to
deal with noisy measurements have also been developed
[25].

We plan to develop modules to recognize human
grasping strategies and global path planning from observing a
sequence of images.

6  Vision  for  Autonomous  Mobile  Robots 

Mobile robots are vital for tasks in reconnaissance,
exploration, and all missions to be carried out in remote
locations. Visual navigation, however, is quite challenging
because of the need for very robust sensing in
changeable outdoor conditions and the complexity of situation
assessment for a vehicle moving through the
environment. CMU has several major programs in vision for
mobile robot vehicles, including cross-country and
roadway terrestrial navigation and underwater navigation.

6.1 Unmanned Ground Vehicle

Work on the Navlab [26], including perception, planning,
and new vehicle construction, has now been grouped into
a single project as part of DARPA's new UGV program.
We (Thorpe and others) have a new vehicle, the Navlab
II, which is a converted HMMWV. Perception algorithms
running on the Navlab and Navlab II include road
following, object recognition, and cross-country traversal.
These vehicles navigate using sonar, giga-hertz
radar, and the ERIM laser rangefinder. We have
recently demonstrated steering around obstacles and
tracking guard rails and parked cars. Work on cross-country
mobility has concentrated on achieving higher speeds
through careful pipelining of the processing, and
incorporating detailed models of vehicle control and
kinematics. The best runs to date have achieved speeds of over 6
mph on moderate off-road terrain. EDDIE, the toolkit for
building software architectures, and the Annotated Map
data management system have been distributed to
Caterpillar, University of Massachusetts, JPL, Florida Atlantic
University, and other sites.

Pomerleau developed ALVINN, a neural net road
following system [27], and has used it to drive the Navlab
II at up to 55 mph on highways, and for a continuous
distance of over 21 miles. The networks are trained by
watching a human drive, with synthetic noise added to
the training examples to decrease sensitivity to features
such as other vehicles and guard rails. ALVINN has also
been modified to produce confidence measures in the
output of the network [28]. This will allow several nets to be
run in parallel, for instance nets trained on different types
of road, with the output from the most confident net used
for driving.
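
The arbitration scheme described above can be sketched in a few lines. This is a minimal illustration under stated assumptions, not ALVINN's implementation: the network names, the steering values, and the form of the confidence measure are all hypothetical, and the networks themselves are reduced to precomputed outputs.

```python
from typing import List, Tuple

# (net_name, steering_command, confidence); all three fields are
# hypothetical stand-ins for whatever the real networks report.
NetOutput = Tuple[str, float, float]


def arbitrate(outputs: List[NetOutput]) -> Tuple[str, float]:
    """Return the name and steering command of the most confident net."""
    if not outputs:
        raise ValueError("need at least one network output")
    name, steering, _confidence = max(outputs, key=lambda o: o[2])
    return name, steering


# Illustrative values only; these are not measurements from ALVINN.
outputs = [
    ("one_lane_dirt", -0.10, 0.35),
    ("two_lane_paved", 0.05, 0.90),
    ("four_lane_highway", 0.20, 0.60),
]
chosen_net, steering = arbitrate(outputs)
print(chosen_net, steering)
```

Selecting a single winner keeps the steering command consistent with one road model; blending the outputs of several nets would be a different design choice that the text does not describe.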

Road  following  using  symbolic  feature  tracking  in 
YARF  (Kluge)  now  incorporates  robust  statistics  and 
road  models.  YARF  tracks  features  such  as  yellow  center 
lines  and  white  edge  lines.  Detected  feature  locations  are 
recorded  in  a  local  map.  After  each  image  is  processed, 
all  the  features  in  the  vehicle’s  current  vicinity  are  used 
to  update  the  current  model  of  road  location,  orientation, 
and curvature. Robust statistics approaches to weighting
features  are  used  to  reduce  sensitivity  to  outliers,  and  to 
give  indications  of  possible  intersections  or  changes  in 
road  structure.  YARF  routinely  drives  the  Navlab  along 
city  streets  near  the  CMU  campus. 
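
To make the role of robust weighting concrete, the sketch below fits a straight line to feature locations with iteratively reweighted least squares and Huber weights. It is an assumption-laden illustration rather than YARF's method: the straight-line model stands in for the road model (location, orientation, curvature), and the threshold `k`, the function names, and the sample data are hypothetical.

```python
from typing import List, Tuple

Point = Tuple[float, float]  # (distance along road, lateral offset)


def weighted_line_fit(pts: List[Point], w: List[float]) -> Tuple[float, float]:
    """Weighted least-squares fit of y = a + b * x; returns (a, b)."""
    sw = sum(w)
    sx = sum(wi * x for wi, (x, _) in zip(w, pts))
    sy = sum(wi * y for wi, (_, y) in zip(w, pts))
    sxx = sum(wi * x * x for wi, (x, _) in zip(w, pts))
    sxy = sum(wi * x * y for wi, (x, y) in zip(w, pts))
    det = sw * sxx - sx * sx
    b = (sw * sxy - sx * sy) / det
    a = (sy - b * sx) / sw
    return a, b


def huber_line_fit(pts: List[Point], k: float = 1.0, iterations: int = 20):
    """Iteratively reweighted least squares with Huber weights.

    Residuals within k keep full weight; larger residuals are
    down-weighted in proportion to their size (k is an assumed threshold).
    """
    weights = [1.0] * len(pts)
    a, b = weighted_line_fit(pts, weights)
    for _ in range(iterations):
        residuals = [y - (a + b * x) for x, y in pts]
        weights = [1.0 if abs(r) <= k else k / abs(r) for r in residuals]
        a, b = weighted_line_fit(pts, weights)
    return a, b, weights


# Ten features on the line y = 1 + 2x plus one gross outlier.
pts = [(float(x), 1.0 + 2.0 * x) for x in range(10)] + [(5.0, 100.0)]
ols_a, ols_b = weighted_line_fit(pts, [1.0] * len(pts))
rob_a, rob_b, weights = huber_line_fit(pts)
print("ordinary fit:", ols_a, ols_b)
print("robust fit:  ", rob_a, rob_b)
```

The final weights double as a diagnostic: a feature whose weight stays near zero disagrees with the current model, which is the kind of signal the text mentions for flagging possible intersections or changes in road structure.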

6.2 Planning for Robot Perception

Recent advances in mobile robots have demonstrated
systems that can follow a simple lane or find a known
landmark. The next step is to develop robots capable of
driving in traffic, which requires interpreting much more
complex road configurations, signs and signals, and
interacting with other moving traffic. We (Reece and Shafer)
are studying this problem, which we call tactical
driving, with the Pharos traffic simulation program and the
Ulysses computational driving program [Reece91a]. We
began this work by developing Pharos, a traffic
simulation program that models a network of streets. The
input database describes the exact geometry of each lane
of each street, and the details of each sign, signal and
lane marking. Into this network of streets, Pharos injects
vehicles we call "zombies" that turn randomly at each
intersection, but follow actual driving laws and vehicle
kinematics models as they travel. Ten times per (simulated)
second, each vehicle makes its driving decisions
by examining the data
structures of Pharos to determine the lane of travel,
constraints imposed by vehicles and traffic control devices,
and intended maneuvers such as overtaking and turning.

To study robot driving, we developed a robot driving
program called Ulysses that controls one vehicle in the
simulated world of Pharos [Reece91a]. Whenever the
"robot" vehicle is queried, a message is sent from Pharos
to Ulysses to ask for the next steering and speed
command. Ulysses carries out a decision-making process
that simulates how a (robot) driver would decide what to
do. In this process, Ulysses sends "perceptual requests"
to Pharos, such as asking for the upcoming lane
geometry or the distance and speed of the vehicle in front of
the robot. Pharos answers these requests just as a robot
perception system would do. When it has run through its
decision-making process, Ulysses reports its computed
speed and steering commands to Pharos, which carries
out that action and continues its simulation of traffic.
The decision-making process of Ulysses is completely
feasible for a robot vehicle.

Our chief finding to date concerns the perceptual cost
of driving. Traditional AI planners are based on the
notion that the perception system continuously provides a
complete model of the world, within which the planner
can find whatever it is interested in. However, the cost
of finding all possible cars, signs, road markings, and
other objects, ten times per second, while driving, is
stupendous. It isn't even vaguely conceivable that a real
robot driver would be able to provide such a
continuously updated model of everything in the environment.
Instead, the Ulysses program makes requests for the
specific things it needs to know at each tick of the clock,
in order to make the specific decisions it needs to make.
In addition, since the perception requests are sequential,
each one can constrain the geometry of the later requests.
We carefully model the "perceptual cost" for Pharos to
answer each request made by Ulysses, and have shown
that this demand-driven model of perception is several
orders of magnitude lower in cost than the complete world
model assumed by traditional AI planners [Reece91b,
Reece91c].
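
The accounting behind this comparison can be illustrated with a toy calculation. Everything in the sketch is hypothetical: the query names, the per-query costs (in arbitrary units), and the two requests chosen for the tick are invented for illustration and are not taken from Pharos or Ulysses, so the resulting ratio says nothing about the savings actually reported.

```python
from typing import Dict, List

# Hypothetical cost, in arbitrary units, of perceiving each kind of item
# during one 0.1 s decision tick. These numbers are invented.
PERCEPTION_COST: Dict[str, float] = {
    "lane_geometry_ahead": 5.0,
    "lead_vehicle": 8.0,
    "next_sign": 6.0,
    "traffic_signal": 6.0,
    "left_lane_vehicles": 12.0,
    "right_lane_vehicles": 12.0,
    "crossing_traffic": 20.0,
    "road_markings": 10.0,
}


def complete_model_cost(costs: Dict[str, float]) -> float:
    """Cost of refreshing every item, as a complete world model requires."""
    return sum(costs.values())


def demand_driven_cost(costs: Dict[str, float], requests: List[str]) -> float:
    """Cost of answering only the requests issued during this tick."""
    return sum(costs[name] for name in requests)


# On a straight road with a car ahead, assume the driver needs only these.
requests = ["lane_geometry_ahead", "lead_vehicle"]
full = complete_model_cost(PERCEPTION_COST)
demand = demand_driven_cost(PERCEPTION_COST, requests)
print(full, demand)
```

A fuller model would also let each answered request shrink the search region, and hence the cost, of the requests that follow it, as the text notes.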

6.3 Underwater Perception

We (Hebert and Langer) have been working on
underwater terrain modeling by sonar imaging for autonomous
underwater vehicles (AUV). Recovering terrain shape
from sonar data is complicated by the ambiguity inherent
to sonar data, in which the effects of range and intensity
are complicated by the shape, reflectivity, and roughness
of the observed target, and by the transmission
characteristics of sonar. Recovering shape from sonar data is
therefore a highly underconstrained problem, much like
shape recovery from visual data. Our approach is to use a
reflection model that relates the observed range and
intensity to all the unknown surface parameters. Using initial
values of those parameters, shapes are reconstructed first
in the neighborhood of high-intensity points, which
correspond to more reliable data. An iterative process is
then used to revise both shape and surface parameters.
The resulting sparse map is interpolated using a standard
regularization-based interpolation algorithm.

We have tested the algorithm on a number of sonar
images collected by Florida Atlantic University (FAU) [29,
30]. The results show that surface maps are correctly
recovered even when very different types of terrain, such
as rocks and sand, are present in the scene. This departs
from previous algorithms, which implicitly assume that
the bottom surface is homogeneous and flat. Our
continuing research is in merging multiple sonar images and
quantifying the performance of terrain
reconstruction.
Abstract¹

This report summarizes progress in image
understanding research at the University of
Massachusetts over the past year. Many of the
individual efforts discussed in this paper are further
developed in other papers in these proceedings. The
summary is organized into several areas:

1. Mobile Robot Navigation

2.  Image  Sequence  Processing 

3.  Interpretation  of  Static  Scenes 

4.  Image  Understanding  Architecture 

The  research  program  in  computer  vision  at  UMass 
has  as  one  of  its  goals  the  integration  of  a  diverse  set 
of  research  efforts  into  a  system  that  is  ultimately 
intended  to  achieve  real-time  image  interpretation  in 
a  variety  of  vision  applications. 


1. Mobile Robot Navigation

The UMass mobile robot navigation project continues
to integrate a number of different algorithms with the
goal of achieving robust landmark-based navigation.
The primary component technologies are described in
several sections of this paper. Sections 1.1 and 1.2
discuss the use of landmarks derived from a partial
geometric model of the environment to determine the
pose of the vehicle. Section 2.2 outlines one
mechanism by which an initial partial model of the
environment might be automatically acquired from a
motion sequence. Section 3.1 discusses techniques for
learning recognition strategies within the Schema
object recognition system, which is capable of
identifying naturally occurring objects. Our goal is to
integrate natural objects into the landmark-based
navigation system for outdoor navigation and to
embed the results of this research into the planning
and control framework developed by Fennema [18],
which can effectively utilize landmarks at a number
of levels, including low-level perceptual servoing for
producing accurate motor actions, and plan-level
perceptual servoing for maintaining adherence to a
navigation plan.

¹This research has been supported in part by the Defense
Advanced Research Projects Agency under TACOM
contract number DAAE07-91-C-R035, HDL contract number
DAAL02-91-K-0047, and ETL contract number DACA76-89-C-0016,
by the National Science Foundation under grants
CDA-8922572 and IRI-9113690, and by RADC under
contract number F30602-91-C-0037.

1.1.  Automated  Model  Acquisition  and 
Extension 

We are continuing our efforts towards the robust
determination of pose (location and orientation) of the
vehicle in a partially modelled 3D environment via
the constraints derived from recognized 3D
landmarks [25].

Recently, Kumar has performed experiments on
model extension (Kumar and Hanson [25, 26]) using
basic techniques from pose determination. Points or
lines whose 3D positions are known are tracked
across frames using the line tracking algorithm of
Williams [39], or the point-tracking algorithm of
Sawhney [33]. From these points or lines, the relative
orientation of pairs of frames is determined. The
depths of unmodelled points, which are also tracked
over the sequence, are then computed using
triangulation. The sensitivity of the acquired depth to
errors in the image center has also been investigated.
In experiments using two image sequences for which
ground truth is available, the 3D positions of the
unmodelled points were recovered with an average
error in depth of 0.25% and 1.3%, respectively. The error for the
second case is larger than for the first in part due to
the larger field of view (40° compared to 22°), which
increases the sensitivity to errors in the location of the
image center. Given that there must be some error in
the original 3D positions of landmarks, recovery of
new 3D points to this accuracy is a surprising and
dramatic result.
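
The triangulation step can be illustrated with the midpoint method for two viewing rays. This is a generic sketch, not the procedure used in the experiments: it assumes the camera centers and ray directions are already expressed in a common world frame (that is, the relative orientation has been solved), and the function and variable names are hypothetical.

```python
from typing import Tuple

Vec = Tuple[float, float, float]


def dot(a: Vec, b: Vec) -> float:
    return a[0] * b[0] + a[1] * b[1] + a[2] * b[2]


def triangulate_midpoint(c1: Vec, d1: Vec, c2: Vec, d2: Vec) -> Vec:
    """Midpoint of the shortest segment joining rays c1 + s*d1 and c2 + t*d2."""
    w0 = (c1[0] - c2[0], c1[1] - c2[1], c1[2] - c2[2])
    a, b, c = dot(d1, d1), dot(d1, d2), dot(d2, d2)
    d, e = dot(d1, w0), dot(d2, w0)
    denom = a * c - b * b
    if abs(denom) < 1e-12:
        raise ValueError("rays are parallel; depth is undefined")
    s = (b * e - c * d) / denom  # parameter of closest point on ray 1
    t = (a * e - b * d) / denom  # parameter of closest point on ray 2
    p1 = tuple(c1[i] + s * d1[i] for i in range(3))
    p2 = tuple(c2[i] + t * d2[i] for i in range(3))
    return tuple((p1[i] + p2[i]) / 2.0 for i in range(3))


# Two cameras one unit apart, both viewing the point (1, 2, 10).
c1, c2 = (0.0, 0.0, 0.0), (1.0, 0.0, 0.0)
d1 = (1.0, 2.0, 10.0)  # ray from c1 toward the point
d2 = (0.0, 2.0, 10.0)  # ray from c2 toward the point
point = triangulate_midpoint(c1, d1, c2, d2)
print(point)
```

With noisy image measurements the two rays no longer intersect, and the midpoint gives one reasonable estimate; the relative depth error quoted in the text would then be the difference between the estimated and true depth, divided by the true depth.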

1.2.  Landmark  Recognition 

Beveridge  [3]  continues  to  develop  his  model-directed 
matching  algorithms,  which  are  being  applied  to 
landmark-based  robot  navigation.  His  previous 
research  used  a  priori  knowledge  of  an  approximate 
robot pose to project landmarks in a 3D model into
the  image  plane.  This  allows  2D  model  lines  to  be 
matched  with  2D  data  lines  extracted  from  the 
sensory  data  by  minimizing  the  error  in  the  spatial 
alignment  under  rotation,  translation,  and  scaling 
transformations  in  the  plane.  The  resultant 
correspondence  between  landmarks  and  image 
features  for  the  best  2D-to-2D  match  is  used  to 
recover  the  actual  3D  pose  of  the  robot.  This  system 
has  been  effectively  applied  to  pose  recovery  in  the 
UMass  mobile  robot  project.  However,  this  system 
may  not  be  able  to  recover  from  the  2D  distortion 
produced  when  the  projection  is  done  from  an 
incorrect  sensor  pose. 

A  new  3D-to-2D  model  matching  system  has  been 
developed  to  match  3D  landmarks  directly  to  2D 
image  features.  During  the  iterative  matching 
process,  new  3D  transformations  between  the  world 
and  the  camera  are  computed,  and  landmark  features 
are  re-projected  into  the  image.  This  accounts  for 
perspective  distortion  during  the  search,  and 
therefore  allows  recovery  of  the  robot's  true  position 
more  reliably  than  the  2D-to-2D  matching  system. 
However,  the  original  system  usually  improves  the 
initial  erroneous  pose  estimate,  and  does  so  in 
roughly  one  fifth  the  time  required  by  the  second. 

2. Image Sequence Processing

2.1. The p-Field: A Computational Model
for Binocular Motion Processing

Balasubramanyam  [1,  2]  is  developing  an  integrated 
framework  for  stereo  and  motion  analysis.  Given  a 
binocular  camera  system  moving  through  a  static 
environment,  it  is  possible  to  obtain  a  three- 
dimensional  field  of  vectors,  where  each  vector  is 
parallel  to  the  induced  relative  3D  motion  of  an 
imaged  point  and  scaled  in  magnitude  by  the  depth  of 
that point. This 3D vector field, referred to as the p-
field,  is  derived  from  optic  flow  and  disparity,  over  a 
pair  of  stereo  frames  at  successive  time  instants.  The 
behavior  of  the  p-field  is  examined  for  specific  cases 
of  restricted  motion,  as  well  as  for  general  motion.  In 
particular,  the  behavior  of  the  p-field  under 
translational  vehicle  motion  promises  to  be  more 
stable  under  small  vehicle  rotations  than  the  behavior 
of  the  flow  field.  We  expect  that  the  p-field  will  allow 
more  robust  recovery  of  the  sensor  motion 
parameters,  and  tracking  of  3D  points  through  a 
sequence  of  images.  Ultimately,  this  analysis  should 
provide  more  robust  recovery  of  3D  environmental 
information  than  independent  stereo  and  motion 
analyses  whose  results  are  combined  after  the  fact. 


2.2.  Reconstruction  of  Shallow  Structures 

In  many  man-made  environments,  obstacles  in  the 
path  of  a  mobile  robot  can  be  characterized  as  shallow, 
that  is,  they  have  relatively  small  extent  in  depth 
compared  to  the  distance  from  the  camera.  Sawhney 
[32] (these proceedings) presents a framework for
segmenting  shallow  structures  from  the  background 
over  a  sequence  of  images.  Shallowness  is  first 
quantified  in  terms  of  affine  describability.  This  is 
embedded  in  a  tracking  system  within  which 
hypothesized  model  structures  undergo  a  cycle  of 
prediction and model-matching. Structures emerge
either  as  shallow  or  non-shallow  based  on  their  affine 
trackability.  This  work  rejects  continuity  heuristics  for 
purely  image  motion  in  favor  of  temporal  continuity 
defined  as  the  consistency  of  generic  3D  models, 
namely  shallow  structures.  This  work  will  be 
applicable  to  obstacle  avoidance  and  model 
acquisition  by  a  mobile  robot.  In  two  indoor 
experiments,  object  structure  represented  as  frontal 
planes was recovered to a depth accuracy in the range
of 2-5%.

2.3. Multi-Frame Structure from Motion

Recovering  structure  from  motion,  even  using 
information from multiple image frames, is difficult,
partly  because  motion  error  can  introduce  large, 
correlated  errors  in  the  structure  estimate.  Thomas 
and  Oliensis  [35]  (these  proceedings)  propose  a 
method  for  recursively  recovering  structure  from 
motion  that  can  deal  with  this  problem.  The 
algorithm  is  based  on  the  observation  that  errors  in 
the  motion  produce  cross-correlations  in  the  structure 
errors  across  the  3D  points.  Conversely,  these 
correlations  are  the  record  of  the  motion  error.  Thus, 
to  explicitly  incorporate  motion  error  in  a  recursive 
algorithm,  a  record  of  the  correlations  in  the  structure 
errors  must  be  maintained  and  updated. 

Input  for  the  algorithm  consists  of  point 
correspondences  tracked  over  many  image  frames. 
Horn's  relative  orientation  algorithm  [23]  is  used  to 
provide  two-frame  structure  estimates.  For  this 
algorithm,  a  somewhat  complex  error  analysis  is  used 
to  estimate  the  expected  structure  errors,  including 
the  cross-correlations.  The  fusing  of  the  new  structure 
estimate  with  the  old  is  done  using  a  standard 
Kalman  filter,  but  with  the  cross-correlations  taken 
into  account.  The  results  on  synthetic  images  show 
that  the  structure  estimates  improve  over  time  as 
expected;  encouraging  results  on  real  images  are  also 
reported  [30, 34, 35]. 
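
The value of carrying cross-correlations through the filter can be seen in a two-point example. The sketch is a generic Kalman measurement update with an identity measurement model, not the algorithm of Thomas and Oliensis; the depths, covariances, and function names are illustrative assumptions.

```python
from typing import List, Tuple

Mat2 = List[List[float]]


def matmul(a: Mat2, b: Mat2) -> Mat2:
    return [[sum(a[i][k] * b[k][j] for k in range(2)) for j in range(2)]
            for i in range(2)]


def inv2(m: Mat2) -> Mat2:
    det = m[0][0] * m[1][1] - m[0][1] * m[1][0]
    return [[m[1][1] / det, -m[0][1] / det],
            [-m[1][0] / det, m[0][0] / det]]


def fuse(x: List[float], P: Mat2, z: List[float], R: Mat2) -> Tuple[List[float], Mat2]:
    """Kalman update of prior (x, P) with a direct measurement (z, R)."""
    S = [[P[i][j] + R[i][j] for j in range(2)] for i in range(2)]
    K = matmul(P, inv2(S))  # gain, using the full prior covariance
    innov = [z[0] - x[0], z[1] - x[1]]
    x_new = [x[i] + K[i][0] * innov[0] + K[i][1] * innov[1] for i in range(2)]
    I_K = [[1.0 - K[0][0], -K[0][1]], [-K[1][0], 1.0 - K[1][1]]]
    return x_new, matmul(I_K, P)


# Prior depths of two points; a new estimate is informative about point 1
# only (point 2's measurement variance is made very large).
x = [10.0, 20.0]
z = [12.0, 20.0]
R = [[0.25, 0.0], [0.0, 1.0e6]]
P_correlated = [[1.0, 0.8], [0.8, 1.0]]  # errors share a common motion error
P_diagonal = [[1.0, 0.0], [0.0, 1.0]]    # cross-correlation discarded

x_corr, P_corr = fuse(x, P_correlated, z, R)
x_diag, P_diag = fuse(x, P_diagonal, z, R)
print(x_corr, x_diag)
```

With the correlated prior, the correction to point 1 also shifts point 2, because both errors are attributed in part to the same motion error; with the diagonal prior, point 2 is left essentially unchanged.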
3. Interpretation of Static Scenes

3.1.  Learning  3D  Recognition  Strategies 

In  an  effort  to  automate  aspects  of  model  acquisition 
for image interpretation, Draper [13-16] has been
examining  the  role  of  learning  in  model-based  vision. 
In  particular,  he  is  addressing  the  automated 
construction  of  robust  control  strategies  responsible 
for creating 'instance-of' relations during
interpretation.  Given  a  set  of  generic  parameterized 
knowledge  sources,  the  goal  is  to  construct  the 
Schema Learning System (SLS) which will learn (from
a  set  of  training  images)  a  recognition  strategy  for  a 
particular  object  class  that  minimizes  the  cost  of 
recognition,  subject  to  a  set  of  accuracy  constraints 
supplied  by  the  user.  Recognition  strategies  are 
represented  by  recognition  graphs,  which  are  similar 
in  many  ways  to  decision  trees.  Unlike  decision  trees, 
however,  recognition  graphs  direct  hypothesis 
creation  as  well  as  hypothesis  verification.  Object- 
specific  strategies  are  learned  in  a  two  step  process 
[15]  (these  proceedings).  The  first  step  involves 
learning  which  hypotheses  should  be  generated.  The 
second learns how to verify them efficiently. Thus,
the task of SLS is to learn control and evidence
combination  strategies,  not  new  models  or  knowledge 
sources.  Initial  experimental  results  demonstrate  the 
potential  of  this  approach. 

3.2.  View  Description  Networks 

Model-directed  object  recognition  becomes  much 
more  difficult  when  the  viewpoint  of  the  three- 
dimensional object is unknown [6-8]. Burns [5] (these
proceedings)  describes  a  system  designed  to 
effectively  match  a  single  2D  image  of  a  potentially 
cluttered  scene  to  a  library  containing  multiple 
polyhedral  objects  and  demonstrates  its  performance 
on  several  scenes.  This  approach  to  recognition  can 
be  characterized  by  three  general  ideas.  Description 
networks  optimize  the  search  for  matches  to  objects 
from  a  multiple  object  library  by  organizing 
information  about  the  objects  into  a  single  network 
representation.  View  descriptions  contain  organized 
descriptions  of  the  projections  of  the  objects  from 
views  for  which  the  objects  are  expected  to  be  seen; 
these are used during the match phase. Finally, the
correctness of view description matches is verified by
estimating the 3D pose of the associated object,
evaluating the estimation error, and searching for
additional assignments between object and image
features, given the estimated pose.


An  important  part  of  this  approach  is  the  design  of 
the  recognition  phase  of  the  system:  given  a  compiled 
view  description  network  for  an  object  library,  the 
system  must  direct  an  effective  search  for  the  correct 
matches  in  cluttered  images.  This  implies  that  three 
key  problems  are  addressed:  recognition  from  an 
unknown  view,  the  indexing  problem  (selection  of  a 
few  high  probability  candidates  based  on  key 
features),  and  model  based  incremental  recognition 
among  the  candidate  competing  hypotheses  based  on 
partial  matches. 

3.3. Model Extension Using Projective
Invariants

Collins [9, 11] has developed a new approach to
modeling  man-made  environments  based  on  results 
from  projective  geometry.  It  is  well  known  that  the 
images  of  coplanar  points  and  lines  under  rigid-body 
motion  are  related  by  a  linear  transformation  in 
homogeneous  coordinates.  Given  four  known 
reference  points  or  lines  on  the  plane,  the  positions  of 
all  other  points/lines  on  that  plane  can  be 
reconstructed,  regardless  of  camera  position  or 
intrinsic  calibration  parameters.  Collins  [10]  (these 
proceedings)  has  extended  these  results;  it  is  shown 
that  it  is  possible  to  obtain  partial  and  in  some  cases 
complete  3D  reconstructions  of  those  points  and  lines 
lying  outside  the  reference  plane.  The  main  results 
are  that  with  a  calibrated  camera,  one  reference  plane 
tracked  through  two  images  is  enough  for  complete 
reconstruction  of  the  environment,  while  for  an 
uncalibrated  camera  it  is  sufficient  to  have  two 
reference  planes  tracked  through  two  images.  The 
effects  of  noise  in  the  observations  are  considered, 
resulting  in  a  general  framework  for  data  fusion  in 
projective  space. 
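
The four-point result for coplanar features can be sketched as follows: four reference correspondences determine the plane-to-plane linear map in homogeneous coordinates (a homography), which then transfers any other point on that plane. This is a textbook construction offered under stated assumptions, not Collins's implementation; it normalizes the last matrix entry to 1, and the reference points and the matrix `H_TRUE` used to generate test data are invented.

```python
from typing import List, Sequence, Tuple

Pt = Tuple[float, float]


def solve(A: List[List[float]], b: List[float]) -> List[float]:
    """Solve A x = b by Gauss-Jordan elimination with partial pivoting."""
    n = len(b)
    M = [list(row) + [bi] for row, bi in zip(A, b)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(M[r][col]))
        if abs(M[pivot][col]) < 1e-12:
            raise ValueError("degenerate reference configuration")
        M[col], M[pivot] = M[pivot], M[col]
        for r in range(n):
            if r != col:
                f = M[r][col] / M[col][col]
                for c in range(col, n + 1):
                    M[r][c] -= f * M[col][c]
    return [M[i][n] / M[i][i] for i in range(n)]


def homography(src: Sequence[Pt], dst: Sequence[Pt]) -> List[List[float]]:
    """3x3 map taking four src points to four dst points (last entry = 1)."""
    A, b = [], []
    for (x, y), (u, v) in zip(src, dst):
        A.append([x, y, 1.0, 0.0, 0.0, 0.0, -u * x, -u * y]); b.append(u)
        A.append([0.0, 0.0, 0.0, x, y, 1.0, -v * x, -v * y]); b.append(v)
    h = solve(A, b)
    return [h[0:3], h[3:6], [h[6], h[7], 1.0]]


def apply_h(H: List[List[float]], p: Pt) -> Pt:
    x, y = p
    w = H[2][0] * x + H[2][1] * y + H[2][2]
    return ((H[0][0] * x + H[0][1] * y + H[0][2]) / w,
            (H[1][0] * x + H[1][1] * y + H[1][2]) / w)


# Synthetic ground truth used only to generate consistent test data.
H_TRUE = [[1.0, 0.2, 3.0], [0.1, 1.5, -2.0], [0.001, 0.002, 1.0]]
src = [(0.0, 0.0), (100.0, 0.0), (100.0, 100.0), (0.0, 100.0)]
dst = [apply_h(H_TRUE, p) for p in src]

H = homography(src, dst)
transferred = apply_h(H, (40.0, 70.0))  # a fifth point on the same plane
expected = apply_h(H_TRUE, (40.0, 70.0))
print(transferred, expected)
```

No camera position or intrinsic calibration appears anywhere in the computation, which is the property the text emphasizes. Points off the reference plane are not handled by this sketch.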

3.4.  Shape  from  Shading  Revisited 

Shape from shading has traditionally been considered
an ill-posed problem. However, in recent work,
Oliensis [28, 29] has demonstrated that the solutions
to shape from shading are often well-determined,
with little or no ambiguity. For the case of
illumination that is symmetric around the viewing
direction (i.e. the light source is behind the camera), it
was shown in [27] that there is in general a unique
solution to shape from shading. This proof is valid
for general Lambertian objects (without holes), and is
the first proof that the problem of shape from shading
can be well-posed in general. These arguments were
extended to the case of general illumination direction
in [29], where it was demonstrated that, in this case
also, the solutions to shape from shading are strongly
constrained over much of the image.
Recently, Dupuis and Oliensis [17] (these
proceedings) have developed a new approach to shape
from shading, based on a connection with a calculus
of variations/optimal control problem, and have
demonstrated its performance on reasonably complex
images. The approach leads naturally to an algorithm
for shape reconstruction that is simple, fast, provably
convergent, and, in many cases, provably convergent
to the correct solution. The algorithm is robust
against  noise  and,  in  contrast  with  standard 
variational  algorithms,  does  not  require 
regularization.  An  explicit  representation  is  given  for 
the  surface:  its  height  is  expressed  as  the  minimal  cost 
for  an  optimally  controlled  trajectory. 

4. The Image Understanding Architecture

The IUA project continues as a three-way
collaboration between UMass, Amerinex Artificial
Intelligence, Inc. (AAI), and the Hughes Research
Labs (HRL); this coordinated effort is summarized in
[36] in these proceedings. The first IUA prototype
hardware has been assembled, tested, and is almost
fully functional. The low-level processor for the
second generation IUA and its controller have been
designed, and a software simulator has been built for
the controller and low-level array. The design for the
intermediate level has just been completed.

UMass  has  developed  a  SIMD  version  of  the 
Wormhole  Routing  technique  that  takes  advantage  of 
the  Coterie  Network  in  our  low-level  processor,  in 
order  to  provide  general  permutation  routing 
capability  roughly  equivalent  in  performance  to  that 
found  in  the  Connection  Machine,  without  the  need 
for  special  hardware.  The  significance  of  this  routing 
capability  is  that  it  allows  us  to  build  very  compact, 
low-cost,  mesh-based  parallel  processors,  of 
reasonable  size  (up  to  about  a  million  processors)  that 
can  perform  general  data-parallel  processing. 

Our experience with the Coterie Network has resulted
in the description of a more general programming
paradigm, called multi-associative processing. In turn,
the capabilities of the Coterie Network have been
exploited for directly and indirectly supporting
multi-associativity.

Consideration of a set of issues that must be
addressed by a parallel symbolic database for
intermediate-level processing is in progress. These
include the problems of managing data from
continuous streams of images, controlling persistence
of the data, representations of the data, distribution of
the data and maintenance of its consistency, and
real-time systems issues.

A  C++  class  library  has  been  implemented  for  an 
image  plane  data  type  that  supports  the  development 
of  low-level  vision  operations  that  are  easily 
implemented  by  a  parallel  processor.  This  approach 
to  parallel  programming  has  the  advantage  that  it 
does  not  involve  a  non-standard  language  --  it  is 
merely  a  new  object  class  written  in  C++.  The  only 
difference  between  a  sequential  implementation  and  a 
parallel  implementation  is  the  run-time  library 
selected  for  linking. 

5.  Ongoing  and  New  Work 

5.1.  Multi-Sensor  Dextrous  Manipulation 

Grupen and Weiss [20, 21] are collaborating on a
multi-sensor approach to dextrous manipulation in a
robot workcell. Models of objects in the environment
are constructed incrementally using an active sensing
paradigm in order to support the ability to form stable
grasp configurations with a Utah-MIT hand. The
system consists of a camera mounted on one robot
arm and the hand mounted on another. The
transformation from the camera coordinate system to
the hand coordinate system is computed using the
pose refinement algorithm developed in [24].

One  of  the  major  issues  with  respect  to  modeling  is 
fusing  information  from  multiple  views  and  different 
sensors.  The  particular  application  involves  the 
integration  of  haptic  and  visual  data  to  produce  a 
triangulation  of  the  surface  of  an  object  to  be  grasped. 
Haptic  sensing  here  is  the  determination  of  the  point 
of  contact  of  the  hand  with  the  object  based  on  force 
measurements.  This  gives  a  very  rough  estimate  of 
the position and normal to the surface. The
Giblin-Weiss [19] algorithm provides estimates of position,
surface normal and curvature from a sequence of
images with known camera motion.

5.2.  Figural  Completion  from  Principles  of 
Perceptual  Organization 

Visual  psychology  provides  strong  evidence  that 
generic  knowledge  of  surfaces  and  occlusion  is 
exploited  very  early  in  the  perceptual  grouping  of 
image  contours.  Previous  work  by  Williams  [38] 
showed  how  generic  knowledge  of  this  sort  could  be 
captured  as  integer  linear  constraints  and  how  the 
problem  of  segmenting  simple  scenes  into  (potentially 
overlapping)  surfaces  could  be  cast  as  an  integer 
linear programming problem. The first system built
along these lines demonstrated the completion of
gaps in straight-sided figures, such as those caused by
occlusion of one (opaque) surface by a second
(opaque) surface, and subsequent recovery of the
surfaces using only straight-line interpolating
contours. This limitation severely restricted the range
of figures to which the system could be applied
(reconstruction of occluded corners, for example, was
impossible). In the past year, a new system has been
built which uses cubic Bézier splines of least energy as
the interpolating contours. The system now captures
curved illusory contours in figures that have been
formulated by Kanizsa and other perceptual
psychologists.

5.3.  Perceptual  Organization  of  Curves 

Token-based grouping has thus far been applied to
the problems of recovering straight-line structure [4]
and more recently curvilinear structure [12] from the
edge data of images. Dolan is currently extending this
approach to local parallel implementations of this
grouping paradigm. A SIMD model of curvilinear
grouping has been designed to be implemented in the
CAAPP layer of the IUA [37]. The model is relatively
simple and promises many orders of magnitude
speedup in extraction of straight and curved lines. A
MIMD version is currently being designed which
should alleviate the contention problems in the SIMD
design by utilizing both the CAAPP and ICAP layers.

5.4.  Qualitative  Navigation 

One  way  to  solve  the  computational  burden  of 
maintaining  accurate  geometric  maps  for  navigation 
is  to  eliminate  such  maps  altogether.  In  contrast  to 
model-based  approaches  to  navigation  where  a  map 
is  required  that  explicitly  represents  the  geometry  and 
location  of  3D  objects  in  the  world,  Pinette  [22, 31]  is 
developing  a  method  for  qualitative,  image-based 
navigation  via  homing.  This  approach  maintains  only 
a  topological  map  of  the  world,  representing 
particular  places  in  the  world  and  the  directions 
between  neighboring  places.  A  place  is  represented 
explicitly  in  the  map  by  the  image  of  the  world  as 
seen  from  that  location.  Spatial  reasoning  is 
performed  directly  on  images  using  only  the  bearings 
of  landmarks  from  a  current  location  and  a 
neighboring  target  location,  and  does  not  need  to 
acquire  exact  shape  and  range  information.  The  work 
is  developing  a  theoretical  foundation  for  qualitative 
reasoning  in  the  incremental  homing  paradigm, 
including  cases  with  a  lack  of  precision  in  recovering 
the  direction  of  landmarks,  and  the  presence  of  errors 
in landmark correspondence.

ABSTRACT

Our program in Image Understanding has continued to
focus on the critical issues of selection, indexing, saliency
computation and integration of visual cues. Efficient
solutions to these problems are considered central to the
development of robust object recognition systems, one of
our primary research foci. We have also continued our
work on the computation and the use of low level visual
cues such as motion, stereo, color and texture, on analog
VLSI circuits and on learning.

1  Introduction 

Research at the MIT AI Lab has continued along a range
of fronts, from low level processing, such as stereo,
motion, color and texture analysis, through intermediate
stages of integration of visual information, to higher level
tasks such as object recognition and navigation. This
report summarizes our main recent accomplishments in
these areas. As usual in these reports, we refer interested
readers to other publications for details.

2  Object  Recognition 

Because it has been one of our central focal points, we
begin with our work in object recognition. In approaching
the problem of recognizing objects from noisy images
of cluttered scenes, we have found it convenient to
separate out three different aspects of the problem:

•  Selection:  Given  a  large  set  of  image  features,  se¬ 
lect  (or  group)  subsets  likely  to  have  come  from 
single  objects. 

•  Indexing:  Given  one  of  these  image  feature  sub¬ 
sets,  select  a  small  set  of  object  models  from  the 
library  that  are  likely  to  match  the  data. 

•  Matching:  Given  a  data  feature  subset  and  an 
object  model,  determine  if  there  is  a  legal  transfor¬ 
mation  that  would  carry  the  model  into  a  pose  in 
the  image  that  is  consistent  with  the  data,  possi¬ 
bly  by  finding  a  matching  between  data  and  model 
features. 

We  will  describe  our  recent  work  in  each  of  these  areas, 
beginning  with  the  matching  problem,  since  this  must 
be  solved  even  if  we  use  trivial  solutions  to  the  other  two 
problems. 


3  Matching  methods 

The  goal  of  recognition  can  be  summarised  as  that  of 
deducing  the  existence  of  a  legal  transformation  from 
model  to  image  and  measuring  the  scope  of  the  associ¬ 
ated  interpretation,  i.e.  is  there  an  instance  of  the  trans¬ 
formed  object  model  in  the  scene,  and  how  much  of  the 
model  can  be  accounted  for  in  the  data.  More  formally, 
we  define: 


$\{F_i \mid 1 \le i \le m\}$: the set of model features
$\{f_i \mid 1 \le i \le s\}$: the set of data features
$T$: a legal transformation
$I : \{1, \ldots, s\} \mapsto \{1, \ldots, m, *\}$: a mapping.

Then we want to find the mapping $I_0$, pairing data and
model features (or excluding data features by pairing
them with $*$), from the set of mappings

$$\{\, I \mid \rho(T_I F_{I(i)}, f_i) \le \epsilon \;\; \forall i \text{ such that } I(i) \ne * \,\} \qquad (1)$$

that maximizes

$$I_0 = \arg\max_I \; g(\{(i, I(i)) \mid i = 1, \ldots, s\}) \qquad (2)$$

where $T_I$ is defined as the transformation associated with
$I$, $\rho$ is some appropriate measure (e.g. Euclidean distance
in the case of point features, or maximum Euclidean
separation in the case of line features) and $g$ is some measure
of the magnitude of a match (e.g. number of matched
model features, or total linear extent of such model
features).
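
As one concrete reading of (1) and (2), the following Python sketch tests whether a candidate mapping is feasible and scores it. It is only an illustration under several assumptions that are not in the original: features are 2D points, the legal transformation is a rotation followed by a translation, the distance measure is Euclidean, and the magnitude of a match is the number of matched data features. All function names are hypothetical.

```python
import math

NULL = None  # stands for the null match '*'

def transform(point, theta, tx, ty):
    """Apply a 2D rotation by theta followed by a translation (tx, ty)."""
    x, y = point
    c, s = math.cos(theta), math.sin(theta)
    return (c * x - s * y + tx, s * x + c * y + ty)

def is_feasible(mapping, model, data, pose, eps):
    """Condition (1): every non-null pairing lies within eps under the pose.

    mapping[i] is the index of the model feature paired with data feature i,
    or NULL if data feature i is excluded.
    """
    for i, j in enumerate(mapping):
        if j is NULL:
            continue
        if math.dist(transform(model[j], *pose), data[i]) > eps:
            return False
    return True

def magnitude(mapping):
    """One possible choice of g in (2): the number of matched data features."""
    return sum(1 for j in mapping if j is not NULL)
```

Among all feasible mappings, the one with the largest `magnitude` would play the role of the maximizing mapping in (2); how the candidate mappings and poses are generated is the subject of the methods described next.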

In recent years, we have considered three different
classes of approach to this problem. The first
concentrates on finding the correspondence $I$, the second
concentrates on finding the pose $T$, and the third is a hybrid
approach that combines aspects of both.

3.1  Interpretation  Tree  Search 

The first class of methods searches for correct solutions
in a correspondence space, that is, a discrete s-dimensional
space, where each dimension corresponds to
one of the s sensory data features, and along each
dimension of which there are m + 1 possible values,
corresponding to the pairing of the data feature with each of
the m model features, or with the null character, indicating
that the data feature is extraneous. We have used the
idea of an Interpretation Tree [Grimson, 1990] to structure
the search for correct correspondences in this space.
The method searches a tree of all possible pairings of
subsets of data features to model features, by executing
a depth first walk along each of the dimensions of the
space. This tree is in principle of exponential size, but
effective use of constraints on the relative shapes of sets
of features reduces the expected cost considerably. The
method has been shown to be quite efficient at finding
solutions in cluttered data, using either 2D data or 3D
data. Recent formal analysis of the method by Grimson
has led to the following results [Grimson, 1990]:

•  If  selection  is  perfect  (no  spurious  data  is  included) 
and  indexing  is  correct,  then  the  expected  amount 
of  search  is  quadratic  in  the  number  of  data  and 
model  features. 

• If selection is not used and indexing is correct, then
the expected amount of search is polynomial in the
number of data and model features, but is exponential
in the size of the correct solution.

• If selection is adequate (where a formal definition
of adequate can be given as a function of the ratio
of spurious data to model size), indexing is perfect,
and one heuristically terminates the search
once a sufficiently good solution is found (where formal
methods for defining thresholds for “sufficiently
good” are available), then the expected amount of
search is quartic in the number of data and model
features.

• If indexing returns an incorrect answer (i.e. one tries
to match a model not present in the data), even
using heuristic search termination, the expected
amount of search to deduce that the model is not
present in the data is exponential in the problem
parameters.

Our conclusion from this is that good selection methods
are essential for practical applications of Interpretation
Tree methods. Although the results have been
proven only in the context of constrained search methods
like the Interpretation Tree, we suspect that the conclusions
may well be relevant to other approaches to recognition.
Because of this, much of our current research
focuses on this problem, as detailed below.
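
The search described in this section can be sketched in a few lines of Python. This is a minimal illustration, not the implementation analyzed in [Grimson, 1990]: it assumes 2D point features, uses only a pairwise distance constraint (two data points may be paired with two model points only if their separations agree to within twice the position error `eps`), assumes each model feature is used at most once, and adds a simple bound that abandons branches which cannot beat the best interpretation found so far. The function name is hypothetical.

```python
import math

def interpretation_tree(data, model, eps):
    """Depth-first search over pairings of data features to model features.

    Level i of the tree assigns data[i] either a model index or None (the
    null match).  Returns the consistent interpretation that matches the
    largest number of data features.
    """
    best = [None] * len(data)

    def matched(interp):
        return sum(j is not None for j in interp)

    def consistent(partial, i, j):
        # Check the new pairing (data[i], model[j]) against earlier pairings.
        for k, jk in enumerate(partial):
            if jk is None:
                continue
            if jk == j:
                return False  # assumption: a model feature is used only once
            d_data = math.dist(data[i], data[k])
            d_model = math.dist(model[j], model[jk])
            if abs(d_data - d_model) > 2 * eps:
                return False
        return True

    def search(partial):
        nonlocal best
        i = len(partial)
        if i == len(data):
            if matched(partial) > matched(best):
                best = partial
            return
        # Even matching every remaining data feature cannot beat the best.
        if matched(partial) + len(data) - i <= matched(best):
            return
        for j in range(len(model)):
            if consistent(partial, i, j):
                search(partial + [j])
        search(partial + [None])  # null branch: treat data[i] as spurious

    search([])
    return best
```

Pairwise constraints prune most of the tree but do not by themselves guarantee that a single transformation explains the whole interpretation, so a full system would still verify the surviving interpretations against a pose.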

3.1.1  Variations  on  Correspondence  Search 

One of the reasons that correspondence space search
can have such a high cost is that it may waste time
redundantly searching portions of the space that have
already been considered. Thomas Breuel [1990] has
developed a variation on this depth first search method
that avoids this problem. In particular, as the interpretation
tree is searched, building up a correspondence,
the method also concisely and efficiently keeps track of
those regions of pose space that have already been explored
by the search process. By doing this, the search
process can be altered so that it does not re-examine
geometrically equivalent interpretations of the data. This
leads to an algorithm that has a worst case polynomial
time performance, while preserving many of the practical
advantages of correspondence space search.


3.2  Pose  Space  Search 

The second class of recognition methods concentrates on
finding the correct pose, rather than finding the correct
correspondence. In the case of 2D objects constrained
to a known plane, this in principle entails searching a
3D space, and in the more general case of 3D objects,
this entails searching a 6D space. Although this clearly
would seem to be an improvement over the exponential
approaches of correspondence space search, one generally
requires a fine (or nearly infinitesimal) tessellation of
the space, thereby increasing the cost of straightforward
search methods.

In the ideal case of perfect sensor data, the tessellation
of the pose space is generally not a problem. One can
simply search over all possible pairings of model and image
features, compute the associated transformation and
vote for that transformation in pose space, a la Hough
transforms. When uncertainty is allowed in the measurements,
however, one must be more careful about voting
for the entire volume of transformations consistent with
the pairing of a noisy sensor measurement and a model
feature, and this increases the demand on searching fine
tessellations of the pose space.

One way around this problem is to exploit the geometry
of pose space directly. Todd Cass [1990] has developed
a powerful framework for doing this, by providing
a formulation of the problem in which one can develop
a polynomial-time algorithm that guarantees finding all
feasible interpretations of the data, modulo uncertainty,
in terms of the model. The approach is based on representing
the model and the sensory data in terms of local
geometric features such as vertices and line segments.
It assumes bounds on the uncertainty in the position
or orientation of the data features due to sensor error.
One can show that there are only a polynomial number
of quantitatively different transformations that align the
model and the data modulo error. Object localization
is accomplished using a polynomial-time search through
the set of all model transformations to find those that
align large subsets of model and data features within the
uncertainty bounds.
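
A toy version of this idea, restricted to 2D translations, is sketched below in Python. It is not Cass's algorithm; it only illustrates why a polynomial number of candidate poses can suffice. The assumptions are that features are points and that the error bound is a square of half-width `eps`, so each pairing is consistent with an axis-aligned square of translations. The number of overlapping squares can only change at a square edge, so it is enough to test the corners formed by the left edge of one square and the bottom edge of another.

```python
def best_translation(model, data, eps):
    """Find a translation consistent with the most (model, data) pairings.

    A pairing (m, d) is consistent with translation t when m + t lies within
    eps of d in both coordinates.  Only O(n^2) candidate translations are
    examined, where n is the number of pairings.
    """
    centres = [(dx - mx, dy - my) for mx, my in model for dx, dy in data]
    xs = [cx - eps for cx, _ in centres]   # left edges of the squares
    ys = [cy - eps for _, cy in centres]   # bottom edges of the squares
    tol = 1e-12                            # guards against rounding on edges
    best_t, best_count = None, -1
    for tx in xs:
        for ty in ys:
            count = sum(abs(cx - tx) <= eps + tol and abs(cy - ty) <= eps + tol
                        for cx, cy in centres)
            if count > best_count:
                best_t, best_count = (tx, ty), count
    return best_t, best_count
```

The sketch counts pairings rather than one-to-one matches, and it ignores rotation; handling rotation and other error models is where the geometric analysis of pose space does the real work.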

Intuitively,  this  approach  can  be  considered  as  follows. 
For  each  pairing  of  a  data  and  model  feature,  there  is 
a  set  of  transformations  that  will  align  the  model  fea¬ 
tures  within  the  uncertainty  region  about  the  data  fea¬ 
ture.  This  set  of  transformations  carves  out  a  volume 
in  pose  space.  If  we  consider  all  pairings  of  data  and 
model  features,  we  get  a  set  of  such  volumes,  and  we 
are  interested  in  finding  points  in  the  pose  space  con¬ 
tained  within  the  intersection  of  a  large  number  of  such 
volumes.  One  could  find  such  points  by  simply  sampling 
points  in  pose  space  at  some  fine  spacing,  a  method  used 
earlier  by  Cass  in  implementing  a  very  fast  recognition 
scheme  on  the  Connection  Machine.  It  turns  out,  how¬ 
ever,  that  one  can  efficiently  find  such  volumes  by  de¬ 
coupling  the  search  over  the  full  pose  space  into  a  cou¬ 
pled  search  over  the  translational  components  and  a  sec¬ 
ond  search  over  the  rotational  components.  Moreover, 
one can use the structure of these geometric arrangements
to find very efficient, polynomial-time algorithms
for finding the boundaries of these pose-space volumes.
A similar method has been developed by Breuel [1991].

3.3  Alignment  Methods 

The third class of methods combines aspects of the first
two. In particular, the alignment method [Huttenlocher &
Ullman, 1987] searches through portions of correspondence
space to find possible matches of triples of
image features to triples of model features. For each
such pairing, the method then computes the associated
transformation (which can be modeled as an affine transformation
in the image) and uses this to transform the
other model features into the image. These hypothesized
features are then compared against the image data to
verify or refute the correctness of the determined transformation.
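
The hypothesize-and-verify step can be sketched with NumPy as follows. This is a simplified illustration that assumes a planar model with 2D point features, so that three point pairings determine a 2D affine map exactly; the original method works with 3D models. The function names and the tolerance are hypothetical.

```python
import numpy as np

def affine_from_triples(model_pts, image_pts):
    """Solve for the 2D affine map (A, t) taking three model points to three
    image points, i.e. image = A @ model + t for each pairing."""
    M = np.hstack([np.asarray(model_pts, float), np.ones((3, 1))])  # 3 x 3
    X = np.linalg.solve(M, np.asarray(image_pts, float))            # 3 x 2
    return X[:2].T, X[2]   # A is 2 x 2, t has length 2

def verify(A, t, model, image, tol):
    """Count model points that land within tol of some image point."""
    projected = np.asarray(model, float) @ A.T + t
    image = np.asarray(image, float)
    dists = np.linalg.norm(projected[:, None, :] - image[None, :, :], axis=2)
    return int(np.sum(dists.min(axis=1) <= tol))
```

A recognizer built this way would loop over candidate triples, call `affine_from_triples` for each, and keep the hypothesis with the highest `verify` count. The three model points must not be collinear, otherwise the linear system is singular.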

3.3.1  Linear  Combinations  of  Images 

In  its  original  form,  the  alignment  method  used  a  full 
3D  model  of  the  known  objects.  Recently,  Basri  and 
Ullman  [1990]  have  derived  a  new  method  for  aligning 
objects  with  images  that  only  requires  a  small  number 
of  images  of  the  object  as  the  model. 

In particular, they show that a 3D object can be represented
by a linear combination of 2D images of the
object. If $\mathcal{M} = \{M_1, \ldots, M_k\}$ is the set of pictures
representing a given object, and $P$ is the 2D image of an
object to be recognized, then $P$ is considered an instance
of $\mathcal{M}$ if

$$P = \sum_{i=1}^{k} \alpha_i M_i$$

for some constants $\alpha_i$. If $P$ is the image of a 3D object
with sharp edges, then $P$ can be represented as the linear
combination of 3 other images of the object. By linear
combination, we mean that the x coordinate of any point
in $P$ is simply a linear combination of the x coordinates
of the corresponding point in each of the three canonical
images. Similarly the y coordinate of any point in
$P$ is simply a linear combination of the y coordinates
of the corresponding point in each of the three canonical
images. While the two linear combinations for the
x and y coordinates are different, the same two linear
combinations hold for all points in the image.
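
The property can be checked numerically with a short NumPy sketch. The setup is an assumption made for illustration only: a cloud of random 3D points stands in for the object, views are orthographic projections after a rotation (no translation or scaling), and the particular rotation angles are arbitrary. The coefficients are estimated from four corresponding points and then used to predict every other point in the novel view.

```python
import numpy as np

def rot_y(a):
    c, s = np.cos(a), np.sin(a)
    return np.array([[c, 0, s], [0, 1, 0], [-s, 0, c]])

def rot_z(a):
    c, s = np.cos(a), np.sin(a)
    return np.array([[c, -s, 0], [s, c, 0], [0, 0, 1]])

rng = np.random.default_rng(0)
points = rng.uniform(-1, 1, size=(20, 3))   # synthetic 3D feature points

# Three canonical views; orthographic projection keeps the x and y columns.
views = [np.eye(3), rot_y(0.5) @ rot_z(0.3), rot_z(0.4) @ rot_y(0.6)]
projections = [points @ R.T for R in views]
X = np.column_stack([p[:, 0] for p in projections])   # x coords, 20 x 3
Y = np.column_stack([p[:, 1] for p in projections])   # y coords, 20 x 3

# A novel view of the same object.
novel = points @ (rot_y(-0.2) @ rot_z(0.7)).T

# Estimate one set of coefficients for x and another for y from 4 points.
alpha_x, *_ = np.linalg.lstsq(X[:4], novel[:4, 0], rcond=None)
alpha_y, *_ = np.linalg.lstsq(Y[:4], novel[:4, 1], rcond=None)

# The same two combinations predict every point in the novel view.
predicted = np.column_stack([X @ alpha_x, Y @ alpha_y])
```

In this noise-free setting `predicted` agrees with the first two columns of `novel` up to rounding. The canonical views must be sufficiently different from one another; if, say, two of them share the same y projection, the y coefficients are no longer determined.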

As a consequence, if one can establish a correspondence
between all of the points in the three canonical images,
then an elegant solution to the recognition problem
is to use the alignment method to establish an hypothesized
correspondence between a small number (typically
3) of features in a new image and one of the canonical
images, use this to compute the coefficients of the linear
combination, apply that linear combination to the three
canonical images to generate an hypothesized image of
the model if the original correspondence is correct, and
simply overlay that hypothesized edge image onto the
original image and verify the hypothesis by comparing
this edge image with the original image. Experiments
with the method indicate that the hypothesized edge
map can very closely overlap the actual edge map when
a correct alignment correspondence is chosen.

While the basic method applies to objects with sharp
edges, the same result holds for objects with smooth
edges, in approximation. In particular, using earlier
work by Basri and Ullman [1988], one can accurately
predict the appearance of smooth objects, provided one
has depth and curvature information about each contour
point as well as the 2D contour image. Building on this
idea, they show that one can extend the linear combinations
method to smooth objects, where now any view of
the object is represented as a linear combination of five
canonical images of the object.

The  method  is  initially  formulated  in  terms  of  wire 
frame  objects.  When  solid  objects  that  can  self-occlude 
are  considered,  one  may  need  additional  views  to  over¬ 
come  the  occlusion,  but  the  same  general  approach  holds. 

3.3.2  Model  Building  for  Alignment 

Given the success of the basic linear combinations
method, we have also developed several extensions to it.
Since the heart of the modeling approach is to establish
a correspondence between all the feature points in a set
of canonical images, we need efficient methods for doing
this automatically. One extension considers this problem
within the more general context of matching contours in
two images. This applies to a whole range of problems,
including determining optical flow, matching features for
alignment-based object recognition, and finding correspondence
for long-range and apparent motion. Ivan
Bachelder (together with Shimon Ullman) [1991] has recently
finished the development of a scheme for matching
partially constrained contours in two images using local
affine transformations. It assumes that constraint lines,
each narrowing down the match for a contour point in
the first image to a line in the second image, are available
for several contour points in the first image. The new
scheme constrains the matching by assuming that contours
are the orthographic projections of locally coplanar
points, thereby reducing the recovery of correspondence
to a local, linearly constrained, non-iterative calculation.
Suggested applications include the determination of optical
flow in short-range motion and the matching of
aligned contour views in either alignment-based object
recognition or long-range motion.

To determine the match for a contour point the
scheme finds the best affine transformation, in a
weighted least squares sense, that satisfies the match
constraint lines and specially matched points within an
oriented local neighborhood. This neighborhood is established
by weighting the constraint equations. The
weight for a given point constraint is the modulation of
the Gaussian distance to the local origin, which establishes
a circularly symmetric local neighborhood, by the
Gaussian distance from the local tangent to the contour,
which attempts to limit the neighborhood to a single
contour. The width of the modulating Gaussian is set
such that the axes of the oriented neighborhood are proportional
to the local axes of inertia. To determine the
width of the circularly symmetric Gaussian, the effective
size of the local neighborhood, several sizes are simultaneously
considered and the smallest one which yields a
stable, unique solution, as determined by the condition
number of a linear constraint matrix, is chosen. The
maximum allowed condition number is set according to
how much noise is expected in the constraint lines. In
the event that this condition number exceeds the maximum
allowed condition number even at the largest neighborhood,
there is more than one possible solution, from
which we choose the smallest affine transformation which
predicts matches for all neighborhood points that deviate
the least from a purely translational matching. The
neighborhood used to find the best pure translation for
a point is again determined using the condition number
criterion. Unique matches are guaranteed through
the use of a general pseudoinverse formulation that uses
a modified singular value decomposition technique guaranteeing
continuous solutions along contours. As the
constraint matrix becomes less stable according to the
condition number, the solution gradually changes.
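
The core of this procedure, a Gaussian-weighted least squares fit of a local affine motion to constraint lines with the neighborhood size chosen by a condition number test, can be sketched with NumPy as below. This is a simplified illustration and not the published scheme: it uses only the circularly symmetric Gaussian (no oriented modulation along the contour tangent), it represents each constraint line by a unit normal and an offset, and when no neighborhood passes the test it simply falls back to the widest one rather than using the modified pseudoinverse described above. All names and parameters are hypothetical.

```python
import numpy as np

def local_affine_match(x0, pts, normals, offsets, sigmas, max_cond):
    """Predict the match of contour point x0 from constraint lines.

    Each neighbour pts[i] must move onto the line normals[i] . y = offsets[i].
    A local affine motion y = x + A (x - x0) + t is fitted by weighted least
    squares, using the smallest Gaussian width in sigmas (assumed non-empty)
    whose weighted system has condition number at most max_cond.
    """
    d = pts - x0
    # One row per constraint; unknowns are [a11, a12, a21, a22, tx, ty].
    rows = np.column_stack([normals[:, 0] * d[:, 0], normals[:, 0] * d[:, 1],
                            normals[:, 1] * d[:, 0], normals[:, 1] * d[:, 1],
                            normals[:, 0], normals[:, 1]])
    rhs = offsets - np.sum(normals * pts, axis=1)
    for sigma in sorted(sigmas):
        w = np.exp(-np.sum(d ** 2, axis=1) / (2 * sigma ** 2))
        G = rows * w[:, None]
        if np.linalg.cond(G) <= max_cond:
            break   # smallest neighbourhood giving a stable system
    # If the loop never breaks, G and w come from the widest neighbourhood.
    params, *_ = np.linalg.lstsq(G, rhs * w, rcond=None)
    return x0 + params[4:]   # the match is the translational component
```

Because the affine motion is expressed relative to `x0`, the predicted match for `x0` depends only on the two translational parameters, which mirrors the observation in the text that only those components need to be solved for explicitly.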

This entire procedure is repeated for every contour
point. Note, however, that the match for each point
may be computed in parallel. Furthermore, since the
predicted match for a point is simply given by the translational
component of the local affine transformation,
finding the match involves explicitly solving for only the
two translational components of the affine transformation.
A closed form solution for the match involves using
a continuous version of the pseudoinverse to invert
a 2 x 2 matrix and a 4 x 4 matrix, the coefficients of
which are weighted summations of local point constraint
parameters that may be determined in parallel.

Simulation  results  obtained  on  noisy  synthetic  and 
natural  imagery  have  demonstrated  the  robustness  of  the 
approach  and  seem  to  agree  with  supporting  biological 
evidence  for  the  employment  of  such  a  scheme  in  the 
primate  motion  pathway. 

Amnon Shashua has developed a second approach to
the model building problem, specifically to the problem
of recovering shape and correspondence from two orthographic
views and 4 corresponding points. In attempting
to address a problem traditionally seen in the context of
three or more orthographic views, two arguments are
raised: (i) shape should be represented as a deviation
from a reference plane, rather than its traditional Euclidean
form, and (ii) in the attempt to recover three-dimensional
information from the changing 2D image,
it may be more advantageous to address the problem
of recovering shape and full correspondence/flow, rather
than the problem of recovering shape and 3D motion
parameters.

Shashua has shown that it is possible to recover shape
and full correspondence/flow simultaneously, by using
the instantaneous change in brightness as an integral
part of the computational model. The resulting equations
are further manipulated in such a way that the contributions
of motion and shape to the displacement vector of
a moving point are completely separated. The motion
component is the transformation that aligns $p$, the image
coordinates of an object point $P$, with $\hat{p}$, the image
coordinates of the projection of $P$ onto a reference plane in
the second view. The remaining displacement between
$\hat{p}$ and $p'$, the corresponding point of $p$, depends only on
shape. Shape is therefore defined as the deviation of $P$
from a reference plane, defined by the 4 corresponding
points, along the line of sight.

The separation of shape and motion is used to define
a two-stage process in which motion is first factored
out and then the unknown shape component is recovered
using the instantaneous brightness change between
both views. One advantage of this approach is that both
stages are insensitive to the range of motion, thereby
making it possible to deal directly with distant views.
It is further shown that with this representation one can
derive, in a straightforward manner, many previously established
results relating to constraint lines, 3D motion
parameters from two orthographic views and the prediction
of novel views for three-dimensional object recognition.

Finally, the entire computational model simply requires
establishing the affine transformation that aligns
three corresponding points in both views with respect
to the fourth corresponding point, serving as an origin,
in both views. The affine parameters define a constraint
line from which both shape and the location of the corresponding
point are recovered by using the constant
brightness equation as a second constraint line, and then
finding their intersection.

An  alternative  exploration  of  building  models  by  find¬ 
ing  correspondences  between  sets  of  images  has  been  de¬ 
veloped  by  Ronen  Basri.  He  has  focused  on  the  task 
of  shape  recovery  from  a  motion  sequence,  which  re¬ 
quires  the  establishment  of  correspondence  between  im¬ 
age  points.  The  two  processes,  the  matching  process 
and  the  shape  recovery  one,  are  traditionally  viewed  as 
independent.  Information  obtained  during  the  process 
of  shape  recovery,  however,  can  be  used  to  guide  the 
matching  process.  Basri  has  developed  a  technique  that 
builds  on  the  constraints  imposed  on  the  correspondence 
by  rigid  transformations,  extending  them  to  objects  that 
undergo general affine (non-rigid) transformation (including
stretch and shear), as well as to rigid objects
with  smooth  surfaces.  In  all  these  cases  corresponding 
points  lie  along  epipolar  lines,  and  these  lines  can  be  re¬ 
covered  from  a  small  set  of  corresponding  points.  Basri 
has  developed  an  algorithm  that  takes  advantage  of  such 
epipolar  lines  to  recover  the  correspondence  from  three 
contour  images.  The  algorithm  has  been  implemented 
and  used  to  construct  object  models  for  recognition. 

3.3.3 Constraints on Model Images

On  a  related  note,  the  original  linear  combinations 
approach  used  as  its  model  a  set  of  edge  maps  taken  un¬ 
der  similar  conditions  in  which  the  points  on  the  edge 
maps  had  been  placed  in  correspondence.  The  study 
of  Mooney  images  (high  contrast  images  of  frees)  shows 
that  the  setting  of  light  sources  plays  a  significant  role  on 
the  interpretation  of  the  image  of  a  3D  object.  Amnon 
Shashua  has  conducted  a  computational  study  showing 
that  lighting  conditions  can  be  compensated  for  prior  to 
matching  the  image  of  the  object  with  its  internal  model. 
The  entire  computation  can  be  carried  out  by  a  single 
layer  percepfron.  The  study  presents  three  observations: 
(i)  an  image  of  an  object  taken  with  an  arbitrary  light¬ 
ing  condition  can  be  expressed  as  a  linear  combination 
of  three  images,  each  taken  from  a  different  direction 
of light source, (ii) the linear coefficients can be recovered
from the zero-crossings, provided their location is
accurate, and (iii) accuracy can be traded off by using
the sign-bits instead of zero-crossings for recovering the
linear  coefficients. 
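
Observation (i) can be illustrated numerically. The following is a minimal sketch, assuming an ideal Lambertian surface with no cast or attached shadows (negative shading values are not clipped), synthetic surface normals and albedo, and ordinary least squares over all pixels in place of the zero-crossing or sign-bit recovery described in the study. All function and variable names are illustrative and do not come from the original work.

```python
import numpy as np

def render(normals, albedo, light):
    """Shadow-free Lambertian image: albedo * (normal . light) per pixel."""
    return albedo * (normals @ light)

def lighting_coefficients(basis_images, novel_image):
    """Least-squares coefficients a with novel ~ sum_k a[k] * basis[k]."""
    B = np.stack([b.ravel() for b in basis_images], axis=1)  # (pixels, 3)
    a, *_ = np.linalg.lstsq(B, novel_image.ravel(), rcond=None)
    return a

# Synthetic "object": random unit normals and albedo (assumed data).
rng = np.random.default_rng(0)
normals = rng.normal(size=(500, 3))
normals /= np.linalg.norm(normals, axis=1, keepdims=True)
albedo = rng.uniform(0.2, 1.0, size=500)

# Three basis images, each taken under a different light direction.
basis_lights = np.eye(3)
basis = [render(normals, albedo, s) for s in basis_lights]

# An image under a new light direction is a linear combination of the basis.
novel_light = np.array([0.3, -0.5, 0.8])
novel = render(normals, albedo, novel_light)
coeffs = lighting_coefficients(basis, novel)
```

Because the basis lights in this sketch are the coordinate axes, the recovered coefficients coincide with the components of the new light direction. With real images, shadows and specularities would break the exact linearity.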

3.3.4  Correspondence  for  Alignment 

Independent  of  how  the  object  models  are  built,  align¬ 
ment  methods  still  need  to  establish  a  correspondence 
between  the  image  data  and  the  model.  The  usual 
method  used  for  determining  correspondence  is  to  com¬ 
pute  features  explicitly  from  images  and  then  match 
them  to  their  equivalents  in  models.  This  approach  may 
be problematic, however. First, it is difficult to compute
the features reliably. Second, the number of possible pairings
of image and model features is large, and the search
for a solution may be extensive. In the past year, Shimon
Ullman and Pam Lipson have been investigating
schemes  to  solve  the  correspondence  problem,  and  there¬ 
fore  determine  alignment,  without  explicitly  computing 
features.  They  have  explored  three  techniques  based  on 
Ullman  and  Basri’s  linear  combination  approach  for  ob¬ 
ject  recognition.  The  goal  is  to  find  coefficients  that  align 
a  model  with  an  image.  They  start  with  a  rough  align¬ 
ment  stage.  They  then  improve  the  correspondence  by 
performing  a  simultaneous  matching  of  points.  The  tech¬ 
niques  are  designed  to  handle  models  and  images  sepa¬ 
rated  by  relatively  small  rotations  of  up  to  30  degrees. 
They  have  tested  the  three  techniques  on  synthetic  and 
natural  objects.  The  results  to  date  are  promising.  They 
are  currently  evaluating  the  performance  of  the  tech¬ 
niques  as  the  output  from  the  rough  alignment  stage 
degrades. 
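
The goal of "finding coefficients that align a model with an image" can be made concrete with a small sketch. It assumes orthographic projection, a rigid synthetic point set, two stored model views, and, unlike the schemes under investigation, known point correspondences, so it shows only the final coefficient fit and not the feature-free matching itself. The helper names are hypothetical.

```python
import numpy as np

def rotation(axis, angle_deg):
    """Rotation matrix about the x or y axis."""
    c, s = np.cos(np.radians(angle_deg)), np.sin(np.radians(angle_deg))
    if axis == "x":
        return np.array([[1, 0, 0], [0, c, -s], [0, s, c]])
    return np.array([[c, 0, s], [0, 1, 0], [-s, 0, c]])

def project(points3d, R):
    """Orthographic projection of rotated 3D points onto the image plane."""
    return (points3d @ R.T)[:, :2]

def fit_view_coefficients(view1, view2, target):
    """Least-squares C with target ~ [x1, y1, x2, 1] @ C."""
    M = np.column_stack([view1[:, 0], view1[:, 1], view2[:, 0],
                         np.ones(len(view1))])
    C, *_ = np.linalg.lstsq(M, target, rcond=None)
    return M, C

rng = np.random.default_rng(1)
points = rng.uniform(-1, 1, size=(30, 3))      # synthetic rigid object
view1 = project(points, np.eye(3))             # first model view
view2 = project(points, rotation("y", 20))     # second model view
target = project(points, rotation("x", 8) @ rotation("y", 12))  # novel image

M, C = fit_view_coefficients(view1, view2, target)
predicted = M @ C                              # model aligned to the image
```

In this noise-free setting the predicted points coincide with the novel image, since each image coordinate is a linear combination of the coordinates in the two model views plus a constant.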

3.4  Alternative  matching  methods 

An  alternative  formulation  of  the  matching  problem  has 
been  explored  recently  by  Sandy  Wells  [1990],  focusing 
on finding statistical characterisations of feature-based
recognition, attempting to balance simplicity with accuracy.
The method uses a maximum a-posteriori probability
(MAP) criterion within the general context of
model-based recognition.

Wells  has  derived  a  simple  MAP  model  matching  cri¬ 
terion  that  captures  important  aspects  of  recognition  in 
controlled  situations.  A  detailed  metrical  object  model 
is  assumed.  A  probabilistic  model  of  image  features  is 
combined  with  a  simple  prior  on  both  the  pose  and  the 
feature  interpretations  to  yield  a  mixed  objective  func¬ 
tion.  The  parameters  that  appear  in  the  probabilistic 
models  can  be  derived  from  images  in  the  application 
domain.  By  extremising  the  objective  function,  an  op¬ 
timal  matching  between  model  and  image  features  re¬ 
sults.  Within  this  framework,  good  models  of  feature 
uncertainty  allow  for  robustness  despite  inaccuracy  in 
feature  detection.  In  addition,  the  relative  likelihood  of 
features  arising  from  either  the  object  or  the  background 
can  be  evaluated  in  a  rational  way.  The  objective  func¬ 
tion  provides  a  simple  and  uniform  means  of  evaluating 
match  and  pose  hypotheses  by  the  amount  of  the  image 
that is explained in terms of the model, as well as the
metrical  consistency  of  the  hypothesis.  It  allows  these 
two  aspects  to  be  traded  off  in  a  rational  way  based  on 
domain  statistics.  An  experimental  implementation  of 


MAP  model  matching,  among  features  derived  from  low 
resolution  edge  images,  has  been  built  and  tested. 
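
The trade-off between "amount of image explained" and "metrical consistency" can be illustrated with a toy objective. This is a sketch under strong assumptions and not Wells's actual formulation: point features, an isotropic Gaussian error model with standard deviation `sigma`, a 2D affine pose, and a single hypothetical constant `reward` standing in for the log-ratio of object to background likelihoods. Each image feature is either matched to its nearest projected model feature or left as background (here encoded as -1).

```python
import numpy as np

def map_score(image_pts, model_pts, pose, sigma=1.0, reward=4.0):
    """Score a pose hypothesis; return (score, per-feature interpretation)."""
    A, t = pose
    projected = model_pts @ A.T + t
    # Squared distance from every image feature to every projected model feature.
    d2 = ((image_pts[:, None, :] - projected[None, :, :]) ** 2).sum(axis=2)
    nearest = d2.argmin(axis=1)
    # Gain of explaining a feature by the model rather than by background.
    gain = reward - d2[np.arange(len(image_pts)), nearest] / (2 * sigma ** 2)
    matches = np.where(gain > 0, nearest, -1)
    return gain[gain > 0].sum(), matches

model = np.array([[0.0, 0.0], [10.0, 0.0], [0.0, 10.0]])
image = np.array([[0.5, 0.0], [10.5, 0.0], [0.5, 10.0],   # noisy object features
                  [50.0, 50.0]])                          # one clutter feature

true_pose = (np.eye(2), np.zeros(2))
wrong_pose = (np.eye(2), np.array([20.0, 20.0]))
true_score, true_matches = map_score(image, model, true_pose)
wrong_score, _ = map_score(image, model, wrong_pose)
```

The correct pose explains three features with small residuals and leaves the clutter feature as background, so it scores higher than the displaced pose, which explains nothing. The one-to-one matching constraint is ignored in this sketch.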

4  Selection  methods 

For  all  of  the  matching  systems  we  have  developed  for 
recognition,  performance  both  in  terms  of  speed  and  ac¬ 
curacy  depends  critically  on  the  set  of  data  features  sup¬ 
plied  to  the  algorithm.  Thus,  all  of  these  methods  can 
benefit from good selection methods that isolate subsets
of the data features likely to correspond to a single object.
While model-driven selection methods, such as the
Hough transform or Geometric Hashing, are of use with
small libraries of objects, we are interested in methods
that also work with large libraries, and this suggests the
use  of  data-driven  selection  methods. 

4.1 Saliency

One  method  for  selecting  groups  of  data  features  on 
which  to  focus  is  the  saliency  networks  developed  by 
Shashua and Ullman [1988], the basis of which was described
in earlier proceedings. The key idea is to use
local  measures  of  curvature  and  orientation  as  part  of  a 
global  optimisation  process  to  find  contours  that  appear 
most  salient  relative  to  the  entire  set  of  image  contours. 

Shashua  and  Ullman  [1991]  have  continued  to  extend 
this  work,  by  developing  a  network  model  to  perform 
grouping of image contours. The inputs to the net are
fragments of image contours, and the output is the partitioning
of the fragments into groups, together with a
saliency  measure  for  each  group.  The  grouping  is  based 
on  a  measure  of  overall  length  and  curvature.  The  net¬ 
work  decomposes  the  overall  optimisation  problem  into 
independent  optimal  pairing  problems  performed  at  each 
node.  The  resulting  computation  maps  into  a  uniform 
locally  connected  network  of  simple  computing  elements. 

A  grouping  of  the  image  contours  is  defined  as  the 
formation  of  a  set  of  disjoint  groups,  each  correspond¬ 
ing  to  a  curve  that  may  have  any  number  of  gaps,  and 
whose union covers all the contour fragments in the image.
Given a function $F(A)$ that measures some desired
property of a group $A$, we would like to find a disjoint set
of groups $A_1, \ldots, A_m$ that maximises $\sum_i F(A_i)$ over all
possible groupings. The optimisation problem is defined
more specifically below.

For  the  purpose  of  grouping  it  is  convenient  to  con¬ 
sider  the  image  as  a  graph  of  edge  elements.  The  vertices 
of  the  graph  correspond  to  image  pixels,  and  the  arcs  to 
elementary  edge  fragments.  The  input  to  the  grouping 
problem  is  a  contour  image,  represented  by  a  subset 
of the elements in the graph. A path in the graph corresponds
to a contour in the image having any number
of  gaps. 

A grouping of these elements is a collection of chains
of elements $A_1, \ldots, A_m$ such that $A_i \cap A_j = \emptyset$ for $i \neq j$ and
$\bigcup_i A_i \supseteq E_T$, where $E_T$ is the input subset of elements. To define an optimal grouping we will define
a function $F(A)$ that measures the quality of a group $A$.
An optimal grouping is then a grouping that maximises
$\sum_i F(A_i)$ over all possible groupings of the elements.

This  problem  is  similar  to  the  saliency  computation 
and it is motivated by similar considerations of finding
smooth long boundaries. The difference is that now
we are trying to form an explicit grouping, and the optimality
measure is therefore not just evaluated at each
image  location  independently,  but  is  defined  instead  over 
possible  groupings.  This  precludes,  for  example,  a  part 
of  a  contour  from  participating  in  a  number  of  different 
groups. 

4.2  Visual  Attention 

A second way to build on the saliency concept is to incorporate
the idea of visual attention. Tanveer Syeda
[1990, 1991] has been developing a computational model
of visual attention, and has used it to demonstrate attentional
selection as a useful module for object recognition.

Towards  this  end,  two  modes  of  human  attentional 
behavior,  namely  attracted  attention  and  pay-attention 
modes  were  identified.  The  attracted  attention  mode 
of  behavior  is  spontaneous  and  is  commonly  exhibited 
by an unbiased observer (i.e., with no a priori intentions)
when some object or some aspect of the scene
attracts  his/her  attention,  while  the  latter  is  a  more  de¬ 
liberate  behavior  exhibited  by  an  observer  looking  at  a 
scene  with  a  priori  goals  (such  as  the  task  of  recognising 
an  object,  say)  and  hence  paying  attention  to  only  those 
objects/aspects  of  a  scene  that  are  relevant  to  the  goal. 

Briefly,  the  model  suggests  that  the  scene  represented 
by  the  image  be  processed  by  a  set  of  interacting  feature 
detectors  that  generate  a  hierarchy  of  maps,  representing 
features such as brightness, color, texture, depth, grouping
of edges, and others such as shape, size, symmetry,
etc.  The  feature  maps  are  then  processed  by  filters  in¬ 
corporating  strategies  for  selecting  distinctive  regions  in 
these  maps.  The  choice  of  these  strategies  is  guided  by  a 
central  control  mechanism  that  combines  top-down  task 
level  and  a  priori  information  with  the  bottom-up  infor¬ 
mation  derived  from  the  features,  to  demonstrate  either 
mode  of  attentional  behavior  as  desired.  Finally,  an  ar¬ 
biter  module  housing  another  set  of  strategies  selects  the 
most  significant  features  across  the  feature  maps,  which 
can then be used in, say, an object recognition system.

A  system  implementing  the  computational  model  de¬ 
scribed  here  is  being  built.  The  aim  of  this  is  to  demon¬ 
strate  how  selection  can  be  achieved  via  attention  and 
specifically,  how  it  can  be  of  use  in  object  recognition.  So 
far,  three  features  were  chosen,  namely,  color,  texture, 
and  parallel-line-groups.  The  respective  feature  maps 
were  built,  and  the  selection  filters  for  finding  distinc¬ 
tive  regions  in  these  maps  have  been  developed.  In  ad¬ 
dition,  a  version  of  the  arbiter  module  to  combine  the 
saliency  information  from  the  various  features  has  been 
built.  Some  of  this  work  is  now  described. 

A color feature map was developed that describes
the  color  image  as  consisting  of  perceptually  different  col¬ 
ored  regions.  Here,  a  method  of  perceptual  categorisa¬ 
tion  of  a  color-space  was  introduced  that  made  possible 
fast  color  region  segmentation.  A  color  saliency  map  was 
then  built  which  used  a  color  saliency  measure  that  em¬ 
phasised  attributes  that  are  also  salient  in  human  color 
perception. 

The  texture  feature  map  was  generated  by  regarding 
the  image  as  being  generated  by  a  space-limited  station¬ 
ary  stochastic  process.  Here,  the  segmentation  of  the 


textured  image  was  obtained  by  a  comparison  of  the  lin¬ 
ear  prediction  spectra  of  adjacent  windowed  regions  of 
the  image.  Properties  such  as  the  relative  distribution 
of  dark  and  bright  blobs  were  then  made  use  of  to  judge 
the  distinctiveness  of  a  region.  This  was  used  to  generate 
the  texture  saliency  map. 

Lastly,  the  parallel-line-groups  feature  map  high¬ 
lighted  groups  of  closely-spaced  parallel  lines  in  an  edge 
image  (i.e.,  brightness  image  passed  through  an  edge  de¬ 
tector).  It  has  been  found  that  some  texture  information 
can  be  modeled  this  way.  For  example,  printed  letters  on 
a  surface  (such  as  a  bottle)  appear  as  a  bunch  of  closely 
spaced  parallel  lines  when  passed  through  an  edge  de¬ 
tector.  Similarly,  some  types  of  wooden  tables  show  this 
type  of  texture  in  an  image. 

The  three  feature-maps,  the  respective  saliency-maps, 
and  the  arbiter  module  built  constituted  the  implemen¬ 
tation  of  the  attracted-attention  mode  of  the  model.  The 
next  phase  of  the  work  implemented  the  pay-attention 
mode.  This  was  built  with  the  aim  of  performing  selec¬ 
tion  in  model-based  object  recognition.  Here,  the  color 
and  texture  information  in  the  model  (extracted  using 
the  feature  maps  described  earlier)  was  used  to  build  a 
description  of  the  object-model.  This  description  was 
then  used  to  design  strategies  for  the  selection  filters. 
This  involved  developing  new  algorithms  for  finding  in¬ 
stances  of  regions  in  the  image  satisfying  object-model 
color  and  texture  descriptions. 

Finally, a 3D from 2D recognition system was built
to evaluate the selection mechanism. Initial studies with
selection based on color information alone have been
encouraging.  Much  of  the  future  work  will  concentrate 
on  the  integration  of  the  selection  mechanism  with  the 
recognition  system,  and  also  developing  faster  ways  of 
matching model features to data features that exploit
information  derived  from  the  selection  process. 

4.2.1  Other  uses  of  color 

Another  approach  to  the  selection  problem  has  been 
developed  by  Kah  Kay  Sung.  Similar  to  Syeda,  his  work 
has  focused  on  using  color  as  a  cue  for  finding  salient 
regions. Color is an excellent visual cue for gathering
surface material information from the scene, because if
treated  appropriately,  it  factors  out  image  effects  due  to 
surface  orientation  changes  and  illumination  differences. 
The  specific  approach  involves  extending  traditional  sig¬ 
nal  and  image  processing  operations  to  work  with  color 
data.  The  ultimate  goal  is  to  establish  a  formal  relation¬ 
ship  between  early  level  scalar  field  image  processes  and 
their  corresponding  vector  field  notions,  so  that  color 
images  can  be  treated  and  operated  upon  as  piecewise 
continuous  fields  like  grey-level  intensity  images. 

There  are  three  main  aspects  to  this  work.  First,  a 
color  representation  scheme  and  a  related  color  difference 
measure  are  derived  that  respond  strongly  to  actual  hue 
differences  in  an  image  but  not  to  pure  intensity  changes. 
Both  notions  are  based  on  a  pigmentation  model  of  ma¬ 
terial  surfaces,  where  the  Lambertian  color  of  a  surface 
depends  on  the  composition  of  its  embedded  pigments. 
Second,  by  building  on  an  earlier  radiometric  technique 
by Wolff at Columbia, an extended method for extracting
Lambertian color components has been developed.
One  can  use  this  technique  to  produce  reduced  specular¬ 
ity  images.  Finally,  Sung  has  extended  scalar  field  image 
processes  to  color,  by  deriving  color  analogues  for  some 
early  level  scalar  image  processes  and  concepts,  based  on 
his chosen color representation scheme. The list includes:
(1) a model and quantitative measure for color noise,
(2) noise reduction techniques, for example smoothing,
MRF-like local averaging, and median window filtering,
(3) boundary detection, and (4) uniformity detection as
an approach to region growing. All the above mentioned
color  image  processes  have  been  implemented  and  tested 
either  on  the  parallel  Connection  Machine  or  on  the  Sym¬ 
bolics  Lisp  machines. 

Working  jointly  with  Brian  Subirana,  Sung  has  used 
the  color  based  approach  to  region  growing  to  tackle 
the  problem  of  perceptual  organisation,  or  selection,  of 
salient  image  regions.  Using  the  technique  of  Curved 
Inertia Frames, developed earlier by Subirana-Vilanova,
they  have  devised  a  perceptual  organisation  scheme  that 
works  without  the  need  of  explicit  edges.  Their  approach 
is  based  on  the  concept  of  finding  brightness  ridges  in  im¬ 
ages,  and  utilises  a  new  ridge  detector  which  is  indepen¬ 
dent  of  scale  and  which  can  be  applied  to  texture,  color 
and  brightness.  They  have  implemented  the  scheme  for 
perceptual  organisation  using  color  on  the  Connection 
Machine. 

4.3  Region  Based  Grouping 

An  alternative  approach  to  the  selection  problem  was 
explored by David Clemens [1991]. He has developed a
region-based grouping method that attempts to collect
together  image  edges  likely  to  have  come  from  a  single 
object.  To  do  this,  Clemens’  method  uses  a  variation  on 
a  brushfire  technique  to  propagate  labels  from  edges  into 
regions  of  uniform  intensity.  Once  the  propagation  ter¬ 
minates  in  a  set  of  region  centers,  a  backtracking  process 
is initiated that traces the labels back from the region
center to identify those edges that “surround” it. The
process is careful to allow for a branching of the boundaries
between  regions  of  constant  label  in  this  backtracking 
stage,  thereby  avoiding  some  of  the  standard  problems 
associated  with  brushfire  methods.  This  method  finds 
sets  of  edges  that  bound  a  region  of  roughly  uniform  in¬ 
tensity.  Because  of  this,  one  can  use  the  center  of  the 
region  to  establish  an  origin  relative  to  which  an  ordering 
can  be  established  on  the  collection  of  edges.  A  similar 
ordering  can  be  imposed  on  model  edges,  and  this  order¬ 
ing  can  be  used  in  the  matching  process  to  dramatically 
reduce  the  cost  of  the  search. 
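
One way such an ordering can cut the search is sketched below. The sketch assumes that each edge is represented by its midpoint, that edges are ordered by polar angle about the region center, and that image and model groups contain the same number of edges with none missing. These are simplifying assumptions for illustration and not details of Clemens' method. Under them, only the cyclic shifts of the ordering need to be tried, rather than all permutations.

```python
import math

def angular_order(edge_midpoints, center):
    """Indices of edges sorted by polar angle about the region center."""
    cx, cy = center
    return sorted(range(len(edge_midpoints)),
                  key=lambda i: math.atan2(edge_midpoints[i][1] - cy,
                                           edge_midpoints[i][0] - cx))

def cyclic_alignments(image_order, model_order):
    """Candidate pairings: one per cyclic shift of the model ordering."""
    n = len(model_order)
    return [list(zip(image_order, model_order[s:] + model_order[:s]))
            for s in range(n)]

midpoints = [(1, 0), (0, 1), (-1, 0), (0, -1)]   # four edges around a region
order = angular_order(midpoints, center=(0, 0))
candidates = cyclic_alignments(order, order)
```

For n ordered edges this yields n candidate pairings instead of n! unordered ones.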

5  Indexing  and  invariants 

Indexing  remains  a  major  problem  for  recognition  sys¬ 
tems.  Since  the  cost  of  incorrectly  indexing  an  object 
model  is  high  (based  on  our  earlier  complexity  results), 
we  seek  indexing  methods  that  can  avoid  the  obvious 
linear  search  method.  In  previous  proceedings,  we  have 
reported  on  some  of  our  work  on  this  problem  [Breuel, 
1989].  In  the  past  year,  we  have  continued  to  explore 
other  approaches  to  this  problem,  with  particular  focus 


on  the  use  of  invariants,  the  use  of  non-accidental  prop¬ 
erties,  and  indexing  treated  as  the  problem  of  matching 
models  to  images  by  storing  markers  for  the  models  in  a 
hash  table  at  compile  time,  which  the  images  access  at 
run  time. 

Invariant  functions  have  been  widely  used  for  the 
recognition  of  planar  objects  from  arbitrary  views.  They 
form  an  attractive  option,  since  in  principle  one  can  sim¬ 
ply  compute  an  invariant  from  the  image  and  directly 
look  up  the  (unique)  object  that  could  produce  that 
value  in  a  precomputed  lookup  table.  Questions  of  the 
stability  of  such  schemes  are  unclear,  however.  Further¬ 
more,  Clemens  and  Jacobs  [1991]  have  recently  shown 
that  there  are  no  non-trivial  invariant  functions  when 
models  consist  of  arbitrary  collections  of  3D  points,  and 
projection is modeled as orthographic with scale (a similar
result has been shown independently by Burns et al.
[1990] and by Moses and Ullman [1991]). Jacobs (in these
proceedings) considers in what restricted 3D domains invariant
functions might be possible. That paper shows,
for  example,  that  there  are  invariant  functions  that  do 
not produce false positive matches only when the set of
allowable models is a measure 0 subset of the set of all
models containing 3D point features. Moses and Ullman
consider the effect of restricting recognition to
certain  classes  of  objects,  but  still  show  that  for  some 
classes  there  are  no  non-trivial  invariant  functions.  De¬ 
spite  the  negative  tone  of  these  results,  we  are  continuing 
to  explore  the  role  of  invariants  in  indexing  and  recogni¬ 
tion. 

A  related  approach  to  indexing  is  to  consider  creating 
the  simplest  and  most  economical  description  possible  of 
the set of all images that an object may produce. David
Jacobs  has  developed  solutions  for  this  problem  in  a  va¬ 
riety  of  instances,  including  models  that  consist  of  point 
features,  models  that  consist  of  point  features  with  ro¬ 
tational  degrees  of  freedom,  and  models  that  consist  of 
points  with  tangent  or  directional  information.  These  re¬ 
sults  have  led  to  new  methods  of  implementing  indexing, 
as well as some new insights into approaches that use invariants
and non-accidental properties. In addition, Jacobs
has developed a partial solution to the problem of
accounting  for  sensing  error  when  implementing  align¬ 
ment,  indexing  or  invariant-based  approaches  to  object 
recognition  with  models  that  consist  of  point  features. 

6 Error analysis

Besides  building  and  testing  implementations  of  recog¬ 
nition  systems  on  real  data,  we  have  also  devoted  some 
of our effort to considering formal analysis of the behavior
of recognition systems, especially in the presence of
bounded  uncertainty  in  the  image  measurements. 

Eric Grimson, in collaboration with Dan Huttenlocher
(Cornell), has continued to develop a framework for such
error  analysis,  based  on  the  following  model.  Assume 
that  an  object  is  modeled  by  a  set  of  simple  geometric 
features,  such  as  points  or  lines.  Assume  that  similar  ge¬ 
ometric  features  are  extracted  from  the  image,  but  that 
their  position  is  known  only  to  within  some  bounded  er¬ 
ror,  e.g.  a  point’s  position  is  known  only  to  lie  within 
a disc of radius ε. Under these conditions, we are interested
in analysing the performance of a recognition
system  as  the  sensor  uncertainty  changes,  and  as  the 
amount  of  scene  clutter  changes.  We  are  typically  in¬ 
terested  in  the  probability  of  a  false  positive  result  (the 
system  incorrectly  identifies  an  instance  of  the  object  in 
the  scene),  and  the  conditions  needed  to  guarantee  that 
this  probability  is  less  than  some  bound. 

In  earlier  work,  we  developed  an  occupancy  model  of 
this  problem,  and  applied  it  to  an  analysis  of  the  Hough 
transform [Grimson and Huttenlocher, 1990]. We derived
conditions under which the Hough transform could
be  guaranteed  (within  the  assumptions  of  the  analysis) 
to  provide  no  false  positives,  as  well  as  showing  how  the 
method should degrade with increasing scene clutter (an
observation that has been independently verified
experimentally by other groups). In this work, we considered
the recognition of 2D objects in a known support plane
from  single  images,  and  3D  objects  from  3D  data. 

More  recently,  we  have  extended  our  analysis  to  con¬ 
sider  the  case  of  recognising  2D  and  3D  objects  from  a 
single  2D  image,  using  the  weak  perspective  projection 
model [Grimson et al. 1991]. For the case of 2D objects,
the  effect  of  the  projection  can  be  modeled  as  an  affine 
transformation  of  the  plane.  Based  on  this,  a  number  of 
recognition  systems  use  the  idea  of  choosing  three  points 
as a basis for an affine coordinate system, and rewriting
the coordinates of all other points with respect to
this  system.  Such  affine  coordinates  are  invariant  under 
any  affine  transformation,  and  can  in  principle  be  used 
to  drive  the  recognition  process.  Together  with  David 
Jacobs,  we  have  shown  that  under  the  bounded  error 
model,  the  range  of  values  for  the  affine  coordinates  of  a 
point,  written  with  respect  to  some  basis  triple  of  points, 
is given by an elliptical region in affine space, where the
size, orientation and position of the region depend on the
actual  values  of  the  points  used  to  establish  the  affine  ba¬ 
sis.  Because  of  this,  we  have  shown  that  methods,  such 
as Geometric Hashing, which rely on building a lookup
table  of  invariant  coordinates  at  compile  time  can  no 
longer  do  so,  since  the  region  of  consistent  lookup  values 
associated  with  a  point  depends  on  information  that  is 
accessible only at run time. At the same time, we have
shown that the range of positions of a fourth point, defined
by some affine coordinates relative to an affine basis
and  where  there  is  bounded  uncertainty  in  the  positions 
of all of the points, is simply a disc whose size depends
only  on  the  affine  coordinates,  and  not  on  the  actual  val¬ 
ues  of  the  basis  points.  This  means  that  methods  such 
as Alignment, which project points back into the image
for  future  verification,  do  not  suffer  from  the  problem 
mentioned  above.  As  well,  this  analysis  allows  us  to  pre¬ 
cisely  define  the  regions  over  which  a  verification  system 
should look for supporting evidence of an object. Using
this, we have been able to analyse the false positive
rates  for  both  Geometric  Hashing  and  Alignment,  show¬ 
ing  how  both  will  degrade  with  increasing  noise  and  with 
increasing clutter, but also showing how, as originally
defined, the alignment method tends to perform better.
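
The two properties used in this argument, invariance of affine coordinates and the behavior of a reprojected point under bounded error, can be checked on a small example. The sketch below uses illustrative names and synthetic points. The radius function is derived here from the linearity of the reprojection (each basis point is perturbed by at most `eps`), and it omits any error in the fourth point's own measurement, so it is not quoted from the cited analysis.

```python
import numpy as np

def affine_coordinates(basis, p):
    """(alpha, beta) such that p = b0 + alpha*(b1 - b0) + beta*(b2 - b0)."""
    b0, b1, b2 = basis
    M = np.column_stack([b1 - b0, b2 - b0])
    return np.linalg.solve(M, p - b0)

def reprojection_radius(coords, eps):
    """Bound on the displacement of the reprojected point when each basis
    point moves by at most eps; depends only on the affine coordinates."""
    alpha, beta = coords
    return eps * (abs(1 - alpha - beta) + abs(alpha) + abs(beta))

basis = np.array([[0.0, 0.0], [4.0, 0.0], [0.0, 3.0]])
p = np.array([2.0, 1.5])
coords_before = affine_coordinates(basis, p)

# Apply an arbitrary (assumed) affine transformation to all four points.
A = np.array([[2.0, 1.0], [0.5, 3.0]])
t = np.array([5.0, -2.0])
basis_after = basis @ A.T + t
coords_after = affine_coordinates(basis_after, A @ p + t)

radius = reprojection_radius(coords_before, eps=1.0)
```

The coordinates agree before and after the transformation, and the radius is computed without reference to where the basis points actually lie in the image.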

Recently,  together  with  Tao  Alter,  we  have  extended 
this analysis to the case of an arbitrary 3D object, with
similar  conclusions.  The  analysis  derives  approximate 


expressions  for  the  range  of  values  of  the  parameters  of 
a  3D  transformation  that  are  consistent  with  the  pair¬ 
ing  of  triples  of  model  points  with  triples  of  uncertain 
image  points.  While  these  expressions  give  us  a  sense 
of the uncertainty associated with individual parameters
of the transformation, and allow us to predict overestimates
for the range of possible positions associated
with a projected model point, there is still room for tighter
bounds. Along these lines, Alter has been developing a
second  scheme  for  alignment-based  verification  of  rigid 
objects  in  the  presence  of  error.  He  uses  a  Monte  Carlo 
sampling  method  to  empirically  derive  the  shapes  and 
sises  of  the  propagated  uncertainty  regions  when  there 
is a “least-commitment” set of three correspondences,
and  has  been  extending  this  to  examine  how  the  regions 
shrink  when  more  correspondences  are  added,  and,  in 
particular, how many additional matches are needed before
the regions are expected to vanish. The results
of  this  analysis  should  enable  us  to  devise  methods  for 
conducting  a  probabilistic  search  for  a  complete  set  of 
correspondences,  so  as  to  more  efficiently  reach  the  goal. 
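
The flavor of such a sampling experiment can be conveyed with a planar analogue. The sketch below is not Alter's 3D method. It assumes a 2D affine alignment from three matched points, perturbs each image basis point uniformly within a disc of radius `eps`, and collects the resulting positions of a reprojected fourth point. All names and numerical values are illustrative.

```python
import numpy as np

def sample_in_disc(rng, radius, n):
    """n points drawn uniformly from a disc of the given radius."""
    r = radius * np.sqrt(rng.uniform(size=n))
    theta = rng.uniform(0, 2 * np.pi, size=n)
    return np.column_stack([r * np.cos(theta), r * np.sin(theta)])

def propagated_region(image_basis, coords, eps, n_samples=2000, seed=0):
    """Monte Carlo cloud of reprojected positions for a fourth model point."""
    rng = np.random.default_rng(seed)
    alpha, beta = coords
    weights = np.array([1 - alpha - beta, alpha, beta])
    # Independent bounded perturbation of each of the three basis points.
    noise = np.stack([sample_in_disc(rng, eps, n_samples) for _ in range(3)],
                     axis=1)                       # (samples, 3, 2)
    noisy_basis = image_basis[None, :, :] + noise
    return np.einsum("k,nkd->nd", weights, noisy_basis)

basis = np.array([[0.0, 0.0], [4.0, 0.0], [0.0, 3.0]])
coords = (0.4, 0.3)
cloud = propagated_region(basis, coords, eps=0.5)
nominal = np.array([0.3, 0.4, 0.3]) @ basis
spread = np.linalg.norm(cloud - nominal, axis=1).max()
```

The empirical spread of the cloud can then be compared with analytic bounds, or recomputed as further correspondences constrain the transformation.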

In  a  different  vein,  Ronen  Basri  has  examined  the  er¬ 
ror  characteristics  of  recognition  of  smooth  objects.  The 
recognition  of  objects  with  smooth  bounding  surfaces 
from  their  contour  images  is  considerably  more  compli¬ 
cated  than  that  of  objects  with  sharp  edges,  since  in  the 
former  case  the  set  of  object  points  that  generates  the 
silhouette  contours  changes  from  one  view  to  another. 
The  “curvature  method”,  developed  by  Basri  &  Ullman 
[1988],  provides  a  method  to  approximate  the  appear¬ 
ance  of  such  objects  from  different  viewpoints.  Basri 
has  derived  an  error  analysis  of  the  curvature  method. 
He  applies  the  method  to  ellipsoid  objects  and  computes 
analytically the error obtained for different rotations of
the  objects.  The  error  depends  on  the  exact  shape  of  the 
ellipsoid  (namely,  the  relative  lengths  of  its  axes),  and  it 
increases  as  the  ellipsoid  becomes  “deep”  (elongated  in 
the  Z-direction).  He  shows  that  the  errors  are  usually 
small,  and  that,  in  general,  a  small  number  of  models  is 
required  to  predict  the  appearance  of  an  ellipsoid  from 
all  possible  views.  Finally,  he  has  shown  experimentally 
that  the  curvature  method  applies  as  well  to  objects  with 
hyperbolic  surface  patches. 

7  Low  level  processing 

While  much  of  our  work  has  focused  on  recognition,  we 
have  also  continued  investigations  into  other  aspects  of 
visual  processing,  most  notably  low  level  measurement 
processes  that  supply  the  data  on  which  recognition  and 
navigation  methods  apply. 

7.1  Motion 

Work  on  exploiting  the  fixation  of  image  sequences  is 
continuing. Ali Taalebinezhaad [1990, 1991] has developed
a simple method for creating a synthetic fixated sequence
from a ‘normal’ (unfixated) image sequence. The
result  can  be  used  by  his  direct  motion  vision  algorithm 
for  recovering  translation  and  rotation  of  the  camera, 
without  requiring  the  explicit  computation  of  either  fea¬ 
ture  correspondence  or  optical  flow.  Work  is  currently 
underway to test an implementation of the method on
real  image  data. 

Non-linear  optimisation  methods  for  recovering  cam¬ 
era  motion  are  being  explored  by  David  Michael.  They 
yield very good results from three-image sequences. Unfortunately,
they are computationally very expensive, so
they mostly serve at the moment as a way of getting
‘ground truth’ about actual camera motion. David is
also  working  on  better  methods  for  obtaining  calibrated 
image  sequences.  This  is  useful  since  it  is  very  difficult  to 
get  accurate  information  by  mechanical  means  (i.e.  ac¬ 
curate  enough  measurements  of  camera  translation  and 
rotation). 

Several  algorithms  for  finding  the  focus  of  expansion 
(FOE) - the point towards which the camera is moving
- have been simulated and compared by Ignacio S.
McQuirk  [1991]  on  synthetic  and  real  image  sequences. 
He  was  able  to  make  a  selection  of  a  good  least  squares 
direct  motion  vision  algorithm.  This  is  now  being  im¬ 
plemented  in  analog  VLSI  in  research  sponsored  under 
another contract (from NSF and DARPA jointly).

7.2  Calibration 

Together  with  John  Wyatt,  Berthold  Horn  has  devel¬ 
oped  a  new  ‘camera  calibration’  method  (interior  ori¬ 
entation).  Before  a  camera  can  be  used  for  metric  pur¬ 
poses,  the  principal  distance  (a.k.a  effective  focal  length) 
and  the  principal  point  (where  the  perpendicular  from 
the  projection  center  pierces  the  image  plane)  must  be 
known.  Present  methods  for  calibrating  cameras  require 
expensive fixtures, granite tables and temperature-controlled
rooms. The new method, implemented by Jesus
Domingues [1991], depends on the fact that the image
of a sphere under perspective projection is an ellipse
with major axis passing through the principal point, and
with eccentricity a function of the ratio of the distance
of the ellipse from the principal point to the principal
distance. A least squares procedure obtains good estimates
of interior orientation from a large number of
images  of  black  balloons  against  a  bright  background. 
Such  a  method  makes  it  possible  to  calibrate  cameras  in 
an  industrial  setting  where  the  complex  equipment  used 
by  photogrammetrists  b  not  available.  It  abo  does  not 
reqnbe  a  carefully  machines  calibration  object.  It  may 
prove  useful  in  mobile  robotics,  where  cameras  may  be 
jarred  and  need  to  be  recalibrated  from  time  to  time. 
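
The dependence of the eccentricity on position can be made explicit under a simple model. The notation here is assumed for illustration and is not taken from the original work: the principal point is the image origin, the image plane is z = f (f the principal distance), the direction to the sphere's center is (sin θ, 0, cos θ), and the sphere subtends a half-angle α with θ + α < 90°.

```latex
% Rays tangent to the sphere form a cone with axis (\sin\theta, 0, \cos\theta)
% and half-angle \alpha. Its intersection with the image plane z = f is
\begin{align}
(x\sin\theta + f\cos\theta)^2 &= \cos^2\alpha\,(x^2 + y^2 + f^2), \\
(\cos^2\alpha - \sin^2\theta)\,x^2 - 2f\sin\theta\cos\theta\,x + \cos^2\alpha\,y^2
  &= f^2(\cos^2\theta - \cos^2\alpha).
\end{align}
% The conic is symmetric about y = 0, so its major axis lies on the x-axis,
% which passes through the principal point. The ratio of squared semi-axes is
\begin{equation}
\frac{b^2}{a^2} = \frac{\cos^2\alpha - \sin^2\theta}{\cos^2\alpha}
\quad\Longrightarrow\quad
e = \sqrt{1 - \frac{b^2}{a^2}} = \frac{\sin\theta}{\cos\alpha}.
\end{equation}
% With \rho the distance from the principal point to the image of the sphere's
% center, \tan\theta = \rho / f, so
\begin{equation}
e = \frac{1}{\cos\alpha}\,\frac{\rho/f}{\sqrt{1 + (\rho/f)^2}}.
\end{equation}
```

In this model the eccentricity vanishes when the sphere is centered on the optical axis and grows with ρ/f, which is what allows many ellipses at different positions to constrain both the principal point and the principal distance.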

7.3  Lines,  Edges  and  Regions 

David  Mercer  [1991]  has  implemented  and  improved  a 
line-finding  method  (as  opposed  to  edge  finding).  ‘Lines’ 
are  ridges  and  valleys  in  the  brightness  surface.  They 
often  carry  information  complementary  to  that  found  in 
edges,  and  also  carry  useful  information  in  the  case  of 
images that do not have any real edges. The method
depends on finding the zero-crossings of the dot-product
of the gradient vector and the direction of most negative
(or most positive) second directional derivative. To be
useful, the output has to be filtered by considering both
the magnitude of the gradient and the magnitude of the
Laplacian. Line drawings - as opposed to edge drawings
- carry useful information in the case of images of natural
objects - as opposed to polyhedral man-made objects.
They provide for tremendous data reduction and hence
bandwidth reduction in video transmission. The result
is presented on a grey background, with bright lines in
white and dark lines in black. Interestingly, an approximation
of the original image can be reconstructed from
a line drawing combined with an edge drawing. The
result is nowhere near correct photometrically, but looks
very similar to the original image.
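
The zero-crossing criterion can be sketched in a few lines. The version below is an illustrative approximation rather than the implementation described above: it assumes finite-difference derivatives from `numpy.gradient`, detects only bright lines (ridges), and filters with a hypothetical Laplacian threshold alone, omitting any smoothing and the gradient-magnitude test.

```python
import numpy as np

def bright_lines(image, laplacian_thresh=1e-3):
    """Mark ridge pixels of a 2-D brightness array (illustrative sketch)."""
    E = np.asarray(image, dtype=float)
    Ey, Ex = np.gradient(E)            # first derivatives (axis 0 = y, axis 1 = x)
    Exy, Exx = np.gradient(Ex)         # second derivatives of E
    Eyy, _ = np.gradient(Ey)

    # Angle of the most negative second directional derivative: the
    # maximising angle is 0.5*atan2(2*Exy, Exx - Eyy); add 90 degrees.
    phi = 0.5 * np.arctan2(2.0 * Exy, Exx - Eyy) + 0.5 * np.pi
    nx, ny = np.cos(phi), np.sin(phi)
    # The direction is defined only up to sign; pick one half-plane so
    # that neighbouring pixels can be compared.
    flip = (nx < 0) | ((np.abs(nx) < 1e-12) & (ny < 0))
    nx = np.where(flip, -nx, nx)
    ny = np.where(flip, -ny, ny)

    s = Ex * nx + Ey * ny              # gradient component along that direction

    # Sign changes of s between horizontal or vertical neighbours.
    crossing = np.zeros(E.shape, dtype=bool)
    crossing[:, :-1] |= s[:, :-1] * s[:, 1:] < 0
    crossing[:-1, :] |= s[:-1, :] * s[1:, :] < 0

    # Keep crossings where the Laplacian is clearly negative (bright ridge).
    return crossing & (Exx + Eyy < -laplacian_thresh)

# Synthetic vertical ridge centred between columns 10 and 11.
cols = np.arange(21)
ridge = np.tile(np.exp(-(cols - 10.5) ** 2 / 18.0), (15, 1))
mask = bright_lines(ridge)
```

Dark lines (valleys) would use the most positive second directional derivative and a positive Laplacian test in the same way.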

In a related development, Lisa Dron has implemented
a new segmentation algorithm that allows matching of
images even when they are taken under quite different
lighting or climatic conditions. One feature of the algorithm
is that the boundaries between regions are found
accurately - not displaced by the prior smoothing employed
by some alternative algorithms. One possible application
of the new algorithm is in image compression, since an
image can be reconstructed from the segmented image
along with some auxiliary information using, for example,
a resistive grid. The result is more accurate than the
method described above - i.e. the grey levels are actually
close to those of the original image - but requires that
more information be transmitted. Efforts are also underway
to implement this algorithm in an analog VLSI
test circuit.

7.4  Contour  Texture 

An oak leaf can be easily distinguished from an elm leaf,
yet the same oak leaf cannot be distinguished from other
oak leaves unless a detailed inspection is performed. Brian
Subirana has been examining the idea that complex region
boundaries are processed by the visual system so as
to obtain a simple abstract description, called Contour
Texture, which contains much less information than that
provided by the location of all the points of the contour.
The frequency of protrusions (regardless of their shape),
the dominant orientation or the average curvature along
the curve are examples of contour texture properties.
Contour texture may be useful in several domains, including
recognition, indexing and grouping. While Contour
Texture clearly differs from traditional image texture,
the current scheme is based on existing filter-based
approaches to texture (e.g. [Malik and Perona 1990]).

7.5  Invariants 

Berthold Horn has begun to investigate “invariants” that
apply in the case of perspective projection - as opposed
to collineation. The idea is, of course, to be able to recognise
objects without explicitly recovering the attitude of
the object, or to filter the search in recognition
and attitude finding. An example of this may be found in
the Lisp book (Winston & Horn) where such methods are
used for matching images of star fields. While allowing
arbitrary linear transformations leads to areas of mathematics
that have been explored before, it gives up a lot
of constraint compared with an approach that, while less
tractable mathematically, models the actual image formation
process more accurately. An invariant has been developed,
for example, for four points that are purported to be the
orthographic projection of the vertices of a cube.


7.6  Integration

Earlier reports have described our work on using a
Markov Random Field model for integrating different
visual cues. Recently, Horn has been looking at an alternative
methodology for intimate integration of early
vision modules. The calculus-of-variations approach to
machine vision can be extended to deal with multiple
cues by building more complex combinations of “penalty
terms”. This approach avoids some of the ad hoc methods
used to combine outputs from independent modules.
The trick is to find a functional that leads to a stable,
convergent scheme for solution.

One  such  project  is  that  of  Clay  Thompson  on  inte¬ 
gration  of  binocular  stereo  and  shading  under  funding 
from  NASA.  Stereo  and  shading  have  complementary 
properties:  situations  where  one  is  weak  often  are  just 
those where the other is strong. Shading cannot give absolute
height or very low spatial frequency components.
Stereo is of no use in featureless areas and smears height
features, but is very good at giving absolute height. Unfortunately,
finding a good combined functional is not
easy; we have looked at about a dozen so far.

7.7  Early  vision  representations 

Steve  White  has  been  working  on  a  simple  nonlinear 
model  of  early  vision  which  preserves  important  prim¬ 
itive  edge  feature  information  such  as  location,  orienta¬ 
tion,  contrast  and  focus.  The  goal  has  been  to  integrate 
this  simple  model  in  such  problem  domains  as  stereo, 
motion and object recognition. Stereo has been the primary
experimental focus in the past year, with considerable
success. Current work is exploring the extension
of the approach to model matching. A special area of
interest is the possibility that this model is consistent
with cortical processing. A significant amount of
experimental evidence, ranging from simple cell receptive
fields and shunting inhibition properties to stereo
near/far and tuned representations and other complex
cell properties, suggests that cortical processing may
be very much like the proposed model of early vision
processing. Details of the approach are described in a
separate paper in the proceedings.

8  Using  visual  routines 

While  much  of  our  work  has  focused  on  specific  visual 
processing  modules,  we  are  also  interested  in  the  appli¬ 
cation  of  these  methods  to  realistic  tasks.  For  exam¬ 
ple,  some  of  our  earlier  work  on  object  recognition  has 
been  used  in  an  industrial  application,  inspecting  tub¬ 
ing  against  a  set  of  specifications.  Navigation  presents 
a second opportunity for connecting vision systems with
real  world  tasks. 

In  this  vein,  Ian  Horswill  [1991]  has  been  focusing  on 
connecting perception to action in the context of concrete,
routine activities. To do this, he has been examining
particular activities in particular contexts, such as
traveling down a corridor or interacting with a person.
Given specifications of these activities, one can define
specific pieces of perceptual information which will suffice
to perform them. For example, the direction of the
corridor  relative  to  the  direction  of  travel  and  the  prox¬ 
imity  of  the  walls  of  the  corridor  are  sufficient  to  navigate 
the  corridor  provided  that  some  other  process  is  avoid¬ 
ing  obstacles.  The  extraction  of  each  of  these  pieces  of 
information  can  then  be  phrased  as  a  formal  computa¬ 
tional  problem  which  can  be  studied  using  the  standard 
tools  of  computational  vision. 

Focusing  on  the  extraction  of  these  relatively  high 
level  pieces  of  information  has  advantages.  A  broad 
range  of  systems  may  be  able  to  extract  the  same  piece 
of  information  in  very  different  ways.  For  example,  spe¬ 
cial  properties  of  the  domain  can  be  taken  into  account 
to  simplify  a  computational  problem.  These  properties 
can  be  analysed  and  cataloged  for  use  in  similar  prob¬ 
lems.  Another  advantage  is  that  the  activity  for  which 
a  piece  of  information  is  needed  provides  well-defined 
speed  and  accuracy  requirements,  and  an  operational 
evaluation  criterion  with  which  a  particular  system  may 
be  judged. 

To  date,  four  working  systems  have  been  developed. 
Three  are  useful  for  navigating  in  office  environments  -  a 
simple  corridor  follower,  a  system  which  follows  a  broad 
class  of  moving  objects,  and  a  simple  fail-safe  system 
for  preventing  collisions.  All  of  these  systems  use  tacit 
knowledge  of  their  domain  -  office  buildings  with  carpets 
having  uniform  coloration  -  to  simplify  the  process  of 
figure/ground  separation.  The  last  system  is  a  simple 
stereo  proximity  detector  for  use  in  collision  avoidance. 
The  system  is  quite  efficient  because  it  need  only  concern 
itself  with  a  specific  depth  plane,  rather  than  having  to 
create a dense depth map. All of the above systems run
in near real time on stock hardware.

Current  work  is  focused  on  developing  an  autonomous 
robot  platform  which  we  plan  to  program  to  give  prim¬ 
itive  "tours”  of  the  laboratory  using  a  number  of  these 
simple  perceptual  systems.  The  expectation  is  that  such 
a problem will test the range of applicability of this approach
to perception and the degree to which machinery
and  analysis  can  be  reused  between  the  tasks  of  extract¬ 
ing  different  pieces  of  information. 

9  Learning 

We  have  continued  to  develop  the  theory  of  learning  from 
examples  that  we  had  described  in  the  last  Proceedings 
and  we  have  substantially  extended  its  applications.  As 
we  had  discussed,  our  approach  to  the  problem  of  learn¬ 
ing  from  examples  considers  it  within  a  mathematical 
framework  as  a  problem  of  approximating  a  multivariate 
function  from  sparse  data.  We  have  developed  a  learn¬ 
ing/approximation  scheme  which  builds  upon  classical 
approximation  theory  and  which  is  equivalent  to  a  class 
of  multilayer  networks  with  one  "hidden”  layer  that  we 
call HyperBF networks. These networks (a) are based on
a rigorous theory with extensive ties to a large body of
classical results in applied mathematics; (b) can be easily
interpreted in terms of what the components do; and
(c) show good performance in several different domains.

Our  main  research  directions  are  four: 

1.  extending the theory

2.  developing efficient optimisation algorithms for the
type  of  non-convex  problems  required  by  the  most 
general  versions  of  HyperBF  networks 

3.  applying  the  technique  to  several  different  domains 

4.  exploring  the  potential  implications  for  neurobiol- 
ogy,  especially  concerning  the  biological  substrate 
of  object  recognition 

We  will  review  some  highlights  in  the  first  three  of 
these  directions  of  research  after  a  brief  summary  of  the 
technique. 

9.1  The  HyperBF  technique 

HyperBF networks (Poggio and Girosi, 1990, 1990a,
1990b,  1990c)  are  a  class  of  feedforward  networks  with 
one  layer  of  hidden  units  that  compute  functions  of  the 
form: 

    f(x) = \sum_{\alpha=1}^{n} c_\alpha G(\|x - t_\alpha\|_W^2) + p(x)    (3)

where G is any conditionally positive definite function
(Micchelli, 1986), p(x) is a polynomial of low degree, W
is a square matrix and \|\cdot\|_W indicates the following
weighted norm:

    \|x\|_W^2 = Wx \cdot Wx.    (4)

The coefficients c_\alpha, the “centers” t_\alpha and the matrix
W are found during the learning stage, by minimizing
a measure of the error between the network’s prediction
and each of the examples. After learning, the centers of
the basis functions are similar to prototypes, since they
are  points  in  the  multidimensional  input  space.  Updat¬ 
ing  the  centers  during  learning  is  therefore  equivalent 
to  modifying  the  corresponding  prototypes  and  corre¬ 
sponds  to  task-dependent  clustering.  Finding  the  opti¬ 
mal  weights  W  for  the  norm  is  equivalent  to  transform¬ 
ing  appropriately,  for  instance  scaling,  the  input  coordi¬ 
nates  and  corresponds  to  task-dependent  dimensionality 
reduction. 
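
A minimal numerical sketch of equations (3) and (4) may help fix the notation. It makes several simplifying assumptions that are not part of the technique as described: G is taken to be a Gaussian, p(x) is reduced to a constant, and the centers and W are held fixed so that only the coefficients are found, by linear least squares. Learning the centers and W as well is the non-convex problem discussed in the text. All function names are hypothetical.

```python
import numpy as np

def hyperbf_design(X, centers, W):
    """Basis responses G(||x - t||_W^2) with Gaussian G, plus a constant column."""
    diff = X[:, None, :] - centers[None, :, :]   # (N, n, d) differences x - t
    Wd = diff @ W.T                              # apply W to each difference
    r2 = np.sum(Wd ** 2, axis=-1)                # squared weighted norm, eq. (4)
    G = np.exp(-r2)                              # assumed Gaussian basis function
    return np.hstack([G, np.ones((X.shape[0], 1))])

def fit_coefficients(X, y, centers, W):
    """Least-squares coefficients c (and constant term) for fixed centers and W."""
    Phi = hyperbf_design(X, centers, W)
    coeffs, *_ = np.linalg.lstsq(Phi, y, rcond=None)
    return coeffs

def predict(X, centers, W, coeffs):
    """Evaluate f(x) of eq. (3) at the rows of X."""
    return hyperbf_design(X, centers, W) @ coeffs

# One center per example: the network then interpolates the examples.
X = np.linspace(-2.0, 2.0, 9)[:, None]
y = np.sin(X[:, 0])
centers = X.copy()
W = np.eye(1)
coeffs = fit_coefficients(X, y, centers, W)
y_hat = predict(X, centers, W, coeffs)
```

Using fewer centers than examples, or a non-identity W that rescales or mixes input coordinates, corresponds to the clustering and dimensionality-reduction interpretations given above.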

9.2  Theory 

Our  main  line  of  investigation  was  devoted  to  the  prob¬ 
lem  of  selecting  a  network  architecture,  because  it  is  one 
of the choices that strongly influence the final performance.
However, once an architecture has been chosen
there are other relevant problems that have to be solved.
One of these is related to the fact that in many cases the
available data may contain outliers, and standard procedures,
such as least-squares estimation, have to be modified
in this case. Here we show some results on these two
topics. 

9.2.1  Architecture  Selection 

Whenever  we  want  to  use  some  kind  of  network  to 
solve  a  problem,  two  fundamental  questions  arise:  a)  how 
many  hidden  units?  b)  which  activation  function  should 
the  hidden  units  compute?  We  considered  the  first  ques¬ 
tion  under  the  assumption  of  an  infinite  number  of  ex¬ 
amples.  The  number  of  units  needed  to  approximate  a 
function within a certain accuracy depends on the choice
of the activation function and on some characteristics of
the function that has to be approximated, such as its
dimensionality and degree of smoothness. For many classical
spaces of functions and choices of the activation function
the dependence of the number of hidden units on
the dimension is exponential, leading to the well-known
phenomenon of the “curse of dimensionality”. However, if
some constraints are imposed on the target functions,
better rates of convergence can be obtained. Using a
result by Jones (1990) about the rate of convergence of
iterative sequences in Hilbert spaces, we proved (Girosi
and Anzellotti, 1991) that there exist classes of functions
that can be approximated by a network of n radial units
with an error of order O(1/\sqrt{n}). The dimension of the
space influences the result only through a multiplicative
constant, and the result is constructive, in the sense that
it shows an iterative algorithm that can achieve this rate
of convergence. Similar results have been obtained by
Barron  (1991)  for  multilayer  perceptrons,  so  this  raises 
the  question  of  the  choice  of  the  activation  function,  on 
which  we  did  some  experimental  and  theoretical  work. 

Maruyama,  Girosi  and  Poggio  (1991a)  have  compared 
in  numerical  experiments  several  different  activation 
functions, and therefore different techniques for learning
from examples, considered as schemes for approximating
multivariate functions from sparse data. In particular,
they considered multilayer perceptrons with one
layer  of  sigmoidal  hidden  units,  flexible  Fourier  series, 
multilayer perceptrons with exponential activation functions,
Radial Basis Functions, and different forms of HyperBF
networks. They have characterized their approximation
performance (equivalent to generalization power)
according to L_2 and L_\infty measures on sparse data from
several  different  continuous  functions  of  two  and  more 
variables,  using  several  different  training  techniques.  All 
the techniques, except the one using exponential activation
functions, performed well on average, and
this led us to investigate possible relations between multilayer
perceptrons and Generalized Radial Basis Functions
(GRBF).

The  main  point  of  another  project  of  Maruyama, 
Girosi  and  Poggio  (1991b)  was  to  show  that  for  nor¬ 
malized  inputs,  multilayer  perceptron  networks  are  ra¬ 
dial  function  networks  (albeit  with  a  non-standard  radial 
function). This provides an interpretation of the weights
w as centers t of the radial function network, and therefore
as equivalent to templates. This insight may be useful
for practical applications, including better initialisation
procedures for MLPs. Maruyama et al. also analyse
the relation between the radial functions that correspond
to the sigmoid for normalized inputs and well-behaved
radial basis functions, such as the Gaussian. In particular,
they observed that the radial function associated
with the sigmoid is a good approximation to Gaussian
basis functions for a range of values of the bias
parameter. The implication is that
an MLP network can always simulate a Gaussian GRBF
network (with fewer parameters) but the converse is true
only for certain values of the bias parameter. Numerical
experiments indicate that the constraint is not always
satisfied in practice by MLP networks trained with backpropagation.
Multiscale GRBF networks, on the other
hand, can approximate MLP networks with a similar
number of parameters.
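
The reason a sigmoidal unit becomes radial for normalized inputs follows from a one-line identity. The symbols below (σ for the sigmoid, b for the bias) are chosen here for illustration; the only assumption is that every input satisfies \|x\| = 1.

```latex
\begin{align}
\|x - w\|^2 &= \|x\|^2 - 2\,w\cdot x + \|w\|^2 = 1 + \|w\|^2 - 2\,w\cdot x, \\
w\cdot x &= \tfrac{1}{2}\left(1 + \|w\|^2 - \|x - w\|^2\right), \\
\sigma(w\cdot x + b) &= \sigma\!\left(\tfrac{1}{2}\left(1 + \|w\|^2\right) + b
   - \tfrac{1}{2}\|x - w\|^2\right).
\end{align}
```

The unit's output therefore depends on x only through its distance from w, so w plays the role of a center and the remaining terms act as an effective bias that sets the shape of the radial function.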

9.2.2  Dealing  with  outliers 

Given n noisy observations g_i of the same quantity f,
it is common practice to estimate f by minimising
the function \sum_{i=1}^{n} (g_i - f)^2. From a statistical point
of view this corresponds to computing the Maximum
Likelihood estimate, under the assumption of Gaussian
noise. However, it is well known that this choice leads to
results that are very sensitive to the presence of outliers
in the data. For this reason it has been proposed to minimise
functions of the form \sum_{i=1}^{n} V(g_i - f), where V is a
function that increases less rapidly than the square. Several
choices for V have been proposed and successfully
used to obtain “robust” estimates. However, a justification
and interpretation for their use is still lacking.
We have shown (Girosi, 1991; Girosi, Caprile and Poggio,
1991) that, for a class of functions V that we call
“effective potentials”, using these robust estimators corresponds
to assuming that our measurements are affected by
Gaussian noise whose variance is a random variable with
a given probability distribution. Depending on the probability
distribution of the variance of the noise, different
shapes for V are obtained. In (Girosi, 1991) a characterisation
of the class of effective potentials has been given,
in terms of positive definite functions in Hilbert spaces.
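
The sensitivity of the quadratic choice, and the effect of a slower-growing V, can be seen in a small sketch. It assumes one particular potential, V(r) = log(1 + (r/s)^2), which is the negative log-likelihood of Cauchy noise (itself a Gaussian with randomly distributed variance); whether this V belongs to the class of effective potentials defined in the cited work is not claimed here. The minimisation uses iteratively reweighted averaging, and the function names and the scale parameter s are hypothetical.

```python
import numpy as np

def least_squares_estimate(g):
    """Minimiser of sum_i (g_i - f)^2: the sample mean."""
    return float(np.mean(g))

def robust_estimate(g, scale=1.0, iterations=50):
    """Stationary point of sum_i log(1 + ((g_i - f)/scale)^2).

    Setting the derivative to zero gives sum_i w_i (g_i - f) = 0 with
    w_i = 1 / (1 + ((g_i - f)/scale)^2), solved by fixed-point iteration.
    """
    g = np.asarray(g, dtype=float)
    f = float(np.median(g))                # robust starting point
    for _ in range(iterations):
        w = 1.0 / (1.0 + ((g - f) / scale) ** 2)
        f = float(np.sum(w * g) / np.sum(w))
    return f

# Five consistent observations and one gross outlier.
data = [0.9, 0.95, 1.0, 1.05, 1.1, 50.0]
f_ls = least_squares_estimate(data)
f_robust = robust_estimate(data)
```

The outlier drags the least-squares estimate far from the bulk of the data, while its weight in the robust estimate is close to zero.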

9.3  Algorithms

Learning the coefficients c_\alpha, the matrix W and the
centers t_\alpha that minimise an error functional on the
set of examples is a non-convex minimisation problem.
Gradient descent is probably the simplest approach for
attempting to find the solution to this problem. We have
explored an even simpler optimization technique that
can be successfully used to solve this class of problems
(Caprile, Girosi and Poggio, 1991). Our algorithm combines
aspects typical of many genetic algorithms (Goldberg,
1989) with others typical of random descent techniques
(Caprile and Girosi, 1990), such as the concept of
adaptive noise. We have tested the algorithm numerically in
a  variety  of  cases,  and  the  results  have  been  compared  to 
the  ones  obtained  by  using  a  standard  gradient  descent 
with  adaptive  step  technique.  In  all  the  cases  considered, 
the  best  local  minima  were  found  by  the  nondeterminis- 
tic  algorithm,  and  preliminary  experiments  suggest  that 
this  may  hold  true  also  for  a  class  of  minimization  prob¬ 
lems  wider  than  the  one  we  have  considered. 
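
To give a flavour of random descent with adaptive noise, the sketch below perturbs the current parameters with Gaussian noise, keeps a perturbation only if it lowers the loss, and enlarges or shrinks the noise level after each success or failure. This is a generic illustration, not the algorithm of the cited papers: it has no population or other genetic component, and the growth and shrink factors are arbitrary assumptions.

```python
import numpy as np

def random_descent(loss, x0, sigma=1.0, iterations=2000,
                   grow=1.5, shrink=0.9, seed=0):
    """Derivative-free descent with an adaptive noise level (illustrative)."""
    rng = np.random.default_rng(seed)
    x = np.asarray(x0, dtype=float)
    best = loss(x)
    for _ in range(iterations):
        candidate = x + sigma * rng.standard_normal(x.shape)
        value = loss(candidate)
        if value < best:
            x, best = candidate, value     # accept the improvement
            sigma *= grow                  # be bolder after a success
        else:
            sigma *= shrink                # be more cautious after a failure
    return x, best

# Simple test problem: a shifted quadratic bowl.
target = np.array([1.0, -2.0])
quadratic = lambda x: float(np.sum((x - target) ** 2))
x_min, loss_min = random_descent(quadratic, x0=[4.0, 3.0])
```

No gradients are needed, which makes the scheme easy to apply to error functionals over coefficients, centers and W alike; on non-convex problems the result still depends on the starting point and the random seed.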

9.4  Applications 

We  are  applying  the  HyperBF  technique  to  several  dif¬ 
ferent  domains:  3D  object  recognition;  synthesis  of  algo¬ 
rithms  for  early  visual  tasks,  such  as  hyperacuity  tasks; 
computer  graphics;  time  series  analysis;  adaptive  con¬ 
trol;  indoor  vision-driven  autonomous  navigation.  We 
briefly  discuss  three  of  them: 

9.4.1  Object  Recognition 

Edelman and Poggio (1990) applied the HyperBF technique
to the problem of 3D object recognition with
promising results. They have been able to synthesise a
module that can recognize an object from any viewpoint,
after it learns its 3D structure from a small set of 2D
perspective views, using the HyperBF network scheme.
Their results were obtained with simulated wireframe
objects and assumed that the problems of feature extraction
and matching were already solved. The problems of
occlusions and spurious features were ignored. We have
now successfully extended the technique to work with
gray level images of real paper clips (Brunelli and Poggio,
1991).

It is interesting to mention that psychophysical experiments
by Bülthoff and Edelman carried out on wireframe
objects and other objects confirm that “immediate”
3D object recognition in humans seems to be based
on a process of 2D view interpolation rather than on
the use of 3D models.

We have also begun to apply HyperBF networks to
the problem of recognizing faces, using a small set of images
of any given face as examples. This is under the
assumption that a few views for each person are available
to train the network (our estimate for a generic 3D
object is between 20 and 100 2D views). The theoretical
lower limit is two views (for the visible aspect) (Basri
and Ullman, 1990; Poggio, 1990a). We have therefore
begun work aimed at characterizing how recognition from
just one 2D view may be done if views of other (“prototypical”)
objects of the same class are available (Poggio,
1991). Clearly one single view of a 3D object (if shading
is neglected) does not contain sufficient 3D information.
If, however, the object belongs to a class of similar objects
(prototypes) of which many views are known, it
seems possible to make reasonable extrapolations and
to guess correctly other views of the specific object from
just one 2D view of it. We are certainly able to recognise
faces turned away 20-30 degrees from frontal from just one
frontal view, presumably because we exploit our extensive
knowledge of the typical 3D structure of faces. At
this point one can pose the following problem: from one
2D view of a 3D object generate other views, exploiting
knowledge of views of other objects of the same class. If
this can be done, we can then use Poggio and Edelman’s
technique - and its extensions - by using the views we
have generated as a training set. The point is to generate
artificial examples of deformations for the specific object
of interest by extracting information about allowed deformations
from a set of examples of objects of the same
class, using standard approximation techniques. Poggio
(1991) discusses under which conditions and definitions
of class this goal can be achieved.

9.4.2  Time series analysis

Jim  Hutchinson  and  Tomaso  Poggio  are  engaged  in  the 
study  of  learning  architectures,  their  parallel  implemen¬ 
tations,  and  their  applications  to  large,  real  world  prob¬ 
lems  in  time  series  prediction.  The  goals  of  this  work 
are  to  investigate  the  potential  of  parallel  implementa¬ 
tions  to  help  with  problems  of  parameter  estimation, 
handling  of  large  problems,  and  use  of  previously  in¬ 
tractable  methods;  to  assess  the  applicability  and  useful¬ 
ness  of  various  learning  architectures  to  the  problem  of 
time  series  prediction;  to  determine  appropriate  ways  of 
achieving domain specific goals in time series modeling,
especially obtaining estimates of model fit (i.e. variance
of outputs) and methods for iterating predictions; and to
determine appropriate ways of handling domain specific
problems in time series modeling, such as limited sample
size, embedding a priori structure into the learning architecture,
and selecting and transforming useful inputs
from  a  collection. 

Results to date from this work are fairly preliminary.
We have shown that the Radial Basis Function class of
learning  methods  can  efficiently  be  implemented  on  the 
Connection  Machine  for  solving  large  problems.  We  have 
investigated  various  mechanisms  for  embedding  the  time 
series  prediction  problem  in  the  Radial  Basis  Function 
framework,  and  have  preliminary  results  that  such  sys¬ 
tems  outperform  corresponding  traditional  linear  models 
on  an  interesting  class  of  financial  time  series. 

9.4.3  Early visual tasks

We are beginning to apply the HyperBF technique to
explain the fast acquisition of visual abilities in simple
tasks from a few examples of the task. Poggio, Fahle and
Edelman (1991) were able to show that networks which
solve specific visual tasks, such as the evaluation of spatial
relations with hyperacuity precision, can be easily
synthesised from a small set of examples, using the HyperBF
technique. This may have significant implications
for the interpretation of many psychophysical results in
terms of neuronal models. It also has potential practical
implications in terms of vision architectures that can
learn from a set of examples to perform specific visual
tasks such as inspection tasks without explicit ad hoc
programming.

1  INTRODUCTION

The  past  year  has  been  a  very  productive 
one for the researchers in the Vision/Robotics
Laboratory  at  Columbia  University.  Since  our 
last  report,  we  have  added  a  new  faculty  mem¬ 
ber,  Shree  K.  Nayar,  who  has  continued  his 
work  in  physics-based  vision  at  Columbia.  The 
coming  year  heralds  a  new  location  and  affilia¬ 
tion  for  the  Vision/Robotics  laboratory.  We  will 
be  moving  to  a  new  building  on  the  Columbia 
Campus,  the  Schapiro  Center  for  Engineering 
and Physical Sciences Research. We will effectively
be doubling our current space, and moving
to entirely new and up-to-date Laboratory
space. As part of this move, we have formed
the new Center for Research in Intelligent Systems
(CRIS) which will be focusing on research
in  perception  and  cognition,  with  Image  Under¬ 
standing  being  a  key  component  of  the  Center’s 
research  agenda.  Our  fellow  faculty  members 
in  the  Center  include  vision  researchers,  roboti¬ 
cists,  AI  researchers,  and  Computer  Graphics 
researchers.  This  is  a  multi-disciplinary  center 
that  will  strive  to  integrate  results  from  each  of 
these  fields. 

In  the  sections  below,  we  will  outline  the 
work  that  has  occurred  over  the  last  year  in  the 
following  areas  of  research: 

1.  Physics-Based  Vision 

2.  Low-Level and Middle-Level Vision

3.  Shape  Recovery  Methods 

4.  Real-Time  Vision 

‘This  work  was  supported  in  part  by  DARPA  contract 
N00039-84-C-0165. 


5.  Sensor  Planning  and  Modeling 

6.  Topological  Navigation 

7.  Dexterous  Robotic  Hands 

Technical  details  can  be  found  in  the  ref¬ 
erences  and  in  the  papers  in  this  volume  from 
Columbia’s  researchers. 

2  PHYSICS-BASED  VISION 
2.1  Extending  POLARIS 

The  POLARIS  system,  which  uses  polar¬ 
ization  information  for  various  vision  tasks,  has 
been  under  development  at  Columbia  for  the 
past  3  years.  Last  year  we  extended  POLARIS 
to include edge labeling. This was added to the
system's previous abilities, which were: near-real-time
material surface classification, separation
of highlight/diffuse components of a scene, and
local surface orientation. The physical underpinnings
of POLARIS, as well as some of its
capabilities  can  be  found  in  [Wolff  and  Boult, 
1991]. 

In  [Boult  and  Wolff,  1991]  we  presented  a 
technique  that  relies  on  polarization  informa¬ 
tion  to  distinguish  among  3  types  of  commonly 
occurring  edges: 

occluding  or  limb  edges  which  are  edges  pro¬ 
duced  when  the  surface  turns  smoothly  away, 
(i.e.  the  surface  normals  become  nearly  or¬ 
thogonal  to  viewing  direction  as  in  the  sides 
of a cylinder),

specular  edges  which  are  produced  from  specu¬ 
lar  reflections  of  an  incident  intensity  edge, 
e.g.,  reflection  of  light  source  boundaries  or 
“edges” in secondary light sources, and

albedo/physical edges which are a combination
of  all  other  edges  including  image  edges  pro¬ 
duced  from  albedo  variations,  sharp  surface 
discontinuities,  and  shadows. 

The  method  has  been  demonstrated  on  labo¬ 
ratory  images,  and  an  example  can  be  seen  in 
figure  1. 

2.2  Large-Scale  Reflectance  Model 

The  Lambertian  reflectance  assumption  is 
one  of  the  most  widely  used  assumptions  in  ma¬ 
chine  vision.  We  have  discovered  that  Lamber¬ 
tian  surfaces  with  geometric  variations  exhibit 
different  reflectance  characteristics  depending 
on  the  resolution  of  the  sensor  used  to  image  the 
surface  [Nayar  91c].  When  the  area  viewed  by  a 
sensor  element  is  large  compared  to  the  dimen¬ 
sions  of  the  surface  variations,  the  reflectance 
properties  are,  in  general,  non-Lambertian.  We 
have  developed  a  reflectance  model  that  de¬ 
scribes  large  scale  reflectance  from  Lambertian 
surfaces.  The  surface  is  modeled  as  a  collection 
of  facets,  where  the  area  of  each  facet  is  small 
compared  to  the  surface  area  viewed  by  individ¬ 
ual  sensor  elements.  The  geometrical  structure 
of  the  surface  is  then  described  by  a  probability 
distribution  function  for  the  facet  orientations. 
Using  this  surface  model,  we  have  derived  a  re¬ 
flectance  model  that  accounts  for  geometrical 
effects  such  as  masking  and  shadowing  by  ad¬ 
jacent  facets  as  well  as  radiometric  effects  such 
as  interreflections.  We  are  currently  conducting 
experiments  to  evaluate  the  performance  of  this 
reflectance  model. 

2.3  Interreflections  and  Shape  Recovery 

Points  in  the  scene,  when  illuminated,  re¬ 
flect  light  not  only  in  the  direction  of  the  sen¬ 
sor  but  also  between  themselves.  These  inter¬ 
reflections  can  appreciably  alter  the  appearance 
of the scene. All shape-from-intensity methods
are based on the assumption that points in
the  scene  are  illuminated  only  by  the  sources  of 
light;  interreflections  are  assumed  not  to  exist. 
Consequently,  these  methods  produce  erroneous 
results  when  applied  to  concave  surfaces. 

We  have  developed  an  algorithm  that  recov¬ 
ers  accurate  shape  information  in  the  presence 
of interreflections. This solution to the interreflection
problem is valid for Lambertian surfaces
with possibly varying and unknown color.
First,  the  photometric  stereo  method  is  applied 
to  the  concave  surface  to  obtain  pseudo  (erro¬ 
neous)  estimates  of  shape  and  color.  We  have 
shown  [Nayar  92b]  that  the  pseudo  shape  and 
color,  though  erroneous,  can  be  mathematically 
related  to  the  actual  shape  and  color  of  the  sur¬ 
face.  A  recovery  algorithm  has  been  developed 
that  uses  this  relation  to  iteratively  recover  the 
actual  shape  and  color  from  the  pseudo  esti¬ 
mates.  Figure  2a  shows  a  gray-colored  con¬ 
cave  Lambertian  surface  of  constant  reflectance 
(albedo  =  0.75),  and  Figure  2b  shows  the  erro¬ 
neous  shape  extracted  using  photometric  stereo. 
Figure  2c  shows  the  iterative  recovery  of  the  ac¬ 
tual  shape  from  the  pseudo  shape. 

2.4  Chromatic  Aberration  Correction 
via  Image  Warping 

The  problem  of  chromatic  aberration  arises 
because  different  wavelengths  of  light  are  re¬ 
fracted  differently  by  the  elements  of  a  lens.  Un¬ 
fortunately,  this  means  that  the  image  of  an  ob¬ 
ject  is  blurred  and  distorted.  In  color  imaging 
these  distortions  cause  measurable  differences 
between the images. We have recently shown how
to use our new image reconstruction filters and
previous work on image warping to correct
chromatic aberration. The technique
has been demonstrated and analyzed on 2 test
cases. It has also been directly compared
to the active optics approach from CMU
and does quite well; see [Boult and Wolberg,
1991] and our paper in these proceedings [Boult
and Wolberg, 1992].

Figure 1: POLARIS edge-labeling for a simple scene. The mug has a paper label on it and the
glass plate is specularly reflecting a pattern of circles. Upper left shows the initial intensity image,
upper right shows the percent-polarization (PP) image (white = 0%, black > 30%). Lower left shows
RMS-error edges in black and albedo/physical edges in gray. Lower right shows occluding edges in
black and specular edges in gray.

3 LOW AND MID-LEVEL VISION

3.1 Motion Segmentation

Our work in this area addresses the problem
of motion segmentation using the Singular
Value Decomposition of a feature track matrix.
The work builds upon the SVD factorization
approach pioneered by Tomasi and Kanade at
CMU [Tomasi and Kanade, 1990]. In a recent
paper [Boult and Brown, 1991] we have shown
that, under general assumptions, the number
of numerically nonzero singular values can be
used to determine the number of motions. Furthermore,
the motions can be separated using
the  right  singular  vectors  associated  with  the 
nonzero  singular  values.  We  also  derived  a  re¬ 
lationship  between  a  good  segmentation,  the 
number  of  nonzero  singular  values  in  the  input 
and  the  sum  of  the  number  of  nonzero  singu¬ 
lar  values  in  the  segments.  The  approach  has 
been demonstrated on real and synthetic examples
(see the paper in these proceedings [Boult
and Brown, 1992]).
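
The singular-value count described in Section 3.1 can be sketched on synthetic data. The example below assumes orthographic projection, generic 3-D points and unregistered tracks, so that each independently moving rigid object contributes four nonzero singular values to the track matrix; these modeling choices, the function names and the tolerance are assumptions of this sketch and are not taken from [Boult and Brown, 1991].

```python
import numpy as np

def numerical_rank(track_matrix, rel_tol=1e-8):
    """Count singular values that are nonzero relative to the largest one."""
    s = np.linalg.svd(track_matrix, compute_uv=False)
    return int(np.sum(s > rel_tol * s[0]))

def synthetic_tracks(rng, n_frames, n_points):
    """2F x P track matrix of one rigid object under orthographic projection."""
    pts = rng.standard_normal((3, n_points))        # object points in 3-D
    rows = []
    for _ in range(n_frames):
        # Random orthonormal frame standing in for the object's orientation.
        q, _ = np.linalg.qr(rng.standard_normal((3, 3)))
        t = rng.standard_normal((2, 1))             # image-plane translation
        rows.append(q[:2] @ pts + t)                # x and y rows of this frame
    return np.vstack(rows)

rng = np.random.default_rng(0)
one_motion = synthetic_tracks(rng, n_frames=10, n_points=12)
two_motions = np.hstack([one_motion,
                         synthetic_tracks(rng, n_frames=10, n_points=12)])
rank_one = numerical_rank(one_motion)
rank_two = numerical_rank(two_motions)
```

In this noise-free setup the two ranks come out as 4 and 8, so dividing by four gives the number of motions. With real tracks the choice of `rel_tol` matters, since noise makes every singular value nonzero.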

3.2  Energy-Based  Segmentation 

Our  work  in  this  area  is  continuing  with 
our  most  recent  results  described  in  [Boult  and 
Lerner,  1990]  and  [Boult  and  Lerner,  1991]. 
Briefly,  the  technique  computes  a  functional 
form of the minimal energy surface interpolating
a number of data points and also computes
a closed form approximation to the bending energy
of this surface. It uses heuristics to group
points into initial clusters and fits minimal energy
surfaces to these seed clusters. It then
grows  these  surfaces  so  as  to  maintain  (approxi¬ 
mately)  minimal  energy  surfaces,  putting  points 
onto  those  surfaces  for  which  it  causes  minimal 
perturbation  in  the  bending-energy.  The  tech¬ 
nique  handles  overlapping  and  transparent  sur¬ 
faces;  in  fact,  these  are  often  among  the  easiest 
test  case  examples  for  it.  The  technique  has 
now  been  demonstrated  on  numerous  real  and 
synthetic examples and is being written up in the
dissertation of M. Lerner [Lerner, 1992].

Figure 2: (a) A concave surface; (b) Its pseudo shape extracted using photometric stereo; (c) Recovery of
the actual shape from the pseudo shape.


3.3 New Models for Image Restoration/Reconstruction

Using the mathematics of Information-Based
Complexity we have recently developed
new models for image reconstruction/restoration.
These models assume knowledge
of the point spread function and the spatial
sampling characteristic of the imaging sensor.
This spatial sampling results in a blurring
of  the  signal  within  a  single  pixel.  We  show  how 
to  use  a  model  of  this  blurring  to  derive  new  lo¬ 
cal  reconstruction  algorithms  with  performance 
equal  to  or  better  (depending  on  your  choice  of 
metrics)  than  previous  methods  including  some 
global  methods  like  cubic-spline  interpolation. 
The  techniques  are  described  in  [Wolberg  and 
Boult,  1991]. 

4  SHAPE  RECOVERY  METHODS 

4.1  Shape  from  Darkness 

The  error  behavior  of  the  discrete  shape 
from darkness problem [Kender and Smith,
1987]  has  been  explored  and  formalized.  Er¬ 
rorful  shadows  are  shown,  under  some  condi¬ 
tions,  to  prevent  the  highly  non-linear  recovery 
algorithm  from  converging.  Specifically,  non- 
convergence  is  seen  to  be  caused  by  closed  loops 
of  shadow  constraints  that  attempt  to  modify 
the  position  of  a  global  reference  point.  Such 
loops can be as small as two shadow observations
in number, and they have a peculiar grammar
defining their character. These loops can
be  detected  in  the  course  of  the  attempted  sur¬ 
face  recovery,  and  the  extent  (although  not  the 
cause)  of  their  error  quantified.  Algorithms  that 
attempt  to  adjust  shadow  positions  minimally, 
under various definitions of "minimal", in order
to  reestablish  convergence  have  been  devised 
and  are  being  tested.  The  resultant  surfaces 
are  therefore  near  to  the  expected  ideal  surface; 
formalization  of  expected  error  in  the  surface 
recovery  due  to  these  adjustments  is  being  pur¬ 
sued. 

4.2  Flexible  Extruded  Objects 

A difficult problem in vision is modeling
and  recovering  the  shape  of  flexible  extruded 
objects  such  as  wires  and  cables.  A  Hough-like 
parameter  space  technique  for  modeling  flexible 
extruded  objects  as  piecewise  toroidal  has  been 
analyzed,  and  a  novel  transform  has  been  im¬ 
plemented  that  derives  their  three-space  curved 
axes  from  position  and  surface  normal  informa¬ 
tion.  The  method  is  purely  local,  and  succeeds 
where attempts to model objects as being piecewise
cylindrical fail. Although the local computation
involves 15 free variables (for three points
each: three of position, two of orientation), it does
not involve the iterative solution of non-linear
equations. It has been demonstrated on synthetic
and real range data.

Because  the  torus  is  an  object  with  seven 
free  parameters,  this  work  also  has  demon¬ 
strated  the  robustness  of  the  parameter  space 
approach,  even  for  high  order  objects.  Better, 
it  has  demonstrated  that  the  structure  of  the 
parameter  spaces  themselves  can  be  chosen  to 
counteract  the  triangulation  error.  Errors  that 
occur  in  trying  to  find  tori  in  objects  that  are 
unusually  large  or  small,  or  unusually  straight 
or  flexed,  can  be  made  self-limiting. 

This  work  required  the  extensive  use  of  a 
symbolic  mathematical  analysis  system  (IBM’s 
proprietary  Scratchpad  II):  the  resulting  trans¬ 
form  is  based  on  a  quadratic  equation  whose  co¬ 
efficients  incorporate  12  inner  products  of  three- 
space  vectors.  Along  the  way,  it  was  discovered 
that,  under  some  fairly  general  conditions,  ev¬ 
ery  torus  has  a  large  family  of  anti-tori.  Their 
hallucinatory  appearance  in  the  image  must  be 
explicitly  ignored. 

Most recently, the method has been examined
for an extension to cases where the underlying
space curve exhibits torsion, since the absence
of considerations for torsion in the method
appears to be the principal cause of errors in its
practical application. If a torus is viewed as a
circle swept over the simplest space curve exhibiting
curvature, then a "fat helix" (technically,
a canal surface) is what results when a
circle  is  swept  over  the  simplest  space  curve  ex¬ 
hibiting  torsion.  Unfortunately,  it  currently  ap¬ 
pears  that  the  mathematical  approach  used  in 
the  existing  method  depends  so  heavily  on  the 
theorem  of  Meusnier  in  differential  geometry, 
that  no  simple  extension  with  straightforward 
geometric  explanation  is  possible.  Partly  the 
problem  is  exacerbated  by  the  fact  that  under 
general  conditions  helices  are  not  uniquely  de¬ 
fined  by  a  subset  of  the  points  they  pass  through 
(unlike  circles,  which  are  unique  given  three 
member  points).  Less  intuitive  approaches, 
based  on  heuristic  restraints  on  the  types  of  he¬ 
lices  that  an  algorithm  should  recover,  are  being 
pursued  instead  [Kender  and  Kjeldsen,  1991, 
Kender and Kjeldsen, 1992].


4.3  Recovery  of  SHGCs  and  Symmetry 
Analysis 

In  the  previous  Image  Understanding 
Workshop,  we  reported  initial  results  for  recov¬ 
ering  generalized  cylinders;  in  particular  we  de¬ 
rived  the  mathematical  constraints  obtainable 
from  contours.  We  also  showed  how  to  recover 
the  remaining  free  parameters  from  intensity 
images.  This  work  is  described  more  completely 
in  [Gross,  1991].  Recently,  we  have  been  exam¬ 
ining  the  use  of  heuristics  to  recover  the  free 
parameters.  Chief  among  these  heuristics  was 
symmetry  of  the  cross-section  because  symme¬ 
try  is  pervasive  in  both  man-made  objects  and 
nature.  Since  symmetries  project  to  skew  sym¬ 
metries,  finding  axes  of  skew  symmetry  is  an  im¬ 
portant  vision  task.  Thus  we  recently  developed 
SYMAN,  a  SYMmetry  ANalyzer,  a  brief  de¬ 
scription  of  which  appeared  in  [Gross  and  Boult, 
1991].  In  that  paper,  we  motivated  SYMAN’s 
combination  of  global  and  local  methods.  We 
showed  how  to  derive  a  global  analytic  solution 
for  the  skew  axes  when  the  degree  of  skew  sym¬ 
metry  is  known.  We  also  briefly  presented  a  new 
local  tangent-based  method  which  has  advan¬ 
tages  over  previous  methods,  and  gave  examples 
using  SYMAN  on  both  real  and  synthetic  im¬ 
ages.  A  more  complete  description  of  SYMAN, 
and  derivation  of  the  formulas  can  be  found  in 
[Gross,  1991].  Recent  work  on  SYMAN  includes 
a  critical  comparison  of  4  different  techniques 
for  the  recovery  of  skew-symmetry  axes  [Boult 
et  al.,  1991],  and  the  recovery  of  the  orientation 
of  symmetric  objects  (i.e.  complete  resolution  of 
skew-axes  ambiguity)  by  the  “perceptual  group¬ 
ing”  of  potential  solutions  for  the  underlying 
multiple  orthogonal  skew-axes  pairs.  For  exam¬ 
ple,  the  pose  of  the  car  in  figure  3  was  obtained 
using  the  contours  shown,  computing  their  po¬ 
tential  skew  axes,  grouping  on  these  axes  and 
resolving the ambiguity to obtain a global coordinate
system, i.e., the car's pose.

Figure 3: The determined pose of a car. The edges shown were used in determining this pose.

4.4 Shape from Focus

Shape recovery methods such as stereo,
structured light, and shape from shading are effective
when applied to smooth diffuse surfaces
but produce sparse shape information when applied
to rough textured surfaces. Surfaces that
appear smooth on a macroscopic scale are often
rough at a microscopic scale. Images of
rough  surfaces  are  characterized  by  high  fre¬ 
quency  intensity  variations,  and  it  is  difficult  to 
perceive  the  shapes  of  these  surfaces  from  their 
images.  We  have  developed  a  shape-from-focus 
method  [Nayar  92a]  that  uses  different  focus  lev¬ 
els  to  obtain  a  sequence  of  object  images.  The 
sum-modified-Laplacian  (SML)  operator  is  de¬ 
veloped  to  compute  local  measures  of  the  qual¬ 
ity  of  image  focus.  The  SML  operator  is  applied 
to  the  image  sequence,  and  the  focus  measures 
obtained  at  each  image  point  are  used  to  com¬ 
pute  local  depth  estimates.  A  depth  estimation 
algorithm  uses  a  mathematical  model  to  inter¬ 
polate  the  focus  measures  to  obtain  accurate 
depth  estimates. 
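
The focus measure and depth interpolation described above can be sketched as follows. The modified Laplacian sums the absolute second differences in x and y, and the SML adds those values over a small window. The interpolation step assumes a Gaussian model of the focus measure near its peak and equally spaced focus positions; the report does not specify the model, so that choice, the function names and the default parameters are all assumptions of this sketch rather than the method of [Nayar 92a].

```python
import math

def modified_laplacian(img, x, y, step=1):
    """|2I - I_left - I_right| + |2I - I_up - I_down| at pixel (x, y)."""
    c = img[y][x]
    return (abs(2 * c - img[y][x - step] - img[y][x + step])
            + abs(2 * c - img[y - step][x] - img[y + step][x]))

def sml(img, x, y, half_window=1, threshold=0.0, step=1):
    """Sum of modified-Laplacian values >= threshold in a window at (x, y).

    The caller must keep (x, y) at least half_window + step pixels away
    from the image border.
    """
    total = 0.0
    for j in range(y - half_window, y + half_window + 1):
        for i in range(x - half_window, x + half_window + 1):
            ml = modified_laplacian(img, i, j, step)
            if ml >= threshold:
                total += ml
    return total

def interpolate_depth(depths, focus):
    """Refine the best-focus depth from the three measures around the peak.

    Assumes equally spaced depths, strictly positive focus values, and a
    Gaussian-shaped focus curve (so log-focus is a parabola near the peak).
    """
    m = max(range(len(focus)), key=focus.__getitem__)
    if m == 0 or m == len(focus) - 1:
        return depths[m]                    # peak on the boundary: no fit
    lo, mid, hi = (math.log(focus[k]) for k in (m - 1, m, m + 1))
    curvature = lo - 2.0 * mid + hi
    if curvature == 0.0:
        return depths[m]
    spacing = depths[m + 1] - depths[m]
    return depths[m] + 0.5 * spacing * (lo - hi) / curvature

# Focus measures sampled from a Gaussian that peaks between two focus levels.
depths = [0.0, 1.0, 2.0, 3.0]
focus = [math.exp(-(d - 1.3) ** 2 / (2 * 0.8 ** 2)) for d in depths]
refined = interpolate_depth(depths, focus)
```

In a full system `sml` would be evaluated at each pixel of every image in the focus sequence, and `interpolate_depth` would then be applied to the resulting per-pixel list of focus values.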

We  have  recently  completed  the  implemen¬ 
tation  of  a  fully  automated  shape-from-focus 
system  for  the  recovery  of  microscopic  objects 
[Nayar  92a].  Figure  4  shows  the  camera  im¬ 
age  of  a  50  micron  via-hole  filling  on  a  ceramic 
substrate  and  its  depth  maps  recovered  by  the 
shape  from  focus  system.  We  are  currently  test¬ 
ing  the  system  on  biological  samples  such  as 
micro-organisms  and  chromosomes. 


5  REAL-TIME  VISION 

The  focus  of  this  work  is  to  achieve  a  high 
level  of  interaction  between  a  real-time  vision 
system capable of tracking moving objects in 3-D
and a robot arm equipped with a dexterous
hand that can be used to pick up a moving object.
We  are  interested  in  exploring  the  interplay 
of  hand-eye  coordination  for  dynamic  grasping 
tasks  such  as  grasping  of  parts  on  a  moving 
conveyor  system,  assembly  of  articulated  parts 
or  for  navigation  and  grasping  from  a  mobile 
robotic  system.  Coordination  between  an  or¬ 
ganism’s  sensing  modalities  and  motor  control 
system  is  a  hallmark  of  intelligent  behavior,  and 
we  are  pursuing  the  goal  of  building  an  inte¬ 
grated  sensing  and  actuation  system  that  can 
operate  in  dynamic  as  opposed  to  static  envi¬ 
ronments.  The  system  we  have  built  addresses 
three  distinct  problems  in  robotic  hand-eye  co¬ 
ordination  for  grasping  moving  objects:  fast 
computation  of  3-D  motion  parameters  from 
vision,  predictive  control  of  a  moving  robotic 
arm  to  track  a  moving  object,  and  grasp  plan¬ 
ning.  The  system  is  able  to  operate  at  ap¬ 
proximately  human  arm  movement  rates,  and 
has  been  demonstrated  experimentally  by  track¬ 
ing  a  moving  model  train,  stably  grasping  it, 
and  picking  it  up.  The  algorithms  we  have 
developed  that  relate  sensing  to  actuation  are 
quite  general  and  applicable  to  a  variety  of  com¬ 
plex  robotic  tasks  that  require  visual  feedback 
for arm and hand control [Allen et al., 1990b,
Allen et al., 1991, Allen et al., 1992].

Figure 4: (a) Camera image of a 50 micron via-hole; (b) Depth map recovered by shape from focus.

Figure 5: MVP Sensor Planning System.

The vision system uses calibrated but unregistered
stereo cameras to track moving objects
in 3-D. The images from the cameras are
processed by a PIPE image processing engine
that performs an optic-flow computation in real time.
The computed velocity fields are morphologically
thinned and thresholded on velocity to
find regions of object motion. These regions are
then triangulated to give a 3-D position vector
in less than 100 milliseconds. This 3-D position
is sent to the motion tracking algorithm, which
uses a second order digital filter to smooth the
trajectory and a predictive alpha-beta-gamma
filter to account for video processing delays. Using
this system, a moving object with an arbitrary
trajectory can be tracked in real time.
Once  tracking  is  stable,  the  system  commands 
the  arm  to  intercept  the  moving  object  and  the 
hand  is  used  to  grasp  the  object  stably  and  pick 
it  up.  The  system  has  been  demonstrated  with 
a  variety  of  trajectories  and  a  video  of  its  per¬ 
formance  is  available. 
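
The predictive alpha-beta-gamma filter mentioned above can be sketched for a single coordinate; a 3-D tracker would run one such filter per axis. The gains below correspond to a critically damped design with a discount factor of 0.5 and are purely illustrative, as are the class name, the sampling interval and the delay value. None of these are taken from the system described in this report, and the second-order smoothing filter is not modeled.

```python
class AlphaBetaGammaFilter:
    """Constant-acceleration tracker for one coordinate (illustrative gains)."""

    def __init__(self, dt, alpha=0.875, beta=0.5625, gamma=0.0625):
        self.dt = dt
        self.alpha, self.beta, self.gamma = alpha, beta, gamma
        self.x = self.v = self.a = 0.0   # position, velocity, acceleration

    def predict(self, horizon):
        """Extrapolate the position `horizon` seconds past the last update."""
        return self.x + self.v * horizon + 0.5 * self.a * horizon ** 2

    def update(self, z):
        """Fold in a position measurement z taken dt after the previous one."""
        dt = self.dt
        x_pred = self.predict(dt)
        v_pred = self.v + self.a * dt
        r = z - x_pred                                   # innovation
        self.x = x_pred + self.alpha * r
        self.v = v_pred + (self.beta / dt) * r
        self.a = self.a + (2.0 * self.gamma / dt ** 2) * r
        return self.x

# Noise-free target moving at 2 units/s, sampled every 0.1 s.
tracker = AlphaBetaGammaFilter(dt=0.1)
for k in range(1, 101):
    tracker.update(1.0 + 2.0 * (k * 0.1))
# Position expected one (hypothetical) processing delay of 0.1 s ahead.
lead = tracker.predict(0.1)
```

Commanding the arm toward `lead` rather than the latest filtered position is what compensates for the delay between image capture and the availability of the 3-D estimate.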

6  SENSOR  PLANNING  AND 
MODELING 

6.1  MVP:  Model-Based  Sensor  Plan¬ 
ning  for  Machine  Vision  Tasks 

In  this  research  we  have  developed  tech¬ 
niques  to  analytically  determine  the  complete 
locus  of  camera  poses  and  optical  settings  that 
satisfy  visibility,  field-of-view,  resolution  and  fo¬ 
cus  requirements  of  a  machine  vision  task  for 
given  features  of  interest.  This  work  is  part 
of  more  extensive  research  that  we  are  pursu¬ 
ing,  as  part  of  our  MVP  (Machine  Vision  Plan¬ 
ning)  system  [Tarabanis,  1991a],  on  the  prob¬ 
lem  of  sensor  planning  for  satisfaction  of  sev¬ 
eral  generic  machine  vision  requirements.  We 
take  a  synthesis  approach  to  this  problem;  that 
is,  the  admissible  domain  of  sensor  locations 
and  settings  is  determined  for  each  task  con¬ 
straint  and  then  these  component  results  are 
combined  in  order  to  find  globally  admissible 
sensor  parameter  values.  This  approach  im¬ 
proves  on  the  generate-and-test  techniques  cur¬ 
rently  employed  in  which  sensor  configurations 
are  generated  and  then  tested  for  satisfaction 
of  the  task  criteria.  In  this  work,  all  five  de¬ 
grees  of  freedom  of  camera  placement  are  con¬ 
sidered  and  thus  the  results  are  applicable  to 
a general three-dimensional viewing configuration.
Camera placement experiments that
demonstrate the method have been carried out in an actual
robotic setup. A camera equipped with
a  programmable  zoom  lens  is  mounted  on  a 
robotic  arm  and  placed  and  focused  according 
to  the  results  of  the  new  technique.  Camera 
views  are  taken  to  verify  that  the  feature  of  in¬ 
terest  is  visible,  within  the  camera  field  of  view 
and resolvable to the given specification. Results
of this research will help automate the vision
system design process, assist in programming
the vision system itself and lead to intelligent
automated robot imaging systems. The
object models and new visibility-volume algorithms
have been implemented with ACIS, a
commercial solid modeling system. Efforts
are underway to improve the optimization
of camera settings using three different techniques:
1) interval analysis, 2) non-linear optimization
routines and 3) tree-annealing [Tarabanis
et al., 1991a, Tarabanis et al., 1991b,
Tarabanis, 1991b, Tarabanis and Tsai, 1991].
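
The synthesis approach of Section 6.1 can be illustrated in a deliberately reduced setting: a single unknown (the camera-to-feature distance along the optical axis), a thin-lens camera focused on a fronto-parallel feature, and two of the task constraints. Each constraint yields an admissible interval and the intervals are intersected, instead of generating candidate distances and testing them. The one-dimensional reduction, the function names and the numerical values are assumptions of this sketch; MVP itself works with five degrees of freedom of camera placement plus the optical settings.

```python
def resolution_interval(f, feature_len, pixel_size, min_pixels):
    """Distances at which the feature spans at least min_pixels pixels."""
    # Thin lens: image length = f * L / (d - f) >= min_pixels * pixel_size
    return (f, f + f * feature_len / (min_pixels * pixel_size))

def field_of_view_interval(f, feature_len, sensor_width):
    """Distances at which the whole feature fits on the sensor."""
    # Thin lens: image length = f * L / (d - f) <= sensor_width
    return (f + f * feature_len / sensor_width, float("inf"))

def intersect(*intervals):
    """Intersect intervals; return None if no distance satisfies them all."""
    lo = max(i[0] for i in intervals)
    hi = min(i[1] for i in intervals)
    return (lo, hi) if lo <= hi else None

# Hypothetical setup, all lengths in millimetres.
admissible = intersect(
    resolution_interval(f=25.0, feature_len=100.0,
                        pixel_size=0.011, min_pixels=200),
    field_of_view_interval(f=25.0, feature_len=100.0, sensor_width=8.8),
)
```

Any distance inside `admissible` satisfies both constraints at once. Tightening the resolution requirement enough makes the intersection empty, which a generate-and-test planner would only discover after exhausting its candidates.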

6.2  Extending  the  MVP  System 

The object and environment models which
MVP uses are static CAD models, which means
MVP cannot plan viewpoints in the face of
moving objects in the environment. We are exploring
methods of extending MVP to handle
motion. The first approach, currently limited
to the case of moving obstacles (the target, or
features to view, are stationary), is to sweep the
model of all moving objects along their trajectories
and to plan around the swept volumes, as
opposed to the actual objects. This approach
has been implemented successfully. A second
approach, which has not yet been implemented,
is to make time an explicit parameter in the
CAD models we are using, and to plan a viewpoint
that moves in 4 dimensions (monotonic
in time), maintaining a robust viewpoint at all
times [Abrams and Allen, 1992].

6.3  PROVER 

The  application  of  numeric  methods  to  the 
minimization  of  error  has  become  an  emerging 
paradigm  for  object  recovery.  Typically,  a  para¬ 
metric  representation  describing  the  object  is 
postulated.  Its  parameters  are  then  adjusted 
to minimize some measurement of the distance
between  the  representation  and  the  datapoints 
(the  error-of-fit  model).  Characteristics  of  the 
sensor used to recover the points may be implicit
in this formulation or may not be included
at all. While sensors may be precise for a specific
field of view, no sensor is everywhere exact.
A laser range finder, for example, yields
very sharp x- and y-coordinate values; however,
its z-coordinate is less trustworthy. It becomes
important  to  capture  the  strengths  and  weak¬ 
nesses  of  a  sensor  and  incorporate  them  into  the 
recovery  process.  We  seek  to  make  explicit  the 
contribution of a particular sensor by introducing
a sensor model. This partitioning facilitates
the  development  of  an  appropriate  description 
of  a  sensor’s  characteristics.  Also,  it  helps  clar¬ 
ify  interactions  among  different  aspects  of  the 
recovery  process  (i.e.  error-of-fit  model,  sen¬ 
sor  model,  and  parametric  object  representa¬ 
tion).  The  sensor  model  is  reflected  in  the  cer¬ 
tainty  of  sensed  quantities  (position,  color,  in¬ 
tensity)  associated  with  a  datapoint.  We  ex¬ 
plore  whether  the  introduction  of  an  explicit 
sensor  model  yields  an  improvement  in  the  re¬ 
covery  process.  The  PROVER  (Parametric  Re¬ 
covery  Of  Volumes:  Experimental  Recovery) 
system  is  a  testbed  used  in  the  development  of 
sensor  models  as  described  in  [O’Donnell  and 
Boult, 1991]. As shown in figure 6, this object-oriented
distributed system takes a number
of  modules  and  uses  them  to  recover  paramet¬ 
ric  representations.  We  can  then  compare  the 
results  of  these  recoveries  to  analyze  the  advan¬ 
tages  of  different  sensor  models. 
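
One simple way to let a sensor model enter the error-of-fit is to weight each residual by the certainty of the corresponding coordinate. The sketch below does this for the smallest possible "object", a single 3-D point observed by two sensors with different per-axis standard deviations. The data, the sigma values and the function names are hypothetical and are not taken from PROVER or from [O'Donnell and Boult, 1991].

```python
def weighted_sq_error(estimate, observations):
    """Sum of per-axis squared residuals, each scaled by that axis's sigma."""
    total = 0.0
    for point, sigmas in observations:
        for e, p, s in zip(estimate, point, sigmas):
            total += ((p - e) / s) ** 2
    return total

def fuse(observations):
    """Per-axis inverse-variance weighted mean (minimizes weighted_sq_error)."""
    dims = len(observations[0][0])
    fused = []
    for k in range(dims):
        weights = [1.0 / sigmas[k] ** 2 for _, sigmas in observations]
        values = [point[k] for point, _ in observations]
        fused.append(sum(w * v for w, v in zip(weights, values)) / sum(weights))
    return fused

# Hypothetical sensor models: a range finder that is sharp in x and y but
# poor in z, and a second sensor with the opposite character.
observations = [
    ((1.00, 2.00, 5.40), (0.01, 0.01, 0.50)),
    ((1.10, 2.10, 5.00), (0.10, 0.10, 0.05)),
]
estimate = fuse(observations)
plain_mean = [(a + b) / 2.0 for a, b in zip(observations[0][0], observations[1][0])]
```

Comparing `weighted_sq_error(estimate, observations)` with the value for `plain_mean` shows how much an explicit sensor model changes the recovered quantity; the same weighting carries over to the residuals of a parametric surface fit.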

7  TOPOLOGICAL  NAVIGATION 

The  model  for  topological  visual  navigation 
in  two-dimensional  spaces,  already  formalized 
and  demonstrated,  has  been  extended  by  fur¬ 
ther  analyses  of  the  components  of  qualitative 
direction-giving. 

The efficiency with which an object can be
used as part of a series of directions (that is, the
extent to which an object is a "landmark") depends
heavily on several properties. An important
one is the relative object clutter of the individual
images taken of the environment; landmarks
thus depend upon the size of the viewing
aperture.

At least four such aperture-dependent properties
have been formalized, and their navigation
and description costs are being evaluated
in a full navigation system. First, a landmark
is considered "obvious" if it uniquely occurs in
an image of the navigational world (under a
fairly wide set of viewing circumstances). Second,
a landmark is "confusable" with another
landmark if an arbitrary choice of landmark in
an image has no effect on the result of the navigation.
Third, a landmark is "new" if it can be
distinguished from identical landmarks in prior
images; landmarks can thus be path-dependent.
Lastly, a landmark may be defined in terms of
its spatial configuration with other landmarks,
including such virtual landmarks as the boundaries
of the viewing window. Each of these properties
can be defined without reference to position,
orientation, scale, or magic numbers, and
each leads to particularly simple sets of navigational
directions.

Further investigation of the properties of
the directions themselves reveals at least two
general strategies for navigation. These two
strategies appear to be members of a continuum
of strategies that can be mixed and matched
at will in order to minimize the various costs
of navigation: sensing cost, travel cost, and
direction-giving cost. The two strategies that
have been formalized are based on the concept
of a navigational "parkway," a particularly
straightforward spatial arrangement of landmarks.
(Common physical streets and highways
are special cases of parkways.) Navigation
may require the controlled departure from
one parkway to another; the careful selection of
such "trajectories" establishes a second strategy.
(Common ocean crossings are a special
case of trajectories.) The definition, analysis,
generalization, implementation, and evaluation
of these and other high-level strategies for
metric-free qualitative navigation is proceeding
[Kender et al., 1990].

Related  to  the  above  work,  mathematical 
techniques  from  computational  geometry  and 
fuzzy  set  theory  have  been  used  to  develop  a 
method  for  the  qualitative  description  of  the 
topological  properties  of  a  set  of  significant  ob¬ 
jects  or  landmarks.  The  method  permits  the 
fuzzy  description  of  a  spatial  array  of  objects, 
independent  of  metric  properties  or  magic  num¬ 
ber  thresholds. 

Figure 6: The PROVER system

The method constructs the convex hull of
a set of points, and uses membership functions
from fuzzy set theory to classify the spatial array
into such basic geometric shapes as triangles
or squares, or into other useful spatial configurations
that have no simple English names, such
as "pointed polygon". The method is generic,
recursively applicable, extensible to three dimensions,
and strongly suggests other descriptive
categories that are useful even if not recognized
by the nouns of a given natural language.
The  determination  of  the  stability  of  such  de¬ 
scriptions  under  small  perturbations  of  the  spa¬ 
tial  configuration  is  under  investigation. 
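
The combination of a convex hull with a fuzzy membership function can be sketched as below. The hull is computed with the standard monotone-chain algorithm. The membership function for "triangle" is a hypothetical choice made only for illustration (the area of the largest triangle on the hull vertices divided by the hull area); the report does not give the actual membership functions. The ratio is unchanged by translation, rotation and uniform scaling, and uses no thresholds.

```python
from itertools import combinations

def convex_hull(points):
    """Andrew's monotone chain; hull vertices in counter-clockwise order."""
    pts = sorted(set(points))
    if len(pts) <= 2:
        return pts

    def cross(o, a, b):
        return (a[0] - o[0]) * (b[1] - o[1]) - (a[1] - o[1]) * (b[0] - o[0])

    lower, upper = [], []
    for p in pts:
        while len(lower) >= 2 and cross(lower[-2], lower[-1], p) <= 0:
            lower.pop()
        lower.append(p)
    for p in reversed(pts):
        while len(upper) >= 2 and cross(upper[-2], upper[-1], p) <= 0:
            upper.pop()
        upper.append(p)
    return lower[:-1] + upper[:-1]

def polygon_area(poly):
    """Shoelace formula (absolute area)."""
    n = len(poly)
    s = sum(poly[i][0] * poly[(i + 1) % n][1]
            - poly[(i + 1) % n][0] * poly[i][1] for i in range(n))
    return abs(s) / 2.0

def triangle_membership(points):
    """Hypothetical fuzzy degree in [0, 1] to which the hull 'is a triangle'."""
    hull = convex_hull(points)
    area = polygon_area(hull) if len(hull) >= 3 else 0.0
    if area == 0.0:
        return 0.0                       # degenerate (collinear) arrangement
    best = max(polygon_area([a, b, c]) for a, b, c in combinations(hull, 3))
    return best / area

# Three landmarks around a fourth score as a full triangle; a square does not.
tri = triangle_membership([(0, 0), (4, 0), (0, 3), (1, 1)])
sq = triangle_membership([(0, 0), (1, 0), (1, 1), (0, 1)])
```

Other shape categories would need their own membership functions, and the stability question raised above amounts to asking how quickly such values change when the points are perturbed.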

Results  from  this  effort  are  being  integrated 
with  the  topological  navigation  techniques,  with 
which  they  form  a  continuum.  At  one  extreme, 
the  world  is  seen  in  a  series  of  sensations,  and  a 
goal  must  be  attained  by  a  series  of  directions. 
At the other extreme, the entire world can be seen
at a glance, and a goal need only be described
[Abella, 1992].

8  DEXTEROUS  ROBOTIC  HANDS 

The  focus  of  this  work  is  in  exploiting  the 
capabilities of the Utah-MIT dexterous hand.
This is a 4-fingered, 16 DOF device with rich
sensing (force, position, tactile) and high dexterity
[Allen et al., 1990a, Allen, 1990a]. We
include  this  work  in  the  Image  Understanding 
report  because  of  our  desire  to  integrate  both 
vision and touch in an active sensing paradigm.

8.1 Haptic Object Recognition

Object  recognition  has  focused  on  vision 
based  methods.  Humans  also  have  a  very  highly 
developed  haptic  perception  system.  By  haptic, 
we mean the interplay of both the cutaneous
system (skin, tactile receptors) and the kinaesthetic
system (joints, muscle and bone). We
have developed robotic analogs of 3 human haptic
sensing strategies to recover the 3D shape of
objects using our hand/arm system [Allen and
Michelman, 1990]. Vision sensing is also being
added  to  allow  fully  autonomous  object  recog¬ 
nition.  The  idea  here  is  that  complementary 
sensing from both vision and touch allows for
more robust and stable recognition. The vision component
consists of a real-time linear feature extractor
whose output is fed to a robust line-based stereo system
that recovers 3-D axes of surfaces of revolution.
These axes are then used by the hand
system  to  explore  the  object’s  contour  and  re¬ 
cover  the  shape  of  the  object  [Allen,  1990b, 
Singh  and  Shneier,  1990].  A  related  work  by 
Kenneth  Roberts  [Roberts,  1991]  uses  active 
touch  exploration  to  recognize  a  3-D  object 
taken  from  a  known  set  of  models.  This  work 
combines  three  approaches:  (1)  using  geomet¬ 
ric  constraints  between  components  to  eliminate 
interpretations,  (2)  interpretation  tree  methods 
for  choosing  the  best  active  sensing  move,  (3) 
exploratory  moves  made  by  tracing  continually 
along  the  surface  of  the  object  (and  not  through 
free space). This method uses a set of polyhedral
object models and exploits a set of geometric
constraints tailored for matching components
acquired from haptic exploration against
components  in  the  models.  A  new  constraint 
using  pairs  of  line  segments  has  been  developed 
and  this  constraint  is  also  relevant  to  problems 
in  computer  vision.  An  interesting  aspect  of 
this  work  is  that  active  touch  sensing  moves  are 
associated  with  costs,  and  this  allows  the  deter¬ 
mination  of  sensing  strategies  and  methods  for 
choosing  the  next  sensor  move  for  recognition 
tasks. 

8.2  Tool  Usage  with  a  Dexterous  Hand 

This  research  investigates  the  ways  in  which 
tools  can  be  used  by  dexterous  robot  hands. 
As  robots  are  called  upon  to  perform  finer  and 
more  complex  tasks,  they  will  begin  to  use  ma¬ 
nipulators  with  a  large  number  of  degrees  of 
freedom  and  sensors.  For  example,  in  hazardous 
environments,  such  as  nuclear  reactors,  teleop- 
erated  robot  hands  will  be  used  for  grasping  and 
manipulation.  As  the  application  of  robotics 
extends  more  toward  space,  the  long  commu¬ 
nication  time  delays  will  necessitate  imparting 
greater  autonomy  to  robots  in  fine  manipulation 
tasks. 

The  tasks  we  have  been  studying  are  pre¬ 
cision  tool  tasks,  such  as  using  a  fine  screw¬ 
driver,  a  pair  of  tweezers,  a  pencil  or  an  eraser. 
The  significance  of  these  precision  tasks  is  that 
the  tool — the  object  being  manipulated — is  con¬ 
trolled  strictly  by  the  fingers  of  the  hand,  rather 
than  by  a  wrist  or  arm.  Although  much  of  our 
work  has  been  in  analyzing  the  requirements 
and mechanics of manipulation tasks, our ultimate
goal is the implementation of the
tool tasks cited above on a Utah/MIT dexterous
hand.  Therefore,  we  have  also  worked  on  the 
development  of  digital  position  and  force  con¬ 
trollers  for  a  robot  hand,  kinematic  calibration 
of  the  hand,  and  tactile  and  force  sensing  for 
the  robot’s  fingers  [Michelman  and  Allen,  1991, 
Jiang et al., 1991].
Abstract

Research  in  the  Computer  Vision  Laboratory 
at  Maryland  is  focused  on  problems  whose  so¬ 
lutions  would  constitute  significant  progress  in 
image  understanding.  This  paper  reviews  our 
work  on  these  problems  during  the  period  Au¬ 
gust  1990  -  September  1991.  The  areas  covered 
include  low-level  vision,  scene  recovery,  purpo¬ 
sive  vision,  navigation,  object  recognition,  and 
parallel  algorithms,  as  well  as  a  few  other  top¬ 
ics. 

1  Low-level  vision 

Two major goals of low-level vision are the identification
of regions and region boundaries in an image that
arise  from  surfaces  or  surface  discontinuities  in  the  scene. 
These  tasks  of  segmentation  and  edge  detection  are  not 
straightforward  because  of  the  presence  of  noise,  which 
may  arise  either  in  the  scene  itself  or  in  the  imaging 
process.  Smoothing  of  the  image  to  reduce  noise,  es¬ 
timation  to  identify  image  regions  that  are  good  fits  to 
simple  functions,  and  differentiation  to  aid  in  identifying 
image  discontinuities,  are  commonly  used  preprocessing 
operations. 

Our  research  on  low-level  vision  over  the  past  year 
has  involved  the  development  of  new  methods  of  dif¬ 
ferentiation,  smoothing,  estimation,  edge  detection,  and 
segmentation.  Approaches  investigated  include  Markov 
random  field  and  neural  network  techniques;  multires¬ 
olution  (“pyramid”)  techniques;  robust  estimation  and 
“consensus”  techniques;  Bayesian  and  maximum  a  pos¬ 
teriori  (MAP)  techniques;  and  optimization  techniques. 
We  have  applied  these  approaches  to  various  types  of 
images,  including  aerial  photographs,  synthetic-aperture 
radar  images,  and  range  images. 

1.1  Differentiation  [30] 

Reliable  derivatives  of  digital  images  have  always  been 
hard  to  obtain,  especially  (but  not  only)  at  high  orders. 
We  have  developed  new  filters  that  give  more  accurate 
derivatives  than  the  traditional  Gaussian  ones.  We  have 
shown that the traditional filters give incorrect derivatives
even for an analytic, noiseless, infinite image, because
they smooth the image too much. For a finite
interval, the effects of truncating the filter become intolerable
for high derivatives. We have derived filters
that  allow  a  higher  amount  of  noise  suppression  with 
less  compromise  of  accuracy  than  the  Gaussian.  The  fil¬ 
ters  are  easy  to  compute  at  arbitrary  size.  In  addition, 
a  general  analytic  (non-filter)  solution  has  been  derived 
for  the  regularization  problem  on  a  finite  interval. 

1.2  Smoothing  [3,  13] 

A  non-iterative,  parameter-free  image  smoothing 
method has been developed which handles both additive
and multiplicative noise. First, for every pixel the
largest centered neighborhood (7 x 7, 5 x 5 or 3 x 3)
containing  no  discontinuity  is  sought.  The  selection  is 
made  by  comparing  a  local  discontinuity  measure  with 
its  robust  global  estimate  corresponding  to  homogeneous 
neighborhoods.  If  no  discontinuity  is  present,  the  pixel 
is  assigned  the  spatial  mean  computed  in  the  neighbor¬ 
hood.  Around  discontinuities  an  adaptive  least  squares 
smoothing  method  is  applied  in  3  x  3  neighborhoods. 
The  performance  of  the  multiresolution  algorithm  has 
been  compared  with  flat-facet  smoothing  and  adaptive 
smoothing  for  additive  noise.  The  smoothing  of  syn¬ 
thetic  aperture  radar  images  has  been  used  to  demon¬ 
strate  the  effectiveness  of  the  algorithm  for  multiplica¬ 
tive  noise. 
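
The window-selection step described above can be sketched as follows. This is a minimal illustration under stated assumptions, not the published algorithm: the discontinuity measure is assumed to be the local variance, the threshold `noise_var` is passed in by hand rather than obtained from a robust global estimate, and pixels near discontinuities are simply left unchanged instead of receiving the adaptive least squares treatment. The function name is hypothetical.

```python
from statistics import mean, pvariance

def adaptive_mean_smooth(image, noise_var, sizes=(7, 5, 3)):
    """Replace each pixel by the mean of the largest centered window whose
    variance does not exceed noise_var; otherwise keep the pixel as is."""
    rows, cols = len(image), len(image[0])
    out = [row[:] for row in image]
    for r in range(rows):
        for c in range(cols):
            for size in sizes:  # largest window first
                h = size // 2
                if r - h < 0 or c - h < 0 or r + h >= rows or c + h >= cols:
                    continue  # window does not fit inside the image
                window = [image[i][j]
                          for i in range(r - h, r + h + 1)
                          for j in range(c - h, c + h + 1)]
                if pvariance(window) <= noise_var:  # assumed homogeneity test
                    out[r][c] = mean(window)
                    break
    return out
```

On a noiseless step edge this sketch returns the input unchanged, because no window that straddles the step passes the variance test.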

Further  work  on  image  smoothing  has  led  to  a  hierar¬ 
chical  implementation  of  an  edge-preserving  smoothing 
algorithm  on  a  2  x  2  pyramid  structure.  The  smoothed 
pixel  values  are  chosen  from  the  first  three  levels  of  the 
pyramid  and  the  original  image.  The  reduced  resolu¬ 
tion  representations  are  analyzed  in  a  top-down  fashion 
by  comparing  the  local  variances  with  the  corresponding 
global  noise  variance  estimates.  The  global  estimates  are 
also  computed  in  the  pyramid.  Close  to  edges  the  pixel 
values are obtained by adaptive least squares. The artifacts
of region-based smoothing are eliminated by pixelwise
averaging over a set of outputs obtained with the
input image shifted within the 8 x 8 block of the level-three
parent.

1.3  Robust  estimation  and  consensus  methods 
[12,  31,  55] 

When  the  images  of  interest  can  be  modeled  as  simple 
functions  corrupted  by  noise,  estimation  of  the  parame¬ 
ters  of  these  functions  provides  a  powerful  approach  to 
image smoothing and segmentation. For estimation pur¬
poses,  robust  high  breakdown  point  techniques  are  gain¬ 
ing  increasing  popularity  in  computer  vision  since  they 
can  tolerate  up  to  half  the  data  being  severely  corrupted. 
The  least  median  of  squares  (LMedS)  estimator  is  the 
best  known  example  of  such  techniques.  We  have  shown 
that  the  attractive  properties  of  the  LMedS  estimator 
do  not  hold  when  all  the  data  is  corrupted  by  zero  mean 
noise.  We  have  developed  a  “consensus  by  decomposi¬ 
tion”  (CBD)  algorithm  which  preserves  the  properties 
of  LMedS  up  to  low  signal-to-noise  ratios  while  achiev¬ 
ing  a  significant  speed-up  relative  to  LMedS.  The  CBD 
estimator  uses  a  different  paradigm  than  LMedS.  The 
data  is  decomposed  in  both  the  spatial  and  parameter 
domains.  A  separate  distribution  is  built  for  every  pa¬ 
rameter  and  the  distribution  is  analyzed  with  a  new, 
enhanced  mode  detection  procedure.  The  superiority  of 
the CBD estimator has been demonstrated by extensive
simulations. 
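
For reference, the LMedS criterion itself (not the CBD algorithm) can be sketched for a straight-line fit. The sketch assumes a model y = a*x + b and searches exhaustively over all lines through pairs of points, which is practical only for small data sets; the function name is hypothetical.

```python
from itertools import combinations
from statistics import median

def lmeds_line(points):
    """Fit y = a*x + b by minimizing the median of squared residuals
    over all candidate lines through pairs of points."""
    best = None
    for (x1, y1), (x2, y2) in combinations(points, 2):
        if x1 == x2:
            continue  # vertical line cannot be written as y = a*x + b
        a = (y2 - y1) / (x2 - x1)
        b = y1 - a * x1
        score = median((y - (a * x + b)) ** 2 for x, y in points)
        if best is None or score < best[0]:
            best = (score, a, b)
    return best[1], best[2]
```

Because the score is a median rather than a sum, a minority of grossly wrong points does not move the fit, which is the high breakdown point property referred to above.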

The  consensus  approach  constitutes  a  paradigm  for 
performing  robust  model  fitting.  Model  fitting  is  an 
essential  part  of  many  algorithms  that  try  to  interpret 
data.  Such  algorithms  normally  use  least  squares  based 
estimation  for  fitting  purposes.  The  least  squares  proce¬ 
dure  is  very  sensitive  to  noise  and  variations  in  the  data. 
Furthermore,  it  is  frequently  the  case  that  a  model  dis¬ 
continuity  exists  in  the  data  and  it  cannot  be  reliably 
detected  by  conventional  methods. 

Recently,  new  robust  estimation  methods  have  been 
introduced  that  deal  with  several  types  of  variations  and 
model  discontinuities.  Unfortunately,  as  already  men¬ 
tioned,  analysis  shows  that  some  of  these  methods  are 
not  effective  in  the  presence  of  noise  such  as  is  usually 
found  in  image  and  signal  data.  The  consensus  paradigm 
integrates  basic  robust  estimation  methods,  a  decompo¬ 
sition  of  the  problem  and  a  relative  majority  analysis 
to  perform  robust  estimation.  The  resulting  robust  esti¬ 
mate  is  a  consensus  among  independent,  invariant  esti¬ 
mates  of  sub-problems. 

The  consensus  paradigm  provides  effective  algorithms 
for  signal  and  image  estimation,  edge  detection,  segmen¬ 
tation,  and  contour  matching.  The  performance  of  these 
algorithms  has  been  analyzed  and  compared  to  existing 
methods. 

As  an  example  of  the  consensus  approach,  we  have 
developed  a  new  robust  algorithm  for  edge  detection. 
The  algorithm  detects  both  roof  and  step  type  edges.  A 
pixel  is  declared  as  an  edge  pixel  if  there  is  a  consen¬ 
sus  between  different  processes  that  try  to  determine  if 
the  pixel  lies  on  a  discontinuity.  We  use  robust  estima¬ 
tion  methods  to  estimate  local  fits  to  windows  in  the 
pixel’s  neighborhood  and  accumulate  votes  from  each 
fit.  The  use  of  robust  estimators  enables  us  to  transform 
any  window  possibly  containing  a  discontinuity  to  a  bi¬ 
nary  window  containing  a  step  edge  in  the  location  of 
the  discontinuity.  We  then  employ  conventional  meth¬ 
ods  to  detect  this  step  edge.  For  simulated  edges  oc¬ 
curring  in  synthetic  images  with  varying  Gaussian  and 
random  noise  levels,  we  have  analyzed  the  probability  of 
detection.  The  algorithm  has  also  been  applied  to  real 
intensity and range images and shown to perform well
in comparison with standard edge detectors. Further
details  can  be  found  in  a  paper  in  these  proceedings. 

1.4  Bayesian  estimation  and  MAP  methods 
[27,  50,  54,  58] 

When  the  ensemble  of  ideal  scenes  and  the  noise  can 
be  modeled  probabilistically,  Bayesian  estimation  tech¬ 
niques  can  in  principle  be  used  to  find  the  ideal  scene 
that  most  likely  gave  rise  to  the  observed  noisy  image; 
this  is  the  Maximum  A  Posteriori  (MAP)  estimate  of 
the  scene.  A  common  objection  to  Bayesian  estimation 
is  that  the  probability  density  functions  (pdf’s)  involved 
are  usually  not  exactly  known.  In  fact,  however,  ex¬ 
act  knowledge  of  the  pdf’s  is  not  important;  it  often 
suffices  to  know  the  pdf’s  approximately.  Furthermore, 
it  may  even  suffice  if  we  have  a  family  of  pdf’s  one  of 
which  approximates  the  actual  pdf,  provided  we  specify 
a  “second-stage”  pdf  on  the  family  such  that  the  approx¬ 
imation  of  the  actual  pdf  has  high  probability. 

Bayesian  estimation  of  digital  signals  (or  images)  is 
ordinarily concerned with the estimation of an ideal signal,
given  a  noisy  signal.  The  computational  cost  of  this  pro¬ 
cess  is  greatly  reduced  if  the  objective  is  only  partial  or 
“qualitative”  Bayesian  description,  rather  than  complete 
estimation,  of  the  ideal  signal.  For  example,  in  the  case 
of  a  piecewise  constant  signal,  instead  of  estimating  the 
value  of  the  ideal  signal,  we  can  require  only  a  piecewise 
symbolic  description  of  the  signal — e.g.,  is  the  value  high 
or  low,  where  these  descriptors  are  defined  by  probability 
densities  on  the  possible  signal  values.  This  task  is  com¬ 
putationally  less  costly  than  that  of  complete  Bayesian 
estimation  of  the  signal;  moreover,  the  descriptions  can 
be  estimated  robustly.  We  have  demonstrated  this  both 
for  digital  signals  and  for  a  simple  class  of  digital  images. 
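
A qualitative description of one constant segment can be sketched as a MAP choice between symbolic labels. The sketch makes several assumptions that are not taken from the work summarized here: the segment boundaries are known, the noise is additive Gaussian with known variance, and each descriptor is a Gaussian density on the ideal value with its own prior probability. All names are hypothetical.

```python
import math

def gaussian_pdf(x, mu, var):
    return math.exp(-(x - mu) ** 2 / (2 * var)) / math.sqrt(2 * math.pi * var)

def map_label(samples, descriptors, noise_var):
    """Return the most probable label for one constant segment.

    descriptors maps a label to (prior, mean, variance) of the ideal value.
    Under the Gaussian assumptions the sample mean is sufficient, and its
    density given a label has variance var + noise_var / n.
    """
    n = len(samples)
    ybar = sum(samples) / n

    def posterior_score(name):
        prior, mu, var = descriptors[name]
        return prior * gaussian_pdf(ybar, mu, var + noise_var / n)

    return max(descriptors, key=posterior_score)
```

Only one score per label is evaluated, rather than a full posterior over all possible signal values, which illustrates why the symbolic task is cheaper.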

The  problem  of  estimation  using  partial  (e.g.,  com¬ 
pressed)  information  about  the  observations  is  impor¬ 
tant  in  practice.  One  reason  for  its  importance  is  that 
we  might  be  interested  in  communicating  data  from  the 
sensor(s)  to  the  place  where  decisions  are  made  (e.g., 
remote  sensing  data).  Another  reason  is  that  estima¬ 
tion  using  compressed  information  might  be  less  costly 
in  terms  of  computation.  We  have  studied  the  problem  of 
estimating  the  parameters  of  a  signal  of  known  form  (e.g. 
polynomial  of  degree  r)  using  a  Bayesian  approach  to 
estimation.  Conditions  can  be  formulated  under  which 
the  estimates  obtained  using  partial  information  are  the 
same  as  those  obtained  using  full  information.  This  ap¬ 
proach  has  applications  to  distributed  detection  (sensor 
fusion).  Partial  information  about  the  observations  can 
also  be  used  effectively  to  obtain  partial  estimates  of  the 
signal. 

Bayesian  methods  can  be  used  not  only  for  conven¬ 
tional  signal  estimation  tasks,  but  also  for  identification 
of data generated by formal models. For example, they can
be  used  to  recover,  from  a  finite  set  of  candidate  gram¬ 
mars,  the  most  probable  grammar  (and  derivation)  that 
generated  the  non-noisy  version  of  an  observed  noisy 
string,  where  the  noise  process  is  iid  and  defined  by 
an  arbitrary  stochastic  matrix.  We  have  shown  that  if 
the  grammars  are  context-free  or  stochastic  context-free, 
this  problem  is  solvable  in  polynomial  time. 
1.5 Optimization-based methods [35, 42]

Optimality  criteria  are  commonly  used  for  the  segmen¬ 
tation  of  images  into  homogeneous  regions.  In  this  tradi¬ 
tion,  we  have  developed  several  algorithms  for  segment¬ 
ing  highly  speckled  high  resolution  Synthetic  Aperture 
Radar  (SAR)  complex  data  into  spatially  and  radiomet- 
rically  homogeneous  regions.  Our  approach  is  based  on 
two  models,  one  for  the  speckled  complex  amplitudes, 
and  the  other  for  the  regions.  The  first  model  uses  the 
physics  of  the  SAR  imaging  and  processing  system  to 
characterize  the  statistics  of  speckle  while  the  second 
model  uses  a  Markov  random  field  to  describe  the  statis¬ 
tics  of  the  regions.  Based  on  the  combination  of  these 
two  models  using  Bayes’  theorem,  two  possible  optimal¬ 
ity  criteria  are  used  for  the  segmentation  of  the  complex 
data  into  regions.  The  resulting  algorithms  are  imple¬ 
mented  on  a  parallel  optimization  network.  Results  us¬ 
ing  both  simulated  and  actual  SAR  complex  data  have 
been  used  to  compare  the  algorithms  and  evaluate  their 
performance. 

Current  object  recognition  systems  can  only  “recog¬ 
nize”  a  limited  class  of  objects.  Objects  having  variable 
numbers  of  parts  and  only  loosely  constrained  shapes 
cannot  be  modeled  and  recognized  by  these  systems.  We 
developed  a  data  structure  called  the  VAPOR  (Variable 
APpearance  Object  Representation)  model  to  represent 
objects  with  these  kinds  of  variable  appearances  and 
have  developed  a  search  procedure  called  MOSS  (MOdel 
Space  Search)  to  find  instances  of  these  models  in  im¬ 
age  data.  The  VAPOR  model  is  an  idealization  of  the 
object;  all  instances  of  the  model  in  the  image  are  vari¬ 
ations  from  the  ideal  appearance.  The  variations  are 
evaluated  by  the  description  length  of  the  model,  mea¬ 
sured  in  information-theoretic  bits.  MOSS  selects  the 
best  model  for  the  given  image  data  by  choosing  the 
minimal  length  description.  We  have  demonstrated  how 
this  approach  performs  in  a  simple  domain  of  circles  and 
polygons  and  in  the  complex  domain  of  finding  cloverleaf 
intersections  in  aerial  images  of  roads. 

1.6  Cluster  analysis  [9,  49] 

Both  consensus  and  MAP  methods  can  be  applied  to 
the  detection  of  clusters  in  sparse  data.  As  an  illus¬ 
tration  of  the  consensus  approach,  we  have  investigated 
the  use  of  pyramid  algorithms  for  compact  region  detec¬ 
tion  and  delineation  to  detect  compact  dot  clusters  on  a 
sparsely  dotted  background.  When  the  delineation  pro¬ 
cess  is  applied  to  a  detected  cluster,  it  yields  “ragged” 
results  which  are  sensitive  to  the  position  of  the  cluster 
in  the  image.  But  if  the  process  is  applied  to  a  set  of 
shifted  versions  of  the  cluster,  the  results  can  be  com¬ 
bined  into  an  acceptable  delineation.  This  provides  a 
further  illustration  of  the  general  “consensus”  approach 
to  combining  multiple  techniques,  or  multiple  versions  of 
the  same  technique,  to  obtain  more  reliable  results. 

The  problem  of  dot  clustering  can  also  be  studied  from 
a  model-based  viewpoint.  In  this  approach,  a  set  of 
scatter processes (in brief, scatters) is chosen, each of
which  associates  a  probability  with  each  location  in  a 
discrete space; in other words, a scatter is a probability
mass function (pmf) on the space. Some number of
dots is then distributed in accordance with each scatter.
A scatter and an associated numerosity define a subpopulation
of dots. This model is extremely general; the
scatter pmf's are arbitrary. Given a set of dots generated
by  such  a  model,  we  can  apply  Maximum  A  Posteriori 
(MAP)  methods  to  recover  the  most  likely  set  of  scat¬ 
ters  and  numerosities  that  could  have  given  rise  to  the 
dots.  This  identification  problem  is  different  from  the 
partitioning  problem,  which  asks  for  the  most  likely  par¬ 
tition  of  the  dot  population  into  subpopulations.  MAP 
methods  are  especially  useful  in  cluster  analysis  when 
the  scatters  are  non-Gaussian.  The  general  identifica¬ 
tion  problem  is  intractable,  but  it  has  a  polynomial  time 
solution  if  the  number  of  clusters  is  bounded,  and  a  simi¬ 
lar  result  holds  for  the  partitioning  problem.  The  details 
of  this  work  can  be  found  in  a  paper  in  these  proceedings. 

1.7  Texture  analysis  [8,  34] 

During  the  past  25  years  considerable  effort  has  been 
expended  attempting  to  formulate  a  theory  of  texture 
segregation  and  description.  Attempts  have  been  made 
to  characterize  the  information  that  yields  texture  seg¬ 
regation  in  terms  of  a  small  set  of  properties  or  primi¬ 
tives.  These  attempts  have  provided  important  informa¬ 
tion  concerning  the  visual  processing  involved  but  have 
not  succeeded  in  establishing  a  psychophysical  theory 
of  texture  segregation.  Specifying  the  features  yielding 
texture  segregation  has  proved  to  be  difficult.  An  al¬ 
ternative  approach  is  to  characterize  texture  segregation 
in  terms  of  processing  mechanisms.  Two  mechanisms 
that  explain  much  of  the  experimental  data  are  spatial- 
frequency  channels  and  preattentive  grouping  processes. 
We  have  conducted  experiments  showing  that  texture 
segregation  can  be  explained  in  terms  of  the  differen¬ 
tial  stimulation  of  spatial-frequency  channels  operating 
on  intensity  values.  However,  texture  segregation  also 
occurs  as  a  result  of  preattentive  grouping  processes. 
The  grouping  of  discrete  elements  into  a  line-like  pattern 
through  edge  alignment  segregates  the  pattern  from  sur¬ 
rounding  elements.  The  grouping  of  intermixed  light  and 
dark  elements  through  lightness  similarity  segregates  the 
pattern  into  subpopulations.  Perceived  population  seg¬ 
regation is approximately a single-valued function of the
differences  in  the  perceived  lightnesses  of  the  elements 
whereas  the  relevant  variable  for  spatial-frequency  chan¬ 
nels  is  stimulus  contrast.  Recent  experiments  indicate 
that  texture  segregation  may  be  affected  by  the  stim¬ 
ulus  representation.  We  have  performed  experiments 
showing  that  a  change  in  the  orientation  of  a  stimulus 
that  keeps  the  slopes  of  the  component  features  constant 
yields  stronger  texture  segregation  in  a  3D  representa¬ 
tion  (i.e.,  the  figures  were  seen  as  three-dimensional) 
than in a 2D representation (i.e., the figures were
seen  as  two-dimensional).  We  believe  that  the  greater 
texture  segregation  with  a  3D  representation  is  a  con¬ 
sequence  of  grouping  processes.  A  3D  representation 
makes  evident  the  orientation  of  object  surfaces  enabling 
the  grouping  of  objects  by  their  similarity  of  surface  ori¬ 
entation,  e.g.,  the  direction  of  their  surface  normals.  We 
conjecture  that  these  grouping  processes,  because  they 
are  based  on  the  3D  interpretation  of  projected  shapes, 
require attention.

We  have  also  investigated  the  problem  of  texture  dis¬ 
crimination  in  range  images.  The  range  data  is  trans¬ 
formed  to  a  common  coordinate  system,  and  then  re¬ 
sampled  to  generate  a  regular  grid  of  points  in  order  to 
eliminate  the  effect  of  different  sampling  rates.  Statisti¬ 
cal  textural  features  based  on  co-occurrence  matrices  are 
computed  on  the  resampled  data.  These  features  can  be 
used  to  discriminate  between  classes  of  natural  surfaces. 
Our  experiments  used  surfaces  composed  of  pebbles  of 
different  sizes  lying  on  a  plane.  Many  of  the  standard 
statistical  textural  features  have  been  found  to  be  useful 
in  discriminating  between  such  surfaces.  The  experi¬ 
ments  also  confirmed  the  importance  of  resampling  the 
data  before  computing  the  textural  features. 
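
A co-occurrence matrix and one standard feature derived from it (contrast) can be sketched as follows. The sketch assumes the resampled range values have already been quantized to integer levels 0..levels-1 on a regular grid; the quantization, the choice of offset, and the function names are illustrative assumptions.

```python
def cooccurrence(grid, dr, dc, levels):
    """Normalized co-occurrence matrix for the pixel offset (dr, dc)."""
    counts = [[0] * levels for _ in range(levels)]
    rows, cols = len(grid), len(grid[0])
    for r in range(rows):
        for c in range(cols):
            r2, c2 = r + dr, c + dc
            if 0 <= r2 < rows and 0 <= c2 < cols:
                counts[grid[r][c]][grid[r2][c2]] += 1
    total = sum(map(sum, counts))
    return [[v / total for v in row] for row in counts]

def contrast(p):
    """Contrast feature: expected squared level difference of a pair."""
    return sum((i - j) ** 2 * v
               for i, row in enumerate(p)
               for j, v in enumerate(row))
```

A surface whose quantized heights alternate rapidly yields a higher contrast than one with broad flat patches, which is the kind of difference such features are meant to capture.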

1.8  Markov  random  fields  and  neural  networks 
[28] 

There  are  significant  relationships  between  neural  net¬ 
work  and  Markov  Random  Field  approaches  to  low-level 
vision  problems.  We  have  investigated  connections  be¬ 
tween  these  areas  as  they  apply  to  image  estimation  and 
segmentation,  as  well  as  to  texture  classification  and  seg¬ 
mentation  of  textured  images. 

2  Recovery 

We  are  continuing  our  work  on  recovery,  i.e.  on  the 
theoretical  study  of  the  range  of  possible  capabilities  that 
vision  systems  could  have.  Our  work  has  concentrated  on 
the  analysis  of  visual  motion  and  on  shape  reconstruction 
from  cues  such  as  stereo,  shading,  etc. 

2.1  Analysis  of  visual  motion  [4,  19,  38,  57] 

It  has  been  argued  that  identifiable  features  such  as 
points and lines are the only things needed for comput¬
ing  motion  and  scene  structure  because  they  carry  re¬ 
liable  information,  there  exist  mathematical  and  com¬ 
putational  tools  to  treat  them,  and  the  extensive  expe¬ 
rience  gained  in  photogrammetry  can  be  tapped.  On 
the  other  hand,  the  number  of  image  pixels  covered  by 
these  features  is  only  a  tiny  fraction  of  the  total  number 
of  pixels  in  the  image.  If  one  uses  only  the  point  and 
line  features,  the  vast  majority  of  the  image  remains  un¬ 
used.  Besides  this  underutilization,  two  other  problems 
arise.  First,  there  is  no  consensus  among  researchers  in 
computer  vision  as  to  what  a  feature  is,  either  in  a  rigor¬ 
ous  mathematical  sense  or  even  in  an  intuitive  practical 
sense,  judging  from  what  different  detectors  detect  as 
features.  There  is  no  general  algorithm  to  detect  features 
and  match  them.  The  second  problem  is  that  there  ex¬ 
ists  no  algorithm  that  works  with  both  points  and  lines 
at  the  same  time  and  guarantees  a  unique  solution,  al¬ 
though  there  are  algorithms  for  each  of  them  separately. 
A  special  case  of  the  theory  we  have  developed  is  an 
algorithm  that  can  treat  both  of  them  at  the  same  time. 

We  are  also  continuing  our  efforts  on  examining  the 
stability  of  the  structure  from  motion  problem  when  we 
employ  correspondence  as  input,  using  a  statistical  ap¬ 
proach.  In  particular,  we  have  studied  the  inherent  am¬ 
biguities in recovering 3-D motion information from a
single optical flow field. These ambiguities are quanti¬
fied  using  the  Cramer-Rao  lower  bound  (CRLB),  which 
is  a  lower  bound  for  the  error  variances  of  motion  param¬ 
eter  estimates.  This  performance  bound  is  independent 
of  the  motion  estimation  algorithm,  and  can  always  be 
computed  for  any  arbitrary  3-D  motion  of  a  rigid  surface 
by  inverting  a  5  x  5  matrix.  As  a  special  case,  the  perfor¬ 
mance  bound  for  the  motion  of  3-D  rigid  planar  surfaces 
has  been  studied  in  detail.  The  dependence  of  the  bound 
on  several  factors,  such  as  the  underlying  motion,  surface 
position,  surface  orientation,  field  of  view,  and  density 
of  available  pixels,  can  be  derived  as  closed  form  expres¬ 
sions.  A  subset  of  these  results  supports  Adiv’s  recent 
analysis  of  the  inherent  ambiguities  of  motion  param¬ 
eters.  For  the  general  motion  of  an  arbitrary  surface, 
it  turns  out  that  not  every  pixel  gives  information  re¬ 
garding  3-D  motion  estimation.  We  have  shown  that 
the  aperture  problem  in  computing  the  optical  flow  re¬ 
stricts  the  nontrivial  information  about  the  3-D  motion 
to  a  sparse  set  of  pixels  at  which  both  components  of 
the  flow  velocity  are  observable.  Using  computer  simu¬ 
lations, we have studied the dependence of the inherent
ambiguities  on  the  underlying  motion,  the  field  of  view, 
and  the  number  of  feature  points  for  motion  in  front  of  a 
nonplanar  environment.  Also,  effects  of  two  smoothing 
schemes  on  estimation  accuracy  have  been  analyzed.  We 
have  shown  that  introducing  a  smoothness  constraint  by 
fitting  local  patches  to  3-D  depths  gives  lower  CRLBs. 
Not  surprisingly,  this  reduction  of  CRLBs  is  very  small. 
Further,  fitting  local  patches  also  relaxes  the  aperture 
problem  since  the  motion  information  is  not  restricted 
to  the  points  at  which  both  optical  flow  components  are 
observable.  In  contrast,  imposing  smoothness  on  the  op¬ 
tical  flow  by  regularization  does  not  lower  the  CRLBs. 
All  in  all,  our  results  indicate  that  robust  computation 
of  3-D  motion  from  noisy  optical  flow  is  still  far  from 
reality. 
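
As a reminder of the standard form of such a bound (not the specific derivation of the work summarized here), consider a simplified setting: the measured flow at pixel i is the flow u_i(θ) predicted by a five-component motion parameter vector θ plus independent zero-mean Gaussian noise of variance σ² per component, with scene depth treated as known. The symbols θ, u_i, σ, N and J are notation introduced for this illustration only.

```latex
\operatorname{cov}(\hat{\theta}) \succeq J(\theta)^{-1},
\qquad
J(\theta) = \frac{1}{\sigma^{2}} \sum_{i=1}^{N}
  \left( \frac{\partial u_i}{\partial \theta} \right)^{\!\top}
  \left( \frac{\partial u_i}{\partial \theta} \right),
\qquad
\operatorname{var}(\hat{\theta}_k) \ge \left[ J(\theta)^{-1} \right]_{kk}
```

Here each Jacobian is 2 x 5, so J is the 5 x 5 Fisher information matrix, and the bound applies to any unbiased estimator, which is why it does not depend on the estimation algorithm. Pixels at which only one flow component is observable contribute a rank-deficient term to the sum.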

We  have  also  developed  an  algorithm  for  estimating 
image  motion  with  motion  segmentation  in  mind.  How¬ 
ever,  computing  optical  flow  remains  an  ill-posed  prob¬ 
lem  and  estimating  it  very  accurately  in  a  general  and 
noisy  situation  could  be  impossible.  In  any  case,  we  need 
to  estimate  a  visual  quantity  with  an  accuracy  that  will 
allow  us  to  perform  the  task  at  hand.  Various  tasks 
may  require  different  accuracies.  We  have  developed  an 
algorithm  for  estimating  optical  flow  with  motion  seg¬ 
mentation  in  mind,  i.e.  the  algorithm  is  partly  guided 
by  a  motion  segmentation  process.  Whether  its  accuracy 
is  enough  for  tasks  like  shape  from  flow  or  3-D  motion 
detection  is  a  research  issue  we  are  currently  investigat¬ 
ing. 

Image  motion  is  estimated  by  matching  feature  “inter¬ 
est”  points  in  different  frames  of  video  image  sequences. 
The  matching  is  based  on  local  similarity  of  the  dis¬ 
placement  vectors.  Clustering  in  the  displacement  vec¬ 
tor  space  is  used  to  determine  the  set  of  plausible  match 
vectors.  Subsequently,  a  similarity  based  algorithm  per¬ 
forms  the  actual  matching.  The  feature  points  are  com¬ 
puted  using  a  multiple  filter  image  decomposition  op¬ 
erator.  The  algorithm  has  been  tested  on  synthetic  as 
well  as  real  video  images.  The  novelty  of  the  approach 
consists of the fact that it handles multiple motions and
performs  motion  segmentation. 
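
The idea of clustering in displacement space can be sketched with a simple vote histogram. This is an illustration under stated assumptions, not the algorithm summarized above: interest points have integer pixel coordinates, clustering is reduced to counting identical displacement vectors, and the final matching is an exact lookup rather than a similarity-based procedure. Function and parameter names are hypothetical.

```python
from collections import Counter

def dominant_displacements(points_a, points_b, max_shift, min_votes):
    """Vote for every candidate displacement between two frames and keep
    those supported by at least min_votes point pairs."""
    votes = Counter()
    for ax, ay in points_a:
        for bx, by in points_b:
            dx, dy = bx - ax, by - ay
            if abs(dx) <= max_shift and abs(dy) <= max_shift:
                votes[(dx, dy)] += 1
    return [d for d, n in votes.most_common() if n >= min_votes]

def match_points(points_a, points_b, displacements):
    """Assign each first-frame point the first plausible displacement that
    lands on a second-frame point; the labels segment the motions."""
    targets = set(points_b)
    matches = {}
    for ax, ay in points_a:
        for dx, dy in displacements:
            if (ax + dx, ay + dy) in targets:
                matches[(ax, ay)] = (dx, dy)
                break
    return matches
```

Each surviving displacement corresponds to one motion, so grouping the matched points by their assigned vector gives a crude motion segmentation.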

Over  the  last  five  years  we  have  been  developing  fea¬ 
ture  based  motion  estimation  methods  using  a  long  se¬ 
quence  of  image  frames.  In  our  previous  work  we  used 
quaternions  to  represent  the  rotational  motion.  The  oc¬ 
clusion  of  feature  points  was  not  addressed.  The  non¬ 
linear  plant  model  used  in  the  recursive  equation  also 
required  a  time-consuming  numerical  integration  step 
to  update  the  covariances  of  the  estimated  motion  and 
structure  parameters.  Recently,  we  have  developed  a 
simplified  kinematic  model  based  approach  with  an  eye 
towards  developing  robust  but  faster  algorithms  for  3-D 
motion  estimation.  This  approach  is  based  on  repre¬ 
senting  the  constant  translational  velocity  and  constant 
angular  velocity  motion  using  nine  rectilinear  motion  pa¬ 
rameters,  which  are  the  3-D  vectors  of  initial  position, 
linear  and  angular  velocities.  The  structure  of  the  mov¬ 
ing  object  is  represented  by  the  coordinates  of  feature 
points  in  a  3-D  coordinate  system  fixed  on  the  object. 
The  measurements  are  noisy  perturbations  of  2-D  image 
locations  of  feature  points.  A  nonlinear  least  squares 
method  has  been  developed  to  compute  both  estimates 
of  motion  and  structure  parameters  using  the  first  few 
frames.  Nonlinear  Kalman  filters  are  then  used  for  the 
incremental updating of motion and structure pa¬
rameters.  Since  the  plant  models  are  linear  in  the  new 
formulation,  closed  form  solutions  are  possible  for  fea¬ 
ture  tracking,  thus  avoiding  the  time-consuming  numer¬ 
ical  integration  steps  required  in  the  implementation  of 
nonlinear  filters.  Problems  due  to  occlusion  and  tempo¬ 
rally  nonuniform  images  have  also  been  addressed. 

2.2  Transparency  and  texture  [2,  32] 

We have continued our work on transparency, motivated
by the fact that the extraction of image velocity in the
presence of transparency is not accounted for by current
motion theories and by their limited ability to discriminate
3-D texture patterns.

Perceptual  transparency,  which  occurs  whenever  (i) 
two  or  more  patterns  are  perceived  as  lying  at  different 
depth  levels,  and  (ii)  one  pattern  is  seen  through  other 
patterns,  can  be  generated  by  different  visual  attributes, 
such as depth from stereo, luminance, motion or texture.
According  to  this  definition,  perceptual  transparency 
can  still  occur  even  if  the  patterns  are  not  physically 
transparent,  or  if  they  partially  occlude.  We  have  ana¬ 
lyzed  how  geometrical  information  can  affect  the  percep¬ 
tion  of  motion  transparency  and  coherence.  This  infor¬ 
mation  is  given  by  contours  or  features  such  as  line  end¬ 
points  or  corners.  Contours,  edges  or  intensity  gradients 
can  correspond  to  the  projection  into  the  image  plane 
of  occluding  object  boundaries,  regions  of  surface  curva¬ 
ture  inflection,  or  surface  markings,  and  they  constitute 
an  important  source  of  visual  information.  We  have  pro¬ 
posed  a  model  for  the  perception  of  motion  transparency 
and  coherence  which  is  based  on  a  three-stage  process 
for  the  extraction  of  image  velocity.  Our  model  is  dif¬ 
ferent  from  Hildreth’s  model  for  the  extraction  of  image 
velocity  which  enforces  a  smoothness  constraint  on  the 
image velocity along contours. The three-stage process
which describes our model corresponds to a generalization
of the two-stage process proposed by Adelson and
Movshon;  the  third  stage  is  given  by  the  accumulation 
of  votes  for  each  velocity  vector  in  the  velocity  space, 
and  this  generates  a  velocity  histogram.  The  number 
of  prominent  peaks  in  this  velocity  histogram  is  related 
to  specific  types  of  motion  perception;  for  two  superim¬ 
posed  patterns,  we  perceive  motion  transparency,  coher¬ 
ence,  or  mixed  motion  perception,  depending  on  whether 
the  velocity  histogram  is  bimodal,  unimodal  or  trimodal. 
The  conditions  for  perceptual  motion  transparency  or 
coherence,  which  are  given  by  the  ratios  between  the 
spread  or  heights  of  the  velocity  histogram  peaks,  in¬ 
volve  the  relation  between  geometrical  information,  like 
contour  curvature,  and  (system)  noise. 
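
The mapping from histogram shape to percept can be sketched as follows. This is only an illustration under stated assumptions: the histogram is reduced to one dimension, a peak counts as "prominent" when it is a strict local maximum reaching a hypothetical fraction `min_fraction` of the tallest bin, and the function names are invented here. The conditions in the text involve ratios of peak spreads and heights, which this sketch does not model.

```python
def count_prominent_peaks(hist, min_fraction=0.2):
    """Count strict local maxima at least min_fraction as tall as the tallest bin.

    Flat-topped peaks are ignored by this simple test and would need extra handling.
    """
    if not hist:
        return 0
    threshold = min_fraction * max(hist)
    peaks = 0
    for i, h in enumerate(hist):
        left = hist[i - 1] if i > 0 else -1
        right = hist[i + 1] if i < len(hist) - 1 else -1
        if h > 0 and h >= threshold and h > left and h > right:
            peaks += 1
    return peaks


def classify_motion(hist, min_fraction=0.2):
    """Map the number of prominent peaks to the percept named in the text."""
    labels = {1: "coherence", 2: "transparency", 3: "mixed"}
    return labels.get(count_prominent_peaks(hist, min_fraction), "undetermined")


if __name__ == "__main__":
    print(classify_motion([0, 1, 8, 1, 0, 0, 1, 9, 2, 0]))
```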

We  have  also  proposed  a  method  for  the  discrimina¬ 
tion  of  3-D  texture  patterns.  The  visual  discrimination 
of  objects  made  up  of  small  densely  distributed  opaque 
elements,  e.g.  trees  and  bushes  in  a  forest,  is  a  common 
task in natural environments. As we walk through a forest
we may need to determine the relative positions of
the  trees  and  bushes  for  purposes  of  navigation  or  ma¬ 
nipulation.  Trees  and  bushes  are  a  subset  of  what  we 
call  3-D  texture,  that  is,  objects  defined  by  the  3-D  dis¬ 
tribution  of  texels.  These  texels  can  be  solid  or  planar, 
opaque  or  transparent. 

Current  models  of  texture  describe  conditions  for  the 
discriminability  of  texture  boundaries  and  grouping,  and 
also  treat  the  reconstruction  of  surface  shape  from  tex¬ 
ture.  Nearly  all  of  these  models  describe  texture  in  terms 
of  the  2-D  distribution  of  texels  on  smooth,  opaque  sur¬ 
faces.  3-D  texture  poses  a  challenge  to  current  low-level 
vision  models  because  it  makes  it  necessary  to  cope  with 
texel  occlusion  and  transparency.  The  reconstruction  of 
the  shape  of  a  3-D  texture,  e.g.  a  tree  or  a  bush,  as 
well  as  the  computation  of  the  relative  distances  and 
depths of 3-D textures, requires the use of prior knowledge.
Conventional assumptions, such as the smoothness
constraint in regularization theory, or the assumption
of local planarity in shape-from-texture models, do
not  apply  to  this  problem.  Trees  and  bushes  are  de¬ 
scribed  by  three-dimensional  (volumetric)  distributions 
of  leaves  and  branches  and  not  by  smooth  surfaces.  Al¬ 
though  leaves  can  be  assumed  to  be  approximately  flat, 
and  thus  to  have  a  given  tilt  and  slant,  they  are  not  reg¬ 
ularly  oriented  in  space.  Tilt  and  slant  are  not  appropri¬ 
ate  variables  for  describing  the  shape  of  a  tree  or  a  bush. 
Motion  cues,  generated  by  a  moving  observer,  are  impor¬ 
tant  for  the  discrimination  of  trees  and  bushes  in  a  for¬ 
est.  Ordinarily,  as  described  by  structure-from-motion 
models,  there  exists  a  one-to-one  relation  between  the 
3-D  velocity  of  a  (rigid)  object  in  space  and  its  image 
velocity.  On  the  other  hand,  when  we  observe  a  tree  or 
bush  in  relative  motion  most  of  its  leaves  are  partially 
or  totally  occluded.  This  makes  the  computation  of  the 
velocities  of  individual  leaves  difficult,  if  not  impossible. 

We  have  implemented  a  method  for  the  discrimination 
of  tree-like  objects  by  analyzing  the  relative  motions  of 
piecewise  opaque  patterns  in  the  fronto-parallel  plane. 
These patterns represent the projections onto the image
plane of tree-like objects. For two piecewise opaque
patterns moving across one another the (global) velocity
histogram  is  bi-modal.  We  achieve  “discrimination”  by 
counting  the  number  of  prominent  peaks  in  the  veloc¬ 
ity  histogram;  this  should  correspond  to  the  number  of 
tree-like  objects  in  space,  provided  the  objects  are  not 
too  close  to  each  other. 

We  assume  that  the  observer  has  only  translational 
(horizontal  or  vertical)  motion  relative  to  a  static  scene. 
By using perspective projection it is easy to verify that
the image velocity of a point in the image is proportional
to the observer velocity and inversely proportional to the
depth of the corresponding scene point. Therefore the
difference between the image velocities of leaves and
branches belonging to different patterns (trees) is larger
than the variability of the velocities of leaves and branches
belonging to the same pattern. Consequently, the velocity
histogram for two superimposed patterns will have two
prominent peaks; the
spread of these two peaks will be generated, in part,
by the difference in depth between the frontmost and
rearmost leaves and branches within each tree. The ratio
of the positions of the two peaks is equal to the inverse ratio of
the  (average)  depths  of  the  two  trees. 
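
As a small worked example of the relation between peak positions and depths, the sketch below assumes the standard perspective relation for an observer translating parallel to the image plane, v = f V / Z, with focal length f, observer speed V and depth Z. The function names and the numbers are illustrative assumptions, not values from the experiments.

```python
def image_velocity(observer_speed, depth, focal_length=1.0):
    """Image speed of a static point at the given depth for a translating observer."""
    return focal_length * observer_speed / depth


def depth_ratio_from_peaks(v_near_peak, v_far_peak):
    """Return Z_far / Z_near, which equals v_near / v_far under v = f V / Z."""
    return v_near_peak / v_far_peak


if __name__ == "__main__":
    v_near = image_velocity(2.0, 4.0)    # tree at depth 4
    v_far = image_velocity(2.0, 10.0)    # tree at depth 10
    print(depth_ratio_from_peaks(v_near, v_far))
```

The nearer tree produces the faster peak, so the ratio of peak positions recovers the ratio of average depths without knowing the observer speed or the focal length.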

2.3 Shape from X; stereo [25, 33, 36, 52]

We  have  developed  a  new  approach  to  the  problem  of 
shape  from  shading.  Assuming  uniform  albedo  and  a 
Lambertian  surface  for  the  imaging  model,  we  first  esti¬ 
mate  the  illuminant  direction  and  surface  albedo.  With 
the  estimated  reflectance  map  parameters,  we  then  com¬ 
pute  the  surface  shape  using  a  new  procedure,  which 
implements  the  smoothness  constraint  by  requiring  the 
gradients  of  reconstructed  intensity  to  be  close  to  the 
gradients  of  the  input  image.  The  new  algorithm  is  data 
driven,  stable,  updates  the  surface  slope  and  height  maps 
simultaneously,  and  significantly  reduces  the  residual  er¬ 
rors  in  irradiance  and  integrability  terms. 
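
For readers unfamiliar with the imaging model, the following sketch evaluates the intensity of a Lambertian surface patch with uniform albedo from its surface gradient (p, q) and the illuminant direction. It shows only the forward model that such a method inverts; the function name, the gradient sign convention (normal proportional to (-p, -q, 1)) and the example values are assumptions, and the estimation procedure described above is not reproduced.

```python
import math


def lambertian_intensity(p, q, light, albedo=1.0):
    """Intensity of a Lambertian patch with gradient (p, q) = (dz/dx, dz/dy)."""
    # Unit surface normal for the height map z(x, y).
    norm = math.sqrt(1.0 + p * p + q * q)
    n = (-p / norm, -q / norm, 1.0 / norm)
    # Normalize the illuminant direction so only its orientation matters.
    lx, ly, lz = light
    length = math.sqrt(lx * lx + ly * ly + lz * lz)
    s = (lx / length, ly / length, lz / length)
    cos_angle = n[0] * s[0] + n[1] * s[1] + n[2] * s[2]
    # Patches facing away from the light are in attached shadow.
    return albedo * max(0.0, cos_angle)


if __name__ == "__main__":
    print(lambertian_intensity(0.0, 0.0, (0.0, 0.0, 1.0), albedo=0.8))
```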

In  addition,  we  have  designed  a  feature-based  stereo 
matching  system.  We  use  a  hierarchical  grouping  pro¬ 
cess  that  groups  line  segments  into  more  complex  struc¬ 
tures  that  are  easier  to  match.  The  hierarchy  consists 
of  lines,  vertices,  edges  and  surfaces.  Matching  starts 
at  the  highest  level  of  the  hierarchy  (surfaces)  and  pro¬ 
ceeds  to  the  lowest  (lines).  Higher  level  features  are 
easier  to  match,  because  they  are  fewer  in  number  and 
more distinct in form. These matches then constrain
the  matches  at  lower  levels.  Perceptual  and  structural 
relations  are  used  to  group  matches  into  islands  of  cer¬ 
tainty.  A  Truth  Maintenance  System  (TMS)  is  used  to 
enforce  grouping  constraints  and  eliminate  inconsistent 
match  groupings.  The  TMS  is  also  used  for  reasoning 
in  the  presence  of  uncertainty  and  to  carry  out  belief 
revisions  necessitated  by  additions,  deletions  and  confir¬ 
mations  of  hypotheses. 

3  Purposive  vision  [20,  21,  40,  43] 

Our  efforts  on  building  the  Medusa  system  are  continu¬ 
ing  on  a  theoretical  basis.  We  have  developed  a  set  of 
navigational  techniques  in  which  the  input  is  the  normal 
optical  flow  (the  spatiotemporal  derivatives  of  the  image 
intensity  function)  and  where  full  or  accurate  reconstruc¬ 
tion  of  the  visible  world  is  not  needed.  Inferences  about 
the three-dimensional environment are based on simple
image measurements and qualitative strategies employed
by  an  active  observer.  In  particular,  we  have  developed 
solutions  to  the  problems  of  egomotion  estimation,  ob¬ 
stacle  avoidance,  relative  depth  computation,  estimation 
of  3-D  motion,  tracking  and  detection  of  independent 
motion  by  a  moving  observer.  These  solutions  are  de¬ 
scribed  in  a  separate  paper  in  these  Proceedings. 

4  Navigation 

Our  research  on  navigation  has  emphasized  two  areas: 
“stealth”  path  planning  on  terrain  and  navigation  in  an 
uncertain,  dynamic  environment. 

4.1  Stealth  terrain  path  planning  [18,  41,  56] 

We  are  developing  data  parallel  algorithms  for  planning 
paths  for  groups  of  ground  vehicles  over  natural  terrain 
subject  to  constraints  on  navigability,  visibility  (from  ad¬ 
versaries  moving  through  the  terrain)  and  motion  pro¬ 
tocols  (e.g.,  maintaining  line  of  sight  communication  be¬ 
tween  pairs  of  vehicles).  The  planning  algorithms  have 
been  implemented  on  a  CM2.  The  application  of  this 
research  to  the  development  of  a  “bounding  overwatch” 
path  planner  is  described  in  a  paper  in  these  proceedings. 
Related  research  includes  the  development  of  a  data  par¬ 
allel  algorithm  for  constructing  a  constrained  Delaunay 
triangulation  of  a  digital  terrain  model.  This  algorithm 
has  also  been  implemented  on  the  CM2  and  can  produce 
a  Delaunay  triangulation  of  a  512  x  512  digital  terrain 
map  in  under  70  seconds. 

4.2  Navigation  with  uncertainty  [44,  47,  48] 

We  have  developed  a  probabilistic  method  for  noisy  sen¬ 
sor  based  robotic  navigation  in  dynamic  environments. 
The  method  generates  an  optimal  trajectory  by  consider¬ 
ing  as  optimality  criteria  the  probability  of  not  colliding 
with  the  obstacles  and  the  probability  of  accessing  an 
operational  position  with  respect  to  a  moving  target  ob¬ 
ject.  Estimates  of  the  obstacles’  kinematic  parameters 
and  measures  of  confidence  in  these  estimates  are  used 
to  produce  the  probability  of  collision  associated  with 
any  robot  displacement.  The  probability  of  collision  is 
derived  in  two  steps:  a  stochastic  model  is  defined  in 
the  kinematic  state  space  of  the  obstacles,  and  collision 
events  are  given  simple  geometric  characterizations  in 
this  state  space. 

We  have  also  considered  the  problem  of  efficient  path 
planning  for  a  point  robot  in  a  partially  known  dynamic 
environment,  where  the  static,  known  part  of  the  envi¬ 
ronment  consists  of  point  shelters  distributed  in  planar 
terrain,  and  the  dynamic,  unknown  part  is  abstracted  in 
the  form  of  alarms  that  cause  the  robot  to  leave  its  cur¬ 
rent  (pre-planned)  path  and  divert  to  the  nearest  shelter. 
We  have  carried  out  a  probabilistic  analysis  of  the  ex¬ 
pected  times  for  the  dynamic  paths  generated  when  the 
alarms follow a Poisson distribution with parameter λ.
A case study with three shelters has been performed to
illustrate the dependence of the expected travel times on
λ for two alternate static paths. Two different strategies
have been developed for the general case of n shelters
and have been shown to be superior for different ranges
of values of the alarm rate λ (very low and very high
values,  respectively). 
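
To make the role of the alarm rate concrete, the sketch below assumes that alarms arrive as a Poisson process in time with a constant rate (called `lam` here) and computes two elementary quantities for a path segment of duration t: the probability that an alarm interrupts the segment, and the expected time spent on the segment before either an alarm or arrival. These formulas are standard properties of the Poisson process; the function names are hypothetical, and the full analysis of shelter diversion described above is not reproduced.

```python
import math


def prob_alarm_before(t, lam):
    """Probability that at least one alarm occurs within a segment of duration t."""
    return 1.0 - math.exp(-lam * t)


def expected_time_on_segment(t, lam):
    """Expected value of min(first alarm time, t), i.e. time spent before diverting."""
    if lam == 0.0:
        return t  # no alarms ever occur, so the whole segment is traversed
    return (1.0 - math.exp(-lam * t)) / lam


if __name__ == "__main__":
    for lam in (0.01, 0.1, 1.0):
        print(lam, prob_alarm_before(5.0, lam), expected_time_on_segment(5.0, lam))
```

For very low rates the segment is almost always completed, while for very high rates the robot is diverted almost immediately, which is consistent with different strategies being preferable in the two regimes.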

5  Recognition 

5.1  Invariants  [14,  23] 

Invariants  of  shapes  are  of  major  importance  in  object 
recognition  because  they  are  independent  of  the  point  of 
view  from  which  the  shape  is  seen.  However,  they  require 
a  relatively  high  number  of  parameters  that  need  to  be 
extracted  from  the  shape,  which  raises  concerns  about 
the  reliability  of  such  descriptors.  We  have  addressed 
these  accuracy  and  reliability  issues  and  have  shown  that 
invariants  can  be  made  robust  to  noise.  We  have  found 
new  differential  invariants  requiring  fewer  derivatives;  we 
use  smoothing  techniques  which  give  much  better  re¬ 
sults  than  the  Gaussian.  Our  experiments  show  that 
the  derivatives  involved  do  not  pose  a  serious  problem, 
provided  we  make  the  right  choice  of  smoothing  method 
and  parameters.  Further  details  can  be  found  in  a  paper 
in  these  proceedings. 

For two-dimensional shapes, it turns out to be possible
to define approximate invariants that are insensitive
to slants of 70° or more. This makes it possible
to recognize a two-dimensional shape from a single
perspective projection image taken from an unknown
(three-dimensional) viewpoint. Our method is
based  on  quadratic  approximations  to  the  effect  of  the 
slant  of  an  inverse  perspective  transformation  on  angles 
and  lengths.  These  approximations  allow  us  to  define 
contour-based  properties  that  are  nearly  invariant  under 
perspective  transformation.  The  method  can  be  used 
to  recognize  partially  occluded  shapes,  as  well  as  shapes 
that are not exactly related by perspective transformations.
When a shape is recognized, the method also provides
estimates of its tilt and slant.

5.2 Pose estimation [24]

Our  research  on  object  recognition  in  range  images  has 
focused  on  a  variant  of  a  generalized  Hough  transform. 
To  obtain  data  points  in  the  rotation  parameter  space, 
pairs  of  surface  normals  are  chosen  at  random  from  the 
range  image.  Each  pair  of  surface  normals  is  matched  to 
all  pairs  of  surface  normals  from  the  models,  provided 
they  make  similar  angles.  Each  match  defines  a  rota¬ 
tion  axis  and  a  rotation  angle  between  the  model  and 
the  scene  object.  To  obtain  data  points  in  the  transla¬ 
tion  parameter  space,  triples  of  range  pixels  are  chosen 
at  random.  If  the  corresponding  patches  are  found  to  be 
non-coplanar,  these  triples  of  planar  patches  are  matched 
to  triples  of  faces  for  the  models,  provided  their  surface 
normals  make  compatible  angles.  Each  match  defines 
a  new  point  in  translation  space.  A  Least  Median  of 
Squares  (LMS)  method  is  used  for  clustering  in  the  pa¬ 
rameter  spaces.  The  LMS  method  can  accurately  esti¬ 
mate  the  centers  of  clusters  of  data  points  in  the  presence 
of  outliers  on  condition  that  the  number  of  clusters  is 
known,  but  in  this  problem  the  number  of  clusters  is  un¬ 
known.  Therefore,  in  each  parameter  space,  spheres  are 
considered  with  their  centers  at  the  nodes  of  a  3-D  grid. 
The LMS method is applied to the data points within
the spheres to find cluster centers. The spheres are then
recentered  at  these  centers,  and  the  process  is  repeated 
until  the  spheres  do  not  move.  The  spheres  that  contain 
the  largest  numbers  of  data  points  are  considered  to  be 
centered  at  the  most  significant  clusters.  The  final  cen¬ 
ters  are  kept  as  hypotheses  for  the  pose  parameters.  The 
hypotheses  are  verified  by  a  similarity  measure. 
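
The robustness of the Least Median of Squares estimate can be illustrated in one dimension, where the estimate is known to be the midpoint of the shortest interval containing just over half of the data. The sketch below is a simplification: the parameter spaces above are three-dimensional and the estimator is applied within spheres, whereas this example uses scalar data and an invented function name.

```python
def lms_location(points):
    """One-dimensional Least Median of Squares location estimate.

    Returns the midpoint of the shortest window that contains
    h = n // 2 + 1 of the sorted data points.
    """
    data = sorted(points)
    n = len(data)
    if n == 0:
        raise ValueError("at least one data point is required")
    h = n // 2 + 1
    best_width = None
    best_mid = None
    for i in range(n - h + 1):
        width = data[i + h - 1] - data[i]
        if best_width is None or width < best_width:
            best_width = width
            best_mid = 0.5 * (data[i] + data[i + h - 1])
    return best_mid


if __name__ == "__main__":
    # Five points near 1.0 and two gross outliers.
    print(lms_location([1.0, 1.1, 0.9, 1.05, 0.95, 10.0, 12.0]))
```

Unlike the mean, which the two outliers would pull far from the cluster, the estimate stays inside the dense group as long as fewer than half of the points are outliers.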

5.3  Recognition  by  parts  [51] 

We  have  developed  an  approach  to  the  recovery  and 
recognition  of  3-D  objects  from  a  single  2-D  image.  The 
approach  is  motivated  by  the  need  for  more  powerful 
indexing  primitives,  and  shifts  the  burden  of  recogni¬ 
tion  from  the  model-based  verification  of  simple  image 
features  to  the  bottom-up  recovery  of  complex  volumet¬ 
ric  primitives.  Given  a  recognition  domain  consisting 
of  a  database  of  objects,  we  first  select  a  set  of  object- 
centered  3-D  volumetric  modeling  primitives  that  can 
be  used  to  construct  the  objects.  Next,  using  a  CAD 
system,  we  generate  the  set  of  aspects  of  the  primitives. 
Unlike  typical  aspect-based  recognition  systems  that  use 
aspects  to  model  entire  objects,  we  use  aspects  to  model 
the  finite  set  of  parts  from  which  the  objects  are  con¬ 
structed.  Consequently,  the  number  of  aspects  is  fixed 
and  independent  of  the  size  of  the  object  database.  To  ac¬ 
commodate  the  matching  of  partial  aspects  due  to  prim¬ 
itive  occlusion,  we  introduce  a  hierarchical  aspect  repre¬ 
sentation  based  on  the  projected  surfaces  of  the  primi¬ 
tives;  a  set  of  conditional  probabilities  captures  the  am¬ 
biguity  of  mappings  between  the  levels  of  the  hierarchy. 

From  a  region  segmentation  of  the  input  image,  we 
formulate  primitive  recovery  as  the  problem  of  grouping 
the  regions  into  aspects.  No  domain  dependent  heuris¬ 
tics  are  used;  we  exploit  only  the  probabilities  inherent 
in  the  aspect  hierarchy.  Once  the  aspects  are  recovered, 
we  use  the  aspect  hierarchy  to  infer  a  set  of  volumetric 
primitives  and  their  connectivity  relations.  Subgraphs  of 
the  resulting  graph,  in  which  nodes  represent  3-D  prim¬ 
itives  and  arcs  represent  primitive  connections,  are  used 
as  indices  to  the  object  database.  The  verification  of 
object  hypotheses  consists  of  a  topological  verification 
of  the  recovered  graph,  rather  than  a  geometrical  veri¬ 
fication  of  image  features.  A  system  has  been  built  to 
demonstrate  the  approach,  and  it  has  been  successfully 
applied  to  both  synthetic  and  real  images. 

6  Parallel  algorithms  for  vision  and 
planning  [7,  22,  26,  39] 

Traditionally, vision and visual navigation have been
driving  forces  behind  the  development  of  high  perfor¬ 
mance  parallel  processing  systems.  Our  research  in  par¬ 
allel  algorithms  for  vision  and  planning  focuses  on  data 
parallel  algorithms  with  two  complementary  emphases: 
First,  we  are  concerned  with  developing  algorithms  hav¬ 
ing  optimal  or  near  optimal  computational  and  com¬ 
munication  complexity,  with  additional  emphasis  on  the 
scalability of algorithms with respect to architectural factors
such as processor set size, memory size, etc. Second,
we  want  our  algorithms  to  be  practically  efficient,  which 
means, on the one hand, that they can be easily implemented
using standard data parallel programming constructs
(e.g., prefix operations, basic data reorderings,
etc.)  and,  on  the  other  hand,  that  the  algorithms  run 
fast  when  implemented  on  scalable  parallel  processors. 
In  the  following  subsections,  we  describe  our  research  on 
massively  parallel  algorithms  for  focus  of  attention  vision 
and  contour  data  processing. 

6.1  Replicated  image  processing 

We  are  interested  in  developing  data  parallel  algorithms 
that  will  allow  us  to  efficiently  process  small  images  on 
massively  parallel  computers.  Such  small  images  might 
correspond  to  tracking  windows  in  a  time-varying  image 
sequence,  prediction  windows  in  a  hypothesize-and-test 
object  recognition  system,  or  high  levels  in  a  multiresolu¬ 
tion  image  data  structure.  We  have  developed  a  general 
methodology  for  designing  such  data  parallel  algorithms, 
and  have  illustrated  its  application  to  a  variety  of  prob¬ 
lems  in  low  level  vision  including  convolution,  rank-order 
filtering,  and  morphological  operations.  The  methodol¬ 
ogy  is  based  on  replicating  the  image  to  be  processed, 
and  then  decomposing  the  operation  to  be  performed 
into  data  parallel  components.  We  have  carried  out  an 
analysis  of  replicated  image  analysis  algorithms  on  a  va¬ 
riety  of  common  interconnection  networks  (2  and  3-D 
meshes,  hypercubes)  and  identified  the  conditions  (spec¬ 
ified  in  terms  of  basic  architectural  parameters)  in  which 
replicated  algorithms  will  lead  to  speedups  over  conven¬ 
tional  data  parallel  solutions.  We  have  also  developed  a 
replicated  image  analysis  algorithm  for  rank  order  filter¬ 
ing  and  have  tested  it  experimentally  on  both  the  Con¬ 
nection  Machine  and  Maspar. 

6.2 Contour data processing

The  analysis  of  contours  plays  a  fundamental  role  in 
many  computer  vision  algorithms  (e.g.,  object  recogni¬ 
tion,  stereo,  motion  analysis).  We  have  studied  prob¬ 
lems  associated  with  efficiently  processing  image  con¬ 
tours  using  data  parallel  algorithms  on  massively  parallel 
machines.  Traditionally,  parallel  algorithms  for  contour 
processing  have  operated  on  the  original  two  dimensional 
image  plane  representation.  The  principal  advantage  of 
this  is  that  the  spatial  relationships  between  contours  are 
implicitly  encoded  in  the  representation.  However,  there 
are  also  significant  disadvantages  to  processing  contours 
embedded  in  the  2-D  image.  One  disadvantage  is  that 
standard  data  parallel  algorithms  will  spend  machine  cy¬ 
cles  analyzing  all  of  the  image  pixels  that  do  not  lie 
on  contours.  Since  the  number  of  physical  processors 
in  even  large  parallel  machines  is  far  smaller  than  the 
number  of  pixels  in  even  a  television  frame,  this  leads 
to  very  inefficient  algorithms.  A  second  disadvantage  is 
that  it  is  generally  not  possible  to  develop  asymptotically 
efficient  algorithms  for  contour  processing  in  the  array 
representation.  Specifically,  one  cannot  employ  parallel 
prefix  operations,  which  would  result  in  algorithms  hav¬ 
ing  computational  complexity  that  is  logarithmic  in  the 
length  of  image  contours;  instead,  most  algorithms  will 
be  linear  in  the  length  of  the  contours. 

We have developed a log N EREW algorithm (where
N is the length of the longest image contour) for transforming
image contours from their image plane embedding
to a packed, one dimensional data structure and
have  illustrated  the  advantages  of  this  representation 
by  presenting  log  N  algorithms  for  piecewise  approxima¬ 
tion,  point-in-polygon,  and  basic  local  contour  feature 
extraction  algorithms  (e.g.,  curvature  estimation,  corner 
detection).  These  algorithms  have  been  implemented  on 
a CM2. We have also developed an O(k log N) algorithm
for  computing  the  visibility  graph  of  a  simple  polygon, 
and  have  presented  experimental  results  of  implementing 
the  algorithm  on  both  the  CM2  and  the  Maspar.  Here, 
N  is  the  length  of  a  contour  and  k  is  the  so-called  link 
diameter  of  the  polygon  represented  by  the  contour — 
the  maximum  number  of  straight  line  segments  needed 
to  connect  any  two  points  in  the  polygon. 
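
The benefit of the packed representation comes from parallel prefix operations, which finish in a number of rounds logarithmic in the contour length. The sketch below simulates a generic data parallel inclusive prefix sum (in the style usually attributed to Hillis and Steele) serially in Python, for example to accumulate arc length along a packed contour. It is not the EREW algorithm described above, and the function name and sample data are assumptions.

```python
def parallel_prefix_sum(values):
    """Simulate a data parallel inclusive prefix sum.

    Each round stands for one parallel step in which every element i >= step
    adds in the value held at position i - step. Returns the prefix sums and
    the number of rounds, which is ceil(log2(n)) for n > 1.
    """
    result = list(values)
    n = len(result)
    step = 1
    rounds = 0
    while step < n:
        # Build the new array from the old one so all reads see the previous round.
        result = [
            result[i] + result[i - step] if i >= step else result[i]
            for i in range(n)
        ]
        step *= 2
        rounds += 1
    return result, rounds


if __name__ == "__main__":
    segment_lengths = [1, 2, 3, 4, 5]  # hypothetical per-pixel step lengths
    print(parallel_prefix_sum(segment_lengths))
```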

7  Other  topics 

7.1  Robot  hand/eye  coordination  [6,  53] 

Traditional  approaches  to  robot  hand/eye  coordination 
require  that  various  components  of  the  system  be  cal¬ 
ibrated  with  respect  to  a  common  reference,  but  cali¬ 
bration  is  difficult  and  error-prone  and  may  invalidate 
the  complex,  high-precision  inverse  kinematic  computa¬ 
tions  that  are  also  a  feature  of  these  approaches.  We 
have developed a fundamentally new control technique
that  does  not  require  any  calibration  and  closely  inte¬ 
grates  visual  feedback  into  the  control  mechanism.  This 
is  made  possible  by  the  introduction  of  a  mapping,  called 
the  Perceptual  Kinematic  Map,  from  the  control  space 
of  the  manipulator  directly  onto  a  space  defined  by  a  set 
of  measurable  image  parameters.  Our  strategy  achieves 
robustness  by  monitoring  qualitative  rather  than  quan¬ 
titative  changes  as  it  explores  the  surface  defined  by  this 
mapping.  Furthermore,  it  employs  a  Kalman-Bucy  filter 
for  additional  robustness  in  measuring  image  parame¬ 
ters. 

This  approach  can  be  applied  to  dynamic  robot  hand 
positioning  tasks,  such  as  catching,  hitting,  interception, 
etc.,  that  involve  an  object  (target)  moving  in  the  vicin¬ 
ity  of  the  robot.  We  have  examined  the  different  lev¬ 
els  at  which  visual  input  is  involved  in  pursuing  the 
dynamically-defined  goal.  A  given  task  is  transformed 
into  one  of  constrained  trajectory  planning  on  the  per¬ 
ceptual  control  surface  defined  by  the  PKM. 

We  have  successfully  demonstrated  the  feasibility  of 
our  approach  by  implementing  the  individual  elements 
of the hand/eye system on available hardware: a 5-DOF
Mitsubishi  Movemaster  II  manipulator,  a  CCD  camera, 
and  a  Macintosh  II  for  processing  and  control.  In  partic¬ 
ular,  we  have  developed  a  module  for  the  visual  tracking 
of  the  hand  using  a  Kalman  filter.  Although  neither  the 
camera  nor  the  manipulator  was  calibrated — their  rela¬ 
tive  position  and  orientation  were  unknown — our  exper¬ 
iments  indicated  that  the  trajectories  of  the  parameters 
we  chose  agreed  closely  with  the  predicted  ones,  and  the 
mapping  (PKM)  was  quite  smooth. 

7.2  Aerial  image  analysis  [37,  46] 

We  have  developed  a  straight  line  extractor  that  pro¬ 
duces  high  quality  line  descriptions  from  aerial  images. 
The input to the line extractor is in the form of an edge
image,  where  the  contrast  and  direction  of  each  edge 
pixel  is  specified.  The  system  first  scans  the  edge  im¬ 
age  left  to  right,  top  to  bottom  and  assigns  a  line  label 
for  each  scanned  edge  pixel,  thereby  generating  a  label 
image.  At  the  end  of  this  process  each  edge  pixel  has  a 
line  label  associated  with  it  and  edge  pixels  that  belong 
to  the  same  line  will  be  assigned  the  same  line  label. 
Also,  with  each  line  label,  a  record  that  stores  the  end 
points,  the  average  contrast  and  the  pixel  support  of  the 
line  is  generated.  The  label  image  is  used  as  a  spatial 
index  to  further  link  fragmented  lines.  We  have  also 
developed  techniques  for  eliminating  many  of  the  physi¬ 
cally  insignificant  lines  from  the  database,  given  that  the 
domain  of  interpretation  is  aerial  images,  dominated  by 
man-made  objects. 

A  system  for  the  detection  of  a  class  of  buildings  in 
aerial images has also been developed. The process of
building  detection  is  carried  out  in  a  hierarchical  man¬ 
ner.  Pairs  of  line  segments  are  grouped  to  form  vertices. 
Vertices  are  then  grouped  to  form  edges.  Edges  are  com¬ 
posed  into  edge-rings.  The  process  of  building  detection 
is  posed  as  a  graph  search  problem,  where  the  nodes  of 
the  graph  are  edges  and  the  links  are  connectivity  rela¬ 
tions.  Closed  edge-rings  in  this  graph  may  correspond 
to  roofs.  Shadow  analysis  on  closed-rings  is  used  to  gen¬ 
erate  building  hypotheses.  Some  edge-rings  are  not  fully 
closed.  Heuristics  based  on  the  knowledge  of  the  shapes 
of  roofs  are  used  to  hypothesize  edges  top-down  to  close 
these  edge-rings.  A  dynamic  search  scheme  based  on  an 
Assumption  based  Truth  Maintenance  System  (ATMS) 
helps  in  integrating  bottom-up  and  top-down  searches. 
The  ATMS  assists  in  enforcing  constraints  on  the  shapes 
of  roofs  extracted  and  in  dealing  with  belief  revisions  as¬ 
sociated  with  incremental  additions,  deletions  and  con¬ 
firmations  of  edge  hypotheses. 

In  Section  1  we  described  a  search-based  system  for 
recognizing  “objects”  having  variable  numbers  of  parts 
which  have  only  loosely  constrained  shapes;  as  men¬ 
tioned  there,  this  system  has  been  successfully  applied 
to  finding  cloverleaf  intersections  in  aerial  images  of  road 
networks. 

7.3  Image  matching  and  registration  [16,  59] 

A  standard  technique  in  image  matching  is  to  apply 
relaxation  techniques  to  establish  correspondences  be¬ 
tween  patterns  of  localizable  (point-like)  features  in  the 
two  images.  We  have  extended  point-pattern  matching 
relaxation techniques to allow matching of both point-like
and linear features. Our approach makes use of a
compatibility  function  that  relies  on  relative  orientation 
information,  which  is  translation  and  rotation  invariant 
and  which  can  be  more  reliably  extracted  from  noisy  im¬ 
ages  than  can  positional  information.  The  function  has 
been  used  to  generalize  the  standard  relaxation  matching 
technique  based  on  point  patterns.  It  has  been  success¬ 
fully  applied  to  object  recognition  in  synthetic  aperture 
radar  (SAR)  imagery. 

Another  feature-based  image  matching  method  has 
been  applied,  with  excellent  results,  to  the  registration 
of pairs of partially overlapping images in which large
amounts of rotation and scaling have occurred between
the  two  images  and  the  images  are  devoid  of  significant 
features.  An  illuminant  direction  estimation  method  is 
first  used  to  obtain  an  initial  estimate  of  camera  rota¬ 
tion.  A  small  number  of  feature  points  are  then  located 
based  on  a  Gabor  wavelet  model  for  detecting  local  cur¬ 
vature  discontinuities.  An  initial  estimate  of  scale  and 
translation  is  obtained  by  pairwise  matching  of  the  fea¬ 
ture  points  detected  in  two  images.  Finally,  hierarchical 
feature  matching  is  performed  to  obtain  an  accurate  es¬ 
timate  of  translation,  rotation  and  scale.  Experiments 
with  synthetic  and  real  images  show  that  this  algorithm 
yields  accurate  results  when  the  scales  of  the  pair  of  im¬ 
ages  differ  by  up  to  10%,  their  overlap  is  as  small  as 
35%,  and  the  camera  rotation  between  the  two  images 
is  significant.  Further  details  can  be  found  in  a  paper  in 
these  proceedings. 

7.4  Discrete  and  fuzzy  geometry 
[10,  11,  29,  45] 

In  dealing  with  digital  images  in  two  and  three  dimen¬ 
sions,  a  rigorous  understanding  of  geometric  properties 
of  such  images  is  important  for  establishing  the  correct¬ 
ness  of  geometric  algorithms.  We  have  continued  to  in¬ 
vestigate  discrete  geometric  structures  on  various  levels. 
Two  contributions  to  this  area  during  the  past  year  in¬ 
clude  the  study  of  polygonal  arcs  and  polygons  in  three 
dimensions — in  particular,  their  compact  representation 
and  the  determination  of  their  topological  properties, 
particularly  knottedness  and  linkedness;  and  the  study  of 
connectedness  in  three-dimensional  digital  images,  lead¬ 
ing  to  a  proof  of  the  long  unresolved  conjecture  that  3D 
connectedness  is  not  recognizable  by  a  finite-state  array 
automaton. 

Geometrical  properties  of  “objects”  in  unsegmented 
(gray  level)  images  can  be  defined  by  regarding  the  ob¬ 
jects  as  fuzzy  subsets  of  the  image.  The  study  of  geomet¬ 
ric  properties  of  fuzzy  sets  has  begun  to  attract  atten¬ 
tion  in  the  fuzzy  set  community.  During  the  past  year 
we  have  made  two  contributions  to  this  area:  we  have 
developed  a  new  method  of  defining  the  medial  axis  of 
a  fuzzy  set  as  a  set  of  fuzzy  disks  whose  sup  is  the  set; 
and  we  have  developed  optimal  algorithms  for  computing 
connectedness  properties  of  fuzzy  graphs  (e.g.,  of  graphs 
representing  quantitative  relations  between  pairs  of  ob¬ 
jects). 
Abstract

Work  in  animate  vision  continues  at  all  levels 
from  low-level,  preattentive  object  tracking  and 
segmentation  utilities  to  Bayesian  techniques 
and  decision  theory  for  controlling  and  reason¬ 
ing  about  the  vision  process.  We  have  had 
success  with  pose-invariant  object  recognition, 
which  we  pursued  using  color  and  projectively 
invariant  geometric  features.  Learning  is  tak¬ 
ing  on  a  position  of  increasing  importance  in 
our  work  as  we  strive  for  more  adaptive  behav¬ 
ior  that  needs  less  a  priori  structuring. 

1  Laboratory  and  Aims 

Our  laboratory  continues  to  expand  to  meet  our  needs 
for  more  sophisticated  analysis,  data-gathering,  and  con¬ 
trol  hardware  for  our  work  in  animate  vision  [Ballard 
1991a, b].  The  laboratory  has  been  enhanced  by  a  Data- 
Glove  input  device  for  obtaining  input  from  humans  for 
a  variety  of  recognition  projects  and  Utah  hand  training 
projects.  Recognition  strategies  for  dynamic  trajectories 
in  various  spaces  (angle  spaces,  Cartesian  space)  we  are 
investigating  include  hidden  Markov  models  and  recur¬ 
rent  networks  [Cortes  et  al;  Simard  1991;  Simard  et  al. 
1990a, b; 1991a,b]. The dataglove can also be worn by
the Utah hand, and self-calibration is being investigated.

A  bidirectional  communications  library  for  the  Puma 
robot  (ROBOCOM)  was  written  by  Brian  Yamauchi, 
John  Soong,  Tim  Becker,  and  Prakash  Das  for  use  in  the 
Juggler  [Yamauchi  1991],  Checkers  [Marsh  et  al.  1991], 
and  open-loop  gaze  stabilization  projects  [Soong  and 
Brown  1991].  ROBOCOM  is  much  faster  than  the  earlier 
BOTLIB  package  since  it  does  not  use  the  multi-layered 
ISO-standard  structure  for  communication.  Robot  head 
positions can be read back into a host computer at
between 15 and 40 Hz, depending on how much other
computation is being done by VAL.

As  more  intimate  connection  is  made  between  frame- 
rate  computations  in  special  hardware  and  the  memory 
of  general  purpose  computers,  the  visual  computations 
that  can  be  performed  in  real  time  grow  more  complex. 
We  have  expanded  our  real-time  computing  resources 
with  two  more  MIMD  multi-computers.  The  Silicon 
Graphics  computer  is  a  modern  resource  with  eight  40 
MHz MIPS chips. We plan to use its graphics capabilities
to produce realistic scenes for computer analysis
and for input to psychophysical experiments. The
Transputer array consists of eight 4MByte T805 TRAMS
and  a  smart  VME  interface  that  lets  the  Transputers  di¬ 
rectly  read  and  write  VME  memory  and  hence  control 
the  MaxVideo  and  hand.  The  interface  also  has  serial 
ports  that  allow  robot  control  and  dataglove  input.  The 
KiwiVision  product  will  allow  Transputers  to  access  the 
MaxBus  directly,  speeding  higher-level  visual  processing. 
A  Sun4/330  workstation  acts  as  a  host  terminal  system. 

2  Low-level  Gaze  Control  and 
Segmentation 

We  use  cooperating  low-level  gaze  controls  to  improve 
tracking  and  segmentation  performance.  We  have  suc¬ 
cessfully  combined  vergence  and  tracking  capabilities. 
Vergence  [Olson  and  Coombs  1990]  limits  the  amount 
of  visual  information  coming  from  a  scene,  and  can  aug¬ 
ment  other  segmentation  techniques.  If  only  features 
having  zero  disparity  are  passed  through  a  filter  [von 
Kaenel  1991],  then  vergence  implements  a  disparity- 
based  filter  that  restricts  attention  to  the  “horopter”, 
or  3-D  locus  of  points  that  have  zero  disparity  at  the 
current  vergence.  Thus  disparity  filtering  can  separate 
a  spatially  coherent  target  from  foreground  and  back¬ 
ground  distractors.  Used  with  a  fovea  or  window  filter 
that  limits  visual  information  to  a  particular  solid  angle, 
disparity  filtering  limits  visual  information  to  a  small 
volume  of  space.  We  use  the  horopter  and  foveal  con¬ 
straints  in  a  tracking  system  that  pursues  a  moving  ob¬ 
ject  through  a  field  of  distractors  [Coombs  and  Brown, 
1990,  1991;  Coombs  et  al.  1990]. 

Object  tracking  continues  to  be  an  important  area  of 
our  work  [Bandyopadhyay  and  Ballard  1991].  We  have 
successfully used α-β-γ filters as predictors to improve
the  performance  of  visual  tracking  loops  that  move  the 
camera  to  keep  a  moving  object  centered.  We  have  stud¬ 
ied  various  approaches  to  the  problem  of  delay  in  con¬ 
trol  loops  [Brown  and  Coombs  1991].  We  have  also  im¬ 
plemented  open-loop  gaze  stabilization  capabilities  that 
keep  the  camera  pointed  at  a  3-space  point  despite  ar¬ 
bitrary  head  motion  by  making  compensating  pans  and 
tilts. The crux of the computation is an inverse kinematic
model of the robot head, and we have also derived
the inverse Jacobian so we can use a position or velocity
formulation [Soong and Brown 1991]. The open-loop
stabilization capability has been extended to maintain
the gaze on a point moving along a known trajectory.
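
The predictor role of the tracking filter can be illustrated with a minimal sketch of a textbook α-β-γ filter for a single image coordinate. The class name, the gain values, and the sample period `dt` are illustrative assumptions; they are not the parameters of the tracking loops described above.

```python
class AlphaBetaGammaFilter:
    """Textbook alpha-beta-gamma tracker for one coordinate (illustrative)."""

    def __init__(self, alpha, beta, gamma, dt, x0=0.0):
        self.alpha, self.beta, self.gamma, self.dt = alpha, beta, gamma, dt
        self.x = x0   # position estimate
        self.v = 0.0  # velocity estimate
        self.a = 0.0  # acceleration estimate

    def predict(self, horizon=None):
        """Extrapolate the position `horizon` seconds ahead (default: one sample)."""
        h = self.dt if horizon is None else horizon
        return self.x + self.v * h + 0.5 * self.a * h * h

    def update(self, z):
        """Fold in a measurement z taken one sample period after the last update."""
        dt = self.dt
        x_pred = self.x + self.v * dt + 0.5 * self.a * dt * dt
        v_pred = self.v + self.a * dt
        r = z - x_pred  # innovation: measurement minus prediction
        self.x = x_pred + self.alpha * r
        self.v = v_pred + (self.beta / dt) * r
        self.a = self.a + (2.0 * self.gamma / dt ** 2) * r
        return self.x
```

In a tracking loop with delay, `predict(horizon)` would be called with the expected latency so that the camera is driven toward where the target is expected to be rather than where it was last seen.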

3  Visual  Attention 

Eye movements are an important aspect of animate vision
[Ballard 1991]. Augmented hidden Markov models
(AHMMs) can learn graph structures representing
action sequences, adjacency and connectivity information
about objects, or even control structures of
observed algorithms [Rimey and Brown 1990, 1991a,b,c].
The  models  are  augmented  in  the  sense  that  they  can, 
at  run-time,  modify  their  output  based  on  scene  infor¬ 
mation.  Their  learning  capabilities  allow  them  to  adapt 
to  slowly-varying  scene  characteristics.  They  are  used  in 
a  generative  mode  to  output  learned  behavior.  Behavior 
can  be  learned  either  in  a  “where”  mode,  in  which  (say) 
specific  visual  locations  are  foveated  in  sequence,  or  in 
a  “what”  mode,  which  sequentially  foveates  a  desired 
sequence  of  features  in  the  image,  regardless  of  their  lo¬ 
cation.  The  training  can  come  from  an  instructor,  or 
from  another  program  that  is  producing  the  behavior  as 
the  result  of  a  cognitive  process.  The  latter  case  is  like 
learning  a  skill,  in  that  a  visuo-motor  skill  structure  can 
be  developed  that  does  not  have  to  repeat  the  reasoning 
process,  and  thus  can  run  more  efficiently. 

Using  the  augmentation  feature  of  the  hidden  Markov 
model,  the  output  of  the  “where”  and  “what”  systems 
each  can  be  fed  back  to  the  other  system,  in  order  to 
train  the  system  to  output  a  sequence  of  desired  fea¬ 
tures  or  objects  in  their  expected  locations  in  the  image. 
The system also incorporates the adaptive visual feedback
cues and a control scheme for verifying expectations
using  foveal  image  data. 

The  what-where-AHMM  has  a  “what”  part,  a  “where” 
part,  and  an  output  combiner.  The  what-part  contains 
two stages. The first stage is an AHMM whose output
symbols Ft, called what-symbols, are feature vectors intended
to describe an object or characteristics of objects.
Such feature vectors are assumed to have been computed
for each pixel in the (peripheral) image. The second
stage of the what-part performs a “what-to-where” mapping,
meaning that it maps a feature vector into the
set Gj of camera movement commands, called where-symbols,
that would cause the locations bearing that
feature vector to be foveated (centered in the image).

If each Gj contains exactly one element, the output sequence
will fixate the desired objects in the scene. Each
Gj  does  not  generally  contain  exactly  one  element,  so 
some  method  must  be  developed  to  select  among  the 
choices.  One  option  is  to  use  a  where-AHMM  to  help 
pick  among  the  choices.  In  fact,  the  what-AHMM  can 
be  made  to  help  the  where-AHMM  with  its  own  choices. 

The  where-part  contains  two  stages,  similar  to  those 
in the what-part. First it has an AHMM, which outputs
a sequence of where-symbols Ot. Secondly it has
a  “where-to-what”  mapping  that  determines  for  each 
where-symbol  Ot  the  location  in  the  current  (peripheral) 
image  it  corresponds  to,  and  outputs  the  set  of  feature 
vectors  in  that  local  area  of  the  image.  The  AHMM 
in  the  where-part  uses  as  feedback  a  sequence  of  sets  of 
where-symbols  Gj,  which  is  the  output  of  the  what-part. 


Finally,  the  output  combiner  determines  the  overall 
output  of  the  what-where-AHMM.  The  overall  output  at 
time  step  t  is  Zt,  a  where-symbol  (i.e.  a  camera  move¬ 
ment  command),  selected  as  the  element  of  the  set  Gj 
that has the smallest distance to the symbol Ot.

The  what-where-AHMM  operates  as  follows.  At  each 
time  step,  each  of  the  two  parts  produces  a  set  of  feed¬ 
back  symbols  that  reflects  its  own  preference  for  action. 
Each  then  updates  its  own  preferences  taking  the  other’s 
into  account,  and  then  generates  its  own  final  preference 
for  action  at  that  time  step.  The  set  of  final  preferences 
is  reduced  to  a  single  output  symbol  by  the  output  com¬ 
biner. 
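
The final selection step performed by the output combiner can be sketched as follows. This is a minimal illustration that assumes where-symbols are (pan, tilt) camera commands and that "distance" is Euclidean; both are assumptions, since the text above does not fix a representation or a distance measure, and the function name is hypothetical.

```python
import math


def combine_output(candidates, preferred):
    """Pick the where-symbol in `candidates` (the set Gj proposed by the
    what-part) that lies closest to `preferred` (the symbol Ot proposed
    by the where-part). Symbols are assumed to be (pan, tilt) tuples."""
    if not candidates:
        raise ValueError("the what-part proposed no where-symbols")
    return min(candidates, key=lambda g: math.dist(g, preferred))
```

For example, with candidates `[(0, 0), (10, 5), (3, 4)]` and a preferred command `(4, 4)`, the combiner selects `(3, 4)`.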

4  Motion  Detection  and  Recognition 

Motion  analysis  is  an  important  process  in  animate  vi¬ 
sion  systems.  In  particular,  a  system  that  interacts  visu¬ 
ally  with  a  dynamic  world  must  necessarily  make  deci¬ 
sions  about  moving  objects.  Recent  work  has  addressed 
the  problem  of  detecting  independently  moving  objects 
from  a  moving  sensor  [Nelson  1991],  and  identifying  the 
source  of  the  motion  once  it  has  been  detected  [Polana 
and  Nelson  1991]. 

Two  complementary  methods  for  the  detection  of 
moving  objects  by  a  moving  observer  have  been  devel¬ 
oped.  The  first  is  based  on  the  fact  that,  in  a  rigid 
environment,  the  projected  velocity  at  any  point  in  the 
image  is  constrained  to  lie  on  a  1-D  locus  in  velocity 
space  whose  parameters  depend  only  on  the  observer  mo¬ 
tion.  If  the  observer  motion  is  known,  an  independently 
moving  object  can,  in  principle,  be  detected  because  its 
projected  velocity  is  unlikely  to  fall  on  this  locus.  This 
principle  was  adapted  to  use  partial  information  about 
the  motion  field  and  observer  motion  that  can  be  rapidly 
computed  from  real  image  sequences  (e.g.  gradient  par¬ 
allel  image  flow).  The  method  was  implemented  to  run 
in  real  time  (approximately  1/15  second  latency  and  up¬ 
date)  using  parallel  pipelined  MAXVIDEO  image  pro¬ 
cessing  hardware.  The  second  method  utilizes  the  fact 
that  the  apparent  motion  of  a  fixed  point  due  to  smooth 
observer  motion  changes  slowly,  while  the  apparent  mo¬ 
tion  of  many  moving  objects  such  as  animals  or  maneu¬ 
vering  vehicles  may  change  rapidly.  The  motion  field  at 
a  given  time  can  thus  be  used  to  place  constraints  on  the 
future  motion  field  which,  if  violated,  indicate  the  pres¬ 
ence  of  an  autonomously  maneuvering  object.  In  both 
cases,  the  qualitative  nature  of  the  constraints  allows  the 
methods  to  be  used  with  the  inexact  motion  information 
typically available from real image sequences. The second
method was also implemented to run in real time on our
parallel processing testbed.
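
The first constraint can be illustrated for the simplest case of a purely translating observer. For a camera translating with velocity (Tx, Ty, Tz) in a rigid scene, the image velocity at (x, y) is ((x Tz - f Tx)/Z, (y Tz - f Ty)/Z), so it must lie on a ray whose direction depends only on the observer motion, with the unknown depth Z setting its length. The sketch below flags pixels whose flow leaves that ray. It is a simplified illustration, not the real-time implementation described above: it assumes full flow vectors, no rotation, image coordinates centered on the principal point, and a hypothetical tolerance `tol`.

```python
import numpy as np


def independent_motion_mask(flow, translation, focal_length, tol=0.2):
    """Flag pixels whose flow violates the rigid-scene constraint.

    flow: (H, W, 2) array of image velocities (x component, y component).
    translation: (Tx, Ty, Tz) observer velocity; rotation is assumed zero.
    The principal point is assumed to be the image center.
    """
    h, w, _ = flow.shape
    ys, xs = np.mgrid[0:h, 0:w].astype(float)
    xs -= (w - 1) / 2.0
    ys -= (h - 1) / 2.0
    tx, ty, tz = translation
    # Direction of the permitted 1-D locus at each pixel.
    dx = xs * tz - focal_length * tx
    dy = ys * tz - focal_length * ty
    norm = np.hypot(dx, dy)
    norm[norm == 0] = 1.0  # no constraint direction at the focus of expansion
    dx /= norm
    dy /= norm
    u, v = flow[..., 0], flow[..., 1]
    along = u * dx + v * dy    # component along the locus
    across = u * dy - v * dx   # component perpendicular to the locus
    speed = np.hypot(u, v)
    # A rigid point moves along the ray in the positive (Z > 0) direction only.
    return (np.abs(across) > tol * speed) | (along < -tol * speed)
```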

The  goal  of  the  ongoing  research  in  motion  recognition 
is  to  demonstrate  that  robustly  computable  motion  fea¬ 
tures  can  be  used  directly  as  a  means  of  recognition.  This 
contrasts with the more traditional approach to motion
analysis, which has instead emphasized reconstruction.
Specifically, the goal is to design, implement,
and  test  a  general  framework  for  recognizing  both  dis¬ 
tributed  motion  activity  on  the  basis  of  temporal  texture, 
and  complexly  moving,  compact  objects  on  the  basis  of 
their action. This recognition approach contrasts with
the reconstructive approach that has typified most prior
work  on  motion.  The  underlying  motivation  is  the  ob¬ 
servation  that,  for  objects  that  typically  move,  it  is  fre¬ 
quently  easier  to  identify  them  when  they  are  moving 
than when they are stationary. Specifically, in the case
of temporal texture, we extract statistical spatial and
temporal features from approximations to the motion
field and use techniques analogous to those developed
for gray-scale texture analysis to classify regional activities
such as windblown trees, ripples on water, or chaotic
fluid  flow,  that  are  characterized  by  complex,  non-rigid 
motion.  For  action  identification,  the  spatial  and  tempo¬ 
ral  arrangement  of  motion  features  can  be  used  in  con¬ 
junction  with  simple  geometric  image  analysis  to  iden¬ 
tify  complexly  moving  objects  such  as  machinery  and 
locomoting  people  and  animals.  This  work  has  practi¬ 
cal  applications  in  monitoring  and  surveillance,  and  as  a 
component  of  a  sophisticated  visual  system. 

Current  research  has  concentrated  on  temporal  tex¬ 
ture  identification.  Temporal  textures  represent  move¬ 
ments  that  are  characterized  by  some  degree  of  spatial 
invariance.  Such  motion  is  commonplace  in  the  natu¬ 
ral  world.  Examples  include  ripples  on  a  pool,  leaves  or 
grass  blowing  in  the  wind,  turbulence  in  cloud  patterns, 
and  the  behavior  of  a  flock  of  birds  or  a  crowd  of  people. 
The  basic  approach  rests  on  the  computation  of  robust 
motion  features.  Some  examples  are  differential  quanti¬ 
ties  such  as  divergence  and  curl,  motion  discontinuities 
of  various  types,  and  temporal  frequency  information. 
The  variety  of  possible  features  is  much  greater  than  for 
a  gray-scale  image,  both  because  the  motion  field  is  a 
vector  rather  than  a  scalar  array,  and  because  temporal 
as  well  as  spatial  relations  can  be  considered. 

The  main  challenge  (as  in  gray-level  texture  classifi¬ 
cation)  is  to  come  up  with  a  generally  useful  set  of  re¬ 
gional  motion  features  that  can  be  used  for  recognition. 
Various invariances, e.g. to scale, rotation, or temporal
compression, may or may not be desirable, depending on
the  application.  We  have  acquired  and  digitized  a  mo¬ 
tion  database  containing  a  number  of  real-world  tempo¬ 
ral  textures,  and  identified  a  number  of  useful  features. 
Most  of  these  are  based  on  the  normal  flow,  which  is 
easier  to  obtain  than  the  full  motion  field. 

The  notion  of  temporal  texture  is  analogous  to  that 
employed  in  statistical  methods  of  gray-level  texture 
classification:  we  want  to  compute  temporal  features 
that  can  be  used  to  characterize  a  viewing  region  filled 
by  the  texture.  These  features  should  at  least  be  invari¬ 
ant  under  translation  of  the  texture  with  respect  to  the 
viewing  region.  In  some  cases,  invariance  with  respect  to 
rotation, illumination, and spatial and temporal scaling
is also important. We have investigated a number of features
based on local properties of the normal flow field.
In general, all these features involve spatial and sometimes
temporal averaging of local properties. Useful features
include spatial and temporal uniformity measures,
local curl and divergence, and co-occurrence statistics on
the direction histogram. A small set of these allows accurate
classification of a number of temporal textures.
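
To make the role of normal flow concrete, the following sketch estimates it from two frames using the brightness-constancy relation (the flow component along the intensity gradient is -I_t / |grad I|) and computes one simple regional statistic. The statistic shown, the mean resultant length of the flow directions, is only an assumed stand-in for a directional uniformity measure; the function names and the threshold `min_gradient` are likewise illustrative and are not taken from the work described above.

```python
import numpy as np


def normal_flow(frame0, frame1, min_gradient=1e-3):
    """Estimate normal flow between two gray-level frames.

    Returns (vx, vy, valid); flow is zero where the spatial gradient is
    too weak for the estimate to be meaningful.
    """
    f0 = np.asarray(frame0, dtype=float)
    f1 = np.asarray(frame1, dtype=float)
    gy, gx = np.gradient(0.5 * (f0 + f1))  # derivatives along rows (y), columns (x)
    gt = f1 - f0                           # temporal derivative
    mag2 = gx ** 2 + gy ** 2
    valid = mag2 > min_gradient ** 2
    scale = np.where(valid, -gt / np.where(valid, mag2, 1.0), 0.0)
    return scale * gx, scale * gy, valid


def direction_uniformity(vx, vy, valid):
    """Mean resultant length of flow directions: 1 means every vector points
    the same way, values near 0 mean directions are spread evenly."""
    moving = valid & (np.hypot(vx, vy) > 0)
    if not moving.any():
        return 0.0
    theta = np.arctan2(vy[moving], vx[moving])
    return float(np.abs(np.mean(np.exp(1j * theta))))
```

A region filled by a rigidly translating pattern would score near 1 on this measure, while turbulent motion would score much lower.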


5  Color  for  Robust  Recognition  and 
Location 

Color  histograms  can  be  robust  to  object  obscuration, 
pose,  or  configuration  (for  nonrigid  objects)  [Swain  1990; 
Swain  et  al.  1989,  1990a,b].  The  algorithms  described 
here were implemented and run at approximately 5 Hz
to  recognize  objects  reliably  from  a  catalog  of  75  rigid 
and  nonrigid  objects.  Given  a  pair  of  histograms,  I  and 
M,  each  containing  n  buckets,  the  intersection  of  the 
histograms is defined to be

$$\sum_{j=1}^{n} \min(I_j, M_j).$$

The result of the intersection of a model histogram with
an image histogram is the number of pixels from the
model that have corresponding pixels of the same color
in the image. To obtain a fractional match value between
0 and 1 the intersection is normalized by the number of
pixels in the model histogram. The match value is then

$$\frac{\sum_{j=1}^{n} \min(I_j, M_j)}{\sum_{j=1}^{n} M_j}.$$

This  basic  matching  algorithm  runs  in  time  linear  in  the 
number  of  items  in  the  database.  Ingenious  use  of  se¬ 
lected  features  and  a  pre-indexing  of  the  catalog  of  ob¬ 
jects  allows  the  matching  to  take  place  in  constant  time 
(for  up  to  75  objects). 
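
The normalized match value and the linear-time catalog search can be sketched in a few lines. The function names and the dictionary-based catalog are illustrative assumptions; the constant-time indexed variant mentioned above is not shown.

```python
import numpy as np


def histogram_intersection(image_hist, model_hist):
    """Normalized intersection: sum_j min(I_j, M_j) / sum_j M_j."""
    image_hist = np.asarray(image_hist, dtype=float)
    model_hist = np.asarray(model_hist, dtype=float)
    return float(np.minimum(image_hist, model_hist).sum() / model_hist.sum())


def best_match(image_hist, catalog):
    """Linear scan over a catalog mapping object names to model histograms."""
    return max(catalog,
               key=lambda name: histogram_intersection(image_hist, catalog[name]))
```

With an image histogram `[4, 0, 6, 10]` and a model histogram `[2, 3, 6, 1]`, the bucket-wise minima are `[2, 0, 6, 1]`, so the match value is 9/12 = 0.75. Adding 100 more background pixels to the last image bucket leaves the value at 0.75, because the model's count already limits that bucket.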

The  normalized  histogram  intersection  match  value  is 
a  metric  in  feature-space,  and  so  can  be  used  for  nearest 
neighbor  classification.  It  has  the  useful  property  that  it 
is  not  reduced  by  distracting  pixels  in  the  background. 
The  histogram  intersection  match  value  is  only  increased 
by  a  pixel  in  the  background  if 

•  the  pixel  has  the  same  color  as  one  of  the  colors  in 
the  model,  and 

•  the  number  of  pixels  of  that  color  in  the  object  is 
less  than  the  number  of  pixels  of  that  color  in  the 
model. 

Object  location  is  possible  with  color  histograms  if 
we  know  their  approximate  sizes.  A  histogram  back- 
projection  algorithm  answers  the  question  “Where  are 
the  colors  in  the  image  that  belong  to  the  object  being 
looked  for  (the  target)?”  The  algorithm  deemphasizes 
colors  that  appear  in  other  objects  having  different  his¬ 
tograms,  so  that  they  are  less  likely  to  distract  the  search 
mechanism.  Experiments  show  that  the  technique  works 
for  objects  in  cluttered  scenes  under  realistic  conditions. 
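
The overview above does not spell out the backprojection algorithm, so the following is only a sketch of one common formulation, assumed here for illustration. Each color bucket is weighted by the ratio of model count to image count (capped at 1), which deemphasizes colors that are plentiful elsewhere in the scene. Every pixel is replaced by the weight of its color, and the window with the largest summed weight is reported. The square window, the integer `bin_image` of per-pixel bucket indices, and the function name are assumptions.

```python
import numpy as np


def backproject(bin_image, model_hist, window):
    """Locate a target by ratio-histogram backprojection (illustrative).

    bin_image: (H, W) integer array of color-bucket indices, all smaller
               than len(model_hist).
    model_hist: bucket counts for the target object.
    window: side length of the square search window, roughly the target size.
    Returns the (row, col) of the best window's top-left corner and the
    per-pixel weight image.
    """
    bin_image = np.asarray(bin_image)
    model_hist = np.asarray(model_hist, dtype=float)
    image_hist = np.bincount(bin_image.ravel(),
                             minlength=model_hist.size).astype(float)
    ratio = np.zeros_like(model_hist)
    present = image_hist > 0
    ratio[present] = np.minimum(model_hist[present] / image_hist[present], 1.0)
    weights = ratio[bin_image]
    # Sum the weights over every window position using an integral image.
    integral = np.pad(weights, ((1, 0), (1, 0))).cumsum(axis=0).cumsum(axis=1)
    k = window
    sums = (integral[k:, k:] - integral[:-k, k:]
            - integral[k:, :-k] + integral[:-k, :-k])
    row, col = np.unravel_index(np.argmax(sums), sums.shape)
    return (int(row), int(col)), weights
```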

6  Projective  Invariants 

Although  geometric  features  vary  with  the  imaging  ge¬ 
ometry  and  hence  viewpoint,  invariant  measures  (such 
as  moments),  features  (such  as  conics),  and  representa¬ 
tions  (such  as  invariant  representations  of  planar  curves) 
can  be  derived  [Forsyth  et  al.  1990a,b;  Brown  1991a,b]. 
Recent  work  in  geometric  (affine  and  projective)  invari¬ 
ants  shows  promise  of  discovering  feature-extraction  pro¬ 
cesses  for  planar  and  three-dimensional  features  that 
yield results not affected by perspective distortion. This
work draws on the substantial tradition and techniques
of projective geometry. This approach has certain technical
difficulties, but is promising for model-based vision
since it leads to pose-independent recognition. A
DARPA-sponsored workshop (jointly funded by the European
ESPRIT project) brought together a number of
researchers from Europe and the US; the proceedings
will appear as a book published by MIT Press.
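
As a reminder of what a projective invariant looks like in the simplest setting, the sketch below uses the classical cross-ratio of four collinear points, which is unchanged by any one-dimensional projective map x -> (px + q)/(rx + s). This is a textbook example chosen for illustration, not one of the invariants developed in the cited work; the function names and the sample map are assumptions.

```python
from fractions import Fraction


def cross_ratio(a, b, c, d):
    """Cross-ratio (a, b; c, d) of four collinear points given by one coordinate."""
    return ((c - a) * (d - b)) / ((c - b) * (d - a))


def projective_map(x, h):
    """Apply the 1-D projective map with matrix h = ((p, q), (r, s))."""
    (p, q), (r, s) = h
    return (p * x + q) / (r * x + s)


points = [Fraction(0), Fraction(1), Fraction(2), Fraction(4)]
h = ((2, 1), (1, 3))  # hypothetical map standing in for a change of viewpoint
mapped = [projective_map(x, h) for x in points]
```

Here `cross_ratio(*points)` and `cross_ratio(*mapped)` both equal 3/2, even though the individual coordinates and their spacings change under the map.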

7  Multi-model  Parallel  Programming 

We  have  been  studying  and  using  parallel  computation 
for many years [Crowl thesis, Quiroz thesis]. Our experience
indicates that animate and integrated vision systems
are inherently parallel and require multiple models
of parallelism [Weems et al. 1991]. A system that integrates
perception, planning, and acting will utilize a
wide  variety  of  data  structures,  algorithms,  and  models 
of  processes  and  communication.  One  way  to  achieve 
this  multiplicity  is  through  heterogeneous  hardware  sys¬ 
tems.  However,  it  is  often  desirable  to  include  many 
intimately-cooperating  models  of  parallel  computation 
(say  programs  written  in  different  languages)  in  one  com¬ 
puter. 

The  functional  demands  of  integrated  parallel  com¬ 
puter  vision  systems  within  one  multiprocessor  imply  the 
following  technical  requirements. 

•  Support is needed to facilitate the integration of
modules  developed  by  different  people  at  different 
times  in  different  languages  under  different  models 
of  parallelism. 

-  Easily  understood,  standard  interaction  mech¬ 
anisms,  such  as  shared  memory  and  procedure 
call,  are  needed.  On  the  other  hand,  one  should 
have  some  choice  in  procedure  calls  —  highly 
protected  RPC-like  mechanisms  as  well  as  effi¬ 
cient  and  fast  calls  should  be  supported. 

-  Translation  between  model  semantics  is 
needed. 

•  Support  is  needed  to  allow  the  appropriate  parallel 
computation  models  to  function  and  communicate 
effectively. 

-  A  range  of  protection  mechanisms  is  needed, 
tailored  to  the  degree  of  trust  between  modules, 
allowing  different  levels  of  speed  and  safety  to 
coexist  within  an  application. 

-  Shared  software  modules  and  active  data  struc¬ 
tures  should  be  easily  created,  incorporated, 
and  removed  dynamically. 

-  The kernel and the user should share data, allowing
for efficient user-level interactions with
the system (for locks, interrupts, etc.) and system
performance statistics gathered by the kernel.
This facility is useful for time-critical applications
as well.

-  There  should  be  extensive  scheduling  capability 
in  user  space,  both  for  the  sake  of  performance 
and  to  allow  applications  to  implement  differ¬ 
ent  kinds  of  processes. 


An  example  of  an  integrated  multi-model  sensori¬ 
motor  system  is  the  Rochester  Checkers  player  [Marsh 
et  al.  1992,  Karlsson  1991].  Four  models  of  parallelism 
are  used,  each  embodied  in  a  corresponding  language  or 
library package. Uthread is a Presto-like parallel programming
package that we call from C++ programs.
Lynx is a message-passing language. The Uniform System
is a data-parallel programming tool furnished by
BBN-ACI. Multilisp is parallel LISP. The Psyche operating
system developed at Rochester provides the efficient
multi-model substrate.

In  our  experience  with  Checkers,  Psyche  proved  to 
be  a  very  practical  substrate.  For  example,  the  Lynx 
checkers-playing program and the Uniform System image-
analysis  library  were  resurrected  after  several  years  of 
disuse  and  plugged  in  essentially  unmodified.  The  serial 
robot  communications  interface  was  written  in  1988  as 
an  exercise  and  also  was  plugged  in  unmodified.  The 
vision,  move  planning,  board  module,  and  move  recog¬ 
nition  modules,  as  well  as  necessary  Psyche  support  for 
the  particular  models  we  used,  were  developed  in  par¬ 
allel.  Coding  these  modules  was  a  part-time  activity 
extending  over  several  weeks.  Integration  was  a  full¬ 
time  activity  that  only  took  a  few  days.  At  integra¬ 
tion  time,  we  made  (and  changed)  many  decisions  about 
which  modules  should  communicate  directly  with  each 
other  and  which  should  use  the  indirect  shared  memory 
mechanism. 

The Checkers program responds immediately to its
competition. Generally, another aspect of parallel systems
software for animate vision is that it should be real-time;
that is, it should compute fast enough to survive.
We feel that the current generation of real-time operating
systems has a fatal flaw: the requirement that
computational  demands  be  known  in  advance.  Given 
such  knowledge,  current  real-time  systems  allow  users  to 
meet  the  anticipated  deadlines  and  demands.  But  a  true 
animate  system  will  not  have  the  luxury  of  a  predictable 
universe.  It  will  have  to  engage  in  “satisficing”  behavior 
that  chooses  what  to  do,  how  far  to  pursue  goals  that 
are  not  working  out,  when  to  preempt  current  tasks  to 
deal  with  emergencies,  etc.  This  mixture  of  real-time 
AI, resource allocation decision making, and operating
system  design  seems  vital  to  the  practical  deployment  of 
animate  systems. 

8  Vision  and  Task  Learning 

Learning  systems  are  being  built  using  several  different 
techniques  [Jain  thesis,  Simard  thesis,  Whitehead  1991, 
Wixson 1991, Rimey and Brown 1991]. Current reinforcement
learning techniques are only practical when
the amount of state to be represented is small (less than
100 bits, say). One way to reduce the burden of representation
to just that essential for the task is to use
markers, temporary variables that record partial computational
results. This notion of markers was introduced
as a general method of object-centered computation. The
notion was that markers provided a local context to resolve
reference ambiguity. In routine activity, long causal
chains  are  not  necessary.  It  turns  out  that  the  fixation 
point  strategy  can  be  thought  of  as  a  kind  of  marker  that 
has the right kind of transfer for learning many tasks.

A simulated block-stacking system at Rochester provides
an example [Whitehead et al. 1990a,b, 1991;
Whitehead 1991a,b,c; Ballard and Whitehead 1992]. On
each  trial,  the  system  is  presented  with  a  pile  of  colored 
blocks.  A  pile  consists  of  any  number  of  blocks  arbitrar¬ 
ily  arranged.  Each  block  is  uniformly  colored  either  red, 
green,  or  blue.  The  system  can  manipulate  the  pile  by 
picking  and  placing  objects.  When  the  system  arranges 
the  blocks  into  a  successful  configuration,  it  receives  a 
positive  reward  and  the  trial  ends.  For  example,  one  ex¬ 
tremely  simple  block  stacking  task  is  for  the  system  to 
learn  to  pick  up  a  green  block.  In  this  case,  the  success¬ 
ful  configurations  consist  just  of  those  states  where  the 
system  is  holding  a  green  object.  The  system  learns  to 
arrange  arbitrary  configurations  of  blocks  into  success¬ 
ful  configurations.  The  key  point  here  is  that  the  marker 
encoding  obviates  the  need  for  explicit  coordinates. 

We  are  also  exploring  the  paradigm  of  visual  object 
search,  which  uses  prior  knowledge  about  the  likelihood 
of  finding  objects  in  the  proximity  of  other  objects  and 
incorporates  a  learning  component  [Wixson  and  Ballard 
1991; Wixson 1991].

9  Vision  and  Planning 

We  are  currently  considering  the  “Where  to  look  next?” 
problem,  and  are  applying  Bayesian  and  planning  tech¬ 
niques  to  the  sequential  decision-making  aspects  of  cog¬ 
nitively  directed  visual  activity.  The  work  by  Hartman 
[Hartman  1990,  1991a,b]  addresses  the  difficult  “cost  of 
planning”  problem  and  will  be  useful  in  a  practical  sys¬ 
tem  in  which  difficult  planning  problems  could  lead  to 
dysfunctional  use  of  scarce  resources  of  computation  and 
time.  Previous  work  in  integrating  low-level  segmenta¬ 
tion  with  high-level  object  recognition  used  a  Bayesian 
approach  and  Markov  random  fields  [Chou  et  al  1991, 
Swain  1990]. 

10  Related  Theses,  1990-1991 

Crowl,  L.A.,  “Architectural  adaptability  in  parallel  pro¬ 
gramming,”  Ph.D.  Thesis  (and  TR  381),  Computer  Sci¬ 
ence  Dept.  (Advisor:  T.J.  LeBlanc),  May  1991. 

To  create  a  parallel  program,  programmers  must  de¬ 
cide  what  parallelism  to  exploit,  and  choose  the  asso¬ 
ciated  data  distribution  and  communication.  Since  a 
typical  algorithm  has  much  more  potential  parallelism 
than  any  single  architecture  can  effectively  exploit,  pro¬ 
grammers  usually  express  only  the  exploitation  of  par¬ 
allelism  appropriate  to  a  single  machine.  Unfortunately, 
parallel  architectures  vary  widely.  A  program  that  exe¬ 
cutes  efficiently  on  one  architecture  may  execute  badly, 
if  at  all,  on  another  architecture.  To  port  such  a  pro¬ 
gram  to  a  new  architecture,  we  must  rewrite  the  pro¬ 
gram  to  remove  any  ineffective  parallelism,  to  introduce 
any  parallelism  appropriate  for  the  new  machine,  to  re¬ 
distribute  data  and  processing,  and  to  alter  the  form  of 
communication.  Architectural  adaptability  is  the  ease 
with  which  programmers  can  tune  or  port  a  program 
to a different architecture. The thesis of this dissertation
is that control abstraction is fundamental to architectural
adaptability for parallel programs. With control
abstraction, we can define and use a rich variety
of  control  constructs  to  represent  an  algorithm’s  poten¬ 
tial  parallelism.  Since  control  abstraction  separates  the 
definition  of  a  construct  from  its  implementation,  a  con¬ 
struct  may  have  several  different  implementations,  each 
providing  different  exploitations  of  parallelism.  By  se¬ 
lecting  an  implementation  for  each  use  of  a  control  con¬ 
struct  with  annotations,  we  can  vary  the  parallelism  we 
choose  to  exploit  without  otherwise  changing  the  source 
code.  We  present  Matroshka,  a  programming  model  that 
supports  architectural  adaptability  in  parallel  programs 
through  object-based  data  abstraction  and  closure-based 
control  abstraction.  Using  the  model,  we  develop  several 
working  example  programs,  and  show  that  the  example 
programs  adapt  well  to  different  architectures.  We  also 
outline  a  programming  method  based  on  abstraction.  To 
show  the  implementation  feasibility  of  our  approach,  we 
describe  a  prototype  language  based  on  Matroshka,  de¬ 
scribe  its  implementation,  and  compare  the  performance 
of  the  prototype  with  existing  programs. 

Hartman,  L.,  “Decision  theory  and  the  cost  of  plan¬ 
ning,”  Ph.D.  Thesis  (and  TR  355),  Computer  Science 
Dept.  (Advisor:  D.H.  Ballard),  September  1990. 

This  thesis  shows  how  it  is  possible  for  a  planner  to 
make  decisions  that  are  sensitive  to  the  computational 
resources  that  it  expends.  The  main  feature  of  the 
approach  is  to  ignore  the  distinction  between  planning 
and  executing  and  interpret  planning  procedures  as  ac¬ 
tions  with  uncertain  outcomes.  A  planner  can  then  use 
standard  decision  theoretic  techniques  to  select  which 
among  a  set  of  alternative  procedures  is  a  better  gam¬ 
ble.  The  hypotheses  that  underlie  this  work  are  that 
an  autonomous  agent  must  monitor  its  resource  expen¬ 
diture  in  order  to  be  successful  and  that  computation  is 
an  important  and  valuable  resource.  The  main  points 
of  this  thesis  are:  (1)  it  is  possible  for  planners  to  make 
local  inexpensive  decisions  that  account  for  essentially 
their  entire  resource  expenditure;  (2)  there  are  limita¬ 
tions  to  what  a  planner  is  able  to  infer  about  optimal 
strategies;  and  (3)  it  is  possible  for  planners  to  be  sensi¬ 
tive  to  the  statistical  properties  of  their  problem-solving 
environments. 

Jain,  S.,  “Learning  in  the  presence  of  additional  in¬ 
formation  and  inaccurate  information,”  Ph.D.  Thesis 
(and  TR  357),  Computer  Science  Dept.  (Advisor:  M.A. 
Fulk),  September  1990. 

Inductive  inference  machines  (IIMs)  model  language 
and  scientific  learning.  In  the  classical  model,  a  ma¬ 
chine  attempts  to  construct  an  explanation  about  a  phe¬ 
nomenon  as  it  is  receiving  data  about  that  phenomenon. 
The  machine  is  said  to  be  successful  if  it  ultimately  suc¬ 
ceeds  in  explaining  the  phenomenon.  This  is  a  naive 
model  of  science.  For  one  thing,  a  scientist  has  more  in¬ 
formation  available  than  just  the  result  of  experiments. 
For  example,  a  scientist  may  have  some  knowledge  about 
the  complexity  of  the  phenomenon  he  (she)  is  trying  to 
learn.  For  another,  the  result  of  the  scientist’s  investi¬ 
gation  need  not  be  the  final  theory.  Finally,  a  scientist 
may  already  have  some  approximate  explanation  of  the 
phenomenon.  The  study  of  such  additional  information 
constitutes the first part of this thesis. In the real world
our  input  is  rarely  free  of  error.  Inputs  usually  suffer 
from  noise  and  missing  data.  The  study  of  different  no¬ 
tions  of  such  inaccuracies  in  the  input  is  the  focus  of  the 
second  part  of  this  thesis. 

Quiroz González, C.A., “Systematic detection of parallelism
in ordinary programs,” Ph.D. Thesis (and TR 351),
Computer Science Dept. (Advisor: D. Baldwin), May
1991.

This  dissertation  discusses  a  general  model  for  com¬ 
pilers  that  take  imperative  code  written  for  sequential 
machines  (ordinary  code)  and  detect  the  parallelism  in 
that  code  that  is  compatible  with  the  semantics  of  the 
underlying  programming  language.  This  model  is  based 
on  the  idea  of  separating  the  concerns  of  parallelism  de¬ 
tection  and  parallelism  exploitation.  This  separation  is 
made  possible  by  having  the  detection  component  pro¬ 
vide  an  explicit  representation  of  the  parallelism  avail¬ 
able  in  the  original  code.  This  explicitly  parallel  rep¬ 
resentation  is  based  on  a  formalization  of  the  notion 
of  permissible  execution  sequences  for  a  given  mass  of 
code.  The  model  discussed  here  prescribes  the  structure 
of  the  parallelism  detector.  This  structure  depends  on 
(1)  recognizing  a  hierarchical  structure  on  a  graph  rep¬ 
resentation  of  the  program,  and  (2)  separately  encoding 
parallelization  conditions  and  effects.  Opportunities  for 
parallelization  can  then  be  discovered  by  traversing  the 
hierarchical  structure  from  the  bottom  up.  During  this 
traversal,  progressively  larger  parts  of  the  program  are 
compared  against  the  independently  encoded  conditions, 
and  transformed  when  the  conditions  are  satisfied.  The 
hierarchy  guarantees  that  the  results  of  transforming  a 
piece  of  a  program  propagate  in  time  to  affect  the  possi¬ 
ble  parallelization  of  larger  pieces.  Although  some  of  the 
algorithms  used  have  exponential  worst  cases  for  general 
graphs,  their  observed  behavior  on  real  flow  graphs  is  no 
worse  than  quadratic  on  the  size  of  the  original  program. 

Simard,  P.Y.,  “Learning  state  space  dynamics  in  re¬ 
current  networks,”  Ph.D.  Thesis  (and  TR  383),  Com¬ 
puter  Science  Dept.  (Advisor:  D.H.  Ballard),  June  1991. 

Fully  recurrent  (asymmetrical)  networks  can  be  used 
to  learn  temporal  trajectories.  The  network  is  un¬ 
folded  in  time,  and  backpropagation  is  used  to  train  the 
weights.  The  presence  of  recurrent  connections  creates 
internal  states  in  the  system  which  vary  as  a  function 
of  time.  The  resulting  dynamics  can  provide  interest¬ 
ing  additional  computing  power  but  learning  is  made 
more  difficult  by  the  existence  of  internal  memories.  This 
study  first  exhibits  the  properties  of  recurrent  networks 
in  terms  of  convergence  when  the  internal  states  of  the 
system  are  unknown.  A  new  energy  functional  is  pro¬ 
vided  to  change  the  weights  of  the  units  in  order  to 
control  the  stability  of  the  fixed  points  of  the  network’s 
dynamics.  The  power  of  the  resultant  algorithm  is  il¬ 
lustrated  with  the  simulation  of  a  content  addressable 
memory.  Next,  the  more  general  case  of  time  trajecto¬ 
ries  on  a  recurrent  network  is  studied.  An  application 
is  proposed  in  which  trajectories  are  generated  to  draw 
letters  as  a  function  of  an  input.  In  another  application 
of  recurrent  systems,  a  neural  network  exhibits  certain 
temporal properties observed in human callosally sectioned
brains. Finally, the proposed algorithm for stabilizing
dynamics around fixed points is extended to one
for stabilizing dynamics around time trajectories. Its effects
are illustrated on a network that generates Lissajous
curves.

Swain, M.J., “Color indexing,” Ph.D. Thesis (and TR
360), Computer Science Dept. (Advisor: D.H. Ballard),
November 1990.

Computer  vision  is  embracing  a  new  research  focus 
in  which  the  aim  is  to  develop  visual  skills  for  robots 
that  allow  them  to  interact  with  a  dynamic,  realistic 
environment.  To  achieve  this  aim,  new  kinds  of  vision 
algorithms  need  to  be  developed  that  run  in  real  time 
and  subserve  the  robot’s  goals.  Two  fundamental  goals 
are  determining  the  identity  of  an  object  with  a  known 
location  and  determining  the  location  of  a  known  ob¬ 
ject.  Color  can  be  successfully  used  for  both  tasks.  This 
dissertation  demonstrates  that  color  histograms  of  mul¬ 
ticolored  objects  provide  a  robust,  efficient  cue  for  in¬ 
dexing  into  a  large  database  of  models.  It  shows  that 
color  histograms  are  stable  object  representations  in  the 
presence  of  occlusion  and  over  change  in  view,  and  that 
they  can  differentiate  among  a  large  number  of  objects. 
For  solving  the  identification  problem,  it  introduces  a 
technique  called  histogram  intersection,  which  matches 
model  and  image  histograms,  and  a  fast  incremental  ver¬ 
sion  of  histogram  intersection  which  allows  real-time  in¬ 
dexing  into  a  large  database  of  stored  models.  It  demon¬ 
strates  techniques  for  dealing  with  crowded  scenes  and 
with  models  with  similar  color  signatures.  For  solving 
the  location  problem  it  introduces  an  algorithm  called 
histogram  backprojection,  which  performs  this  task  effi¬ 
ciently  in  crowded  scenes. 
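As an illustration of the matching step described in this abstract, the following is a minimal sketch of histogram intersection, assuming the commonly cited normalized form (the sum of bin-wise minima divided by the model's pixel count). The function names, the coarse RGB binning, and the toy database are illustrative assumptions, not details taken from the thesis.

```python
from collections import Counter

def color_histogram(pixels, bins_per_channel=4):
    """Count pixels per coarse RGB bin; pixels are (r, g, b) tuples in 0..255."""
    step = 256 // bins_per_channel
    return Counter((r // step, g // step, b // step) for r, g, b in pixels)

def histogram_intersection(image_hist, model_hist):
    """Fraction of the model's pixels matched by same-colored image pixels."""
    model_total = sum(model_hist.values())
    if model_total == 0:
        raise ValueError("model histogram is empty")
    # Background colors absent from the model contribute nothing to the sum.
    overlap = sum(min(image_hist.get(bin_, 0), count)
                  for bin_, count in model_hist.items())
    return overlap / model_total

def best_match(image_hist, model_db):
    """Name of the stored model with the highest intersection score."""
    return max(model_db,
               key=lambda name: histogram_intersection(image_hist, model_db[name]))
```

Because each bin contributes at most the model's own count, pixels from a cluttered background cannot inflate the score, which is one way to read the abstract's claim of robustness in crowded scenes.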




DARPA-Related  Publications 
University  of  Rochester 
Computer  Science  Department 

July  1990  -  June  1991 

Ballard, D.H., “Animate vision,” Artificial
Intelligence Journal 48, 57-86, 1991.

Ballard, D.H., “Animate vision: An
evolutionary step in computational vision,” J.
of the Inst. of Electronics, Information, and
Communication Engineers 74, 4, 343-348, April
1991.

Ballard, D.H., “Models of human intelligence:
A sub-symbolic approach,” invited manuscript,
Wolfson College Lecture Series, Wolfson
College, Oxford, January 1991.

Bandopadhay,  A.  and  D.H.  Ballard, 
“Egomotion  perception  using  visual  tracking,” 
Computational  Intelligence  7,  February  1991. 

Bogucz,  K.J.,  “Document  image  segmentation 
via  top-down  projection  profile  and  run-length 
analysis,”  Bellcore  Technical  Memorandum 
TR-ARH-018037,  August  1990. 

Brown, C.M., “An empirical investigation of
differential invariants,” DARPA/Esprit
Workshop, Reykjavik, March 1991.

Brown,  C.M.  and  R.C.  Nelson,  “Image 
understanding  research  at  Rochester,”  Proc., 
DARPA  Image  Understanding  Workshop, 
Pittsburgh,  PA,  September  1990. 

Coombs, D.J. and C.M. Brown, “Intelligent
gaze control in binocular vision,” Proc., IEEE
Workshop on Intelligent Control,
Philadelphia, PA, September 1990.

Coombs,  D.J.  and  C.M.  Brown,  “Cooperative 
gaze-holding  in  binocular  vision,”  IEEE 
Control  Systems,  June  1991. 

Coombs,  D.J.,  T.J.  Olson,  and  C.M.  Brown, 
“Gaze  control  and  segmentation,”  Proc., 
AAAI  Qualitative  Vision  Workshop,  Boston, 
MA,  August  1990;  1990-91  Computer  Science 
and  Engineering  Research  Review,  Computer 
Science  Dept.,  U.  Rochester,  7-11,  September 
1990. 

Forsyth, D., J.L. Mundy, A. Zisserman, and
C.M. Brown, “Invariance, a new framework for
vision,” Int’l. Conf. on Computer Vision,
Osaka, Japan, December 1990.

Forsyth, D., J.L. Mundy, A. Zisserman, and
C.M. Brown, “Projectively invariant
representations using implicit algebraic
curves,” Springer-Verlag Lecture Notes in
Computer Science, 427-436, 1990.

Hartman,  L.B.,  “A  decision  theoretic  approach 
to  controlling  the  cost  of  planning,”  Proc.,  3rd 
Int’l.  Workshop  on  Artificial  Intelligence  and 
Statistics,  Ft.  Lauderdale,  FL,  January  1991. 

Hartman,  L.B.,  “Decision  theory  and  the  cost 
of  planning,”  Ph.D.  Thesis  and  TR  355, 
Computer  Science  Dept.,  U.  Rochester, 
September  1990. 

Hartman,  L.B.,  “Uncertainty  and  the  cost  of 
planning,”  TR  372,  Computer  Science  Dept.,  U. 
Rochester,  February  1991. 

Karlsson, J., “A motion planner for Checkers,”
Proc., 6th Annual U. Buffalo Graduate Conf. on
Computer Science, Buffalo, NY, March 1991.

Marsh,  B.D.,  C.M.  Brown,  T.J.  LeBlanc,  M.L. 
Scott,  T.G.  Becker,  P.Ch.  Das,  J.  Karlsson,  and 
C.A.  Quiroz,  “The  Rochester  checkers  player: 
Multi-model  parallel  programming  for  animate 
vision,” TR 374, Computer Science Dept., U.
Rochester, June 1991.

Nelson, R.C., “Qualitative detection of motion
by a moving observer,” Proc., AAAI-90
Workshop on Qualitative Vision, Boston, MA,
July 1990; Proc., DARPA Image Understanding
Workshop, Pittsburgh, PA, September 1990;
Proc., IEEE Computer Society Conf. on
Computer Vision and Pattern Recognition,
Hawaii, June 1991.

Olson,  T.J.  and  D.J.  Coombs,  “Real-time 
vergence  control  for  binocular  robots,”  Proc., 
DARPA  Image  Understanding  Workshop, 
September  1990. 

Rimey,  R.D.  and  C.M.  Brown,  “HMMs  and 
vision:  Representing  structure  and  sequences 
for  active  vision  using  hidden  Markov  models,” 
TR  366,  Computer  Science  Dept.,  U. 
Rochester,  January  1991. 

Rimey,  R.D.  and  C.M.  Brown,  “Selective 
attention  as  sequential  behavior:  Modeling  eye 
movements  with  an  augmented  hidden  Markov 
model,”  Proc.,  DARPA  Image  Understanding 
Workshop,  September  1990. 




Rimey, R.D. and C.M. Brown, “Sequences,
structure, and active vision,” Proc., IEEE
Computer Society Conf. on Computer Vision
and Pattern Recognition, Hawaii, June 1991.

Shaffer, C.A., H. Samet, and R.C. Nelson,
“QUILT: A geographic information system
based on quadtrees,” Int’l. J. of Geographic Inf.
Systems 4, 2, 103-131, August 1990.

Simard, P.Y., “Learning state space dynamics
in recurrent networks,” TR 383 and Ph.D.
Thesis, Computer Science Dept., U. Rochester,
June 1991.

Simard,  P.Y.,  C.  Cortes,  and  B.  Victorri,  “An 
energy  function  for  changing  the  dynamics 
around  state  trajectories  in  recurrent 
networks,” presented, Snowbird Conf.,
Snowbird,  UT,  April  1991. 

Simard,  P.Y.  and  Y.  LeCun,  “Trajectory 
generation  using  the  ‘reverse’  TDNN 
architecture,” presented, Snowbird Conf.,
Snowbird, UT, April 1991.

Simard,  P.Y.  and  G.E.  Mailloux,  “Vector  field 
restoration  by  the  method  of  convex 
projections,” Computer Vision, Graphics, and
Image Processing 52, 3, December 1990.

Simard,  P.Y.,  J.P.  Raysz,  and  B.  Victorri, 
“Shaping  the  state  space  landscape  in 
recurrent networks,” presented, IEEE Conf. on
Advances in Neural Information Processing
Systems 3 (NIPS), 105-112, December 1990.

Swain,  M.J.,  “Color  Indexing,”  TR  360  and 
Ph.D.  Thesis,  Computer  Science  Dept.,  U. 
Rochester, November 1990.

Swain,  M.J.,  “Parameter  learning  for  Markov 
random  fields  with  highest  confidence  first 
estimation,”  TR  350,  Computer  Science  Dept., 
U.  Rochester,  August  1990. 

Swain,  M.J.  and  D.H.  Ballard,  “Color 
indexing,”  1990-91  Computer  Science  and 
Engineering  Research  Review,  Computer 
Science  Dept.,  U.  Rochester,  12-13,  September 
1990. 

Swain, M.J. and D.H. Ballard, “Indexing via
color histograms,” Proc., DARPA Image
Understanding Workshop, September 1990;
Proc., Int’l. Conf. on Computer Vision (ICCV
90), Kyoto, Japan, December 1990.


Swain, M.J., L.E. Wixson, and P.B. Chou,
“Efficient parallel estimation for Markov
random fields,” in M. Henrion, R. Schacter,
L.N. Kanal, and J. Lemmer (Eds.), Uncertainty
in Artificial Intelligence: Volume V (Proc., 5th
Workshop on Uncertainty and Artificial
Intelligence, Windsor, Ontario, August 1989).
New York: Elsevier Science Pub. Co., 407-422,
1990.

Whitehead,  S.D.,  “Complexity  and  cooperation 
in  reinforcement  learning,”  8th  Machine 
Learning  Workshop,  Evanston,  IL,  June  1991. 

Whitehead, S.D., “A framework for integrating
perception, action, and trial-and-error
learning,” Proc., AAAI Spring Symp. on
Integrated Architectures, Palo Alto, CA,
March 1991.

Whitehead,  S.D.,  “A  study  of  cooperative 
mechanisms  for  faster  reinforcement  learning,” 
TR  365,  Computer  Science  Dept.,  U. 
Rochester,  March  1991. 

Whitehead, S.D. and D.H. Ballard, “Active
perception and reinforcement learning,” Neural
Computation 2, 4, 409-419, 1990.

Whitehead, S.D. and D.H. Ballard, “Learning
to perceive and act by trial and error,” Machine
Learning 7, 1, 45-83, 1991.

Whitehead,  S.D.,  R.S.  Sutton,  and  D.H. 
Ballard,  “Recent  advances  in  reinforcement 
learning  and  their  implications  for  intelligent 
control,” Proc., IEEE Int’l. Symp. on Intelligent
Control,  1990. 

Wixson,  L.E.,  “Scaling  reinforcement  learning 
techniques via modularity,” Proc., 8th Int’l.
Workshop  on  Machine  Learning,  Evanston,  IL, 
June  1991. 

Wixson,  L.E.,  and  D.H.  Ballard,  “Real-time 
qualitative  detection  of  multi-colored  objects 
for  object  search,”  Proc.,  AAAI  Workshop  on 
Qualitative  Vision,  Boston,  MA,  July  1990. 

Yamauchi, B. and R.C. Nelson, “A behavior-
based architecture for robots using real-time
vision,” Proc., IEEE Int’l. Conf. on Robotics
and Automation, Sacramento, CA, April 1991.




Image  Understanding  Research  at  GE 


J.L.  Mundy* 

Box  8 

G.E.  Corporate  Research  and  Development 
Schenectady,  NY  12309 


Abstract 

Recent progress in image understanding research
at GE is described. The focus of GE’s
program in IU is on the application of geometric
constraint models and geometric invariants to
the recognition and representation of objects.
Progress on the development of various object-
oriented software environments to support IU
research and applications is also described.

1  Overview 

1.1  An  Emphasis  on  Geometry 

Image  understanding  research  and  applications  at  GE 
have  been  centered  around  geometric  descriptions  and 
geometric  reasoning  for  representing  and  recognizing  ob¬ 
jects.  We  have  developed  this  geometric  theme  over  the 
past  decade  with  emphasis  on  object  recognition  and  as¬ 
sociated  approaches  for  representing  objects  to  facilitate 
recognition. 

Our  current  work  is  developing  along  a  number  of  re¬ 
lated  geometric  themes. 

1.1.1  Constraint-Based  Modeling 

The  conventional  approaches  to  object  recognition 
have  been  based  on  fixed  polyhedral  models  or  models 
with  fixed  relationships  between  components.  We  have 
been  developing  a  system  for  representing  broad  cate¬ 
gories  of  objects  by  defining  an  object  in  terms  of  geo¬ 
metric  relations,  or  constraints.  Any  specific  object  of 
the  class  is  considered  to  be  a  solution  of  the  constraint 
system.  In  most  cases,  there  is  a  continuum  of  solutions 
so  a  specific  object  is  selected  which  satisfies  the  con¬ 
straints  as  well  as  optimality  criteria  derived  from  image 
features. 

We  have  applied  constraint-based  modeling  to  the 
problem  of  constructing  site  models  from  aerial  images 
in  support  of  image  simulation  for  mission  training  and 
rehearsal  and  site  models  to  support  image  intelligence 
applications,  a  goal  of  the  RADIUS  program. 

*Work at GE was supported in part by the DARPA
Strategic  Computing  Vision  Program  and  the  Air  Force  Of¬ 
fice  of  Scientific  Research  under  Contract  No.  F49620-89-C- 
0033  and  by  the  DARPA  Strategic  Computing  Vision  Pro¬ 
gram  under  Contract  No.  MDA972-91-C-0053. 


1.1.2  Geometric  Invariants 

A significant body of results on the construction of
geometric invariants to projective and affine transformations
has been developed over the past two years.
An invariant is any property of a geometric configuration
which is unaffected by viewpoint. In current model-
based vision approaches, it is necessary to test each object
in the library, since the specific properties of the object
can only be exploited for discrimination after model
pose is determined. When objects are described in terms
of invariants, the resulting properties can be used to index
large model libraries. We have shown that invariant
indexing leads to an object recognition cost which grows
slowly with the size of the model library.
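To make the notion of a viewpoint invariant concrete, here is a small sketch using the cross-ratio of four collinear points, a classical projective invariant. It is offered only as a textbook illustration; the function names and the particular transformation are assumptions and are not drawn from the GE system.

```python
from fractions import Fraction

def cross_ratio(a, b, c, d):
    """Cross-ratio (a, b; c, d) of four distinct collinear points (1D coordinates)."""
    return ((c - a) * (d - b)) / ((c - b) * (d - a))

def projective_map(x, p, q, r, s):
    """1D projective transformation x -> (p*x + q) / (r*x + s), with p*s - q*r != 0."""
    return (p * x + q) / (r * x + s)

# Four points on a model line, and the same points after a change of viewpoint.
model_points = [Fraction(0), Fraction(1), Fraction(3), Fraction(7)]
viewed_points = [projective_map(x, 2, 1, 1, 3) for x in model_points]

# The coordinates differ, but the cross-ratio is unchanged, so it can serve
# as a lookup key into a model library without first solving for pose.
model_key = cross_ratio(*model_points)
viewed_key = cross_ratio(*viewed_points)
```

Exact rational arithmetic is used here so that the two keys compare equal; with measured image data the comparison would have to tolerate noise.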

Invariants are also an effective representation for the
geometric shape of an object. We have implemented two
applications of invariants. The first uses a hybrid constraint
and invariant model for geographic features to
locate these features in synthetic aperture radar (SAR)
imagery [3]. The second application demonstrates the
extension  of  invariant  theory  to  multiple  views  and  ar¬ 
bitrary  3D  pointsets  [fi]. 

This latter application appears very promising for the
construction of site models without the need to determine
camera models or the 3D configuration
of the elements comprising the site. This work is the
result of a collaboration¹ with a number of researchers
interested in the theory and application of invariants to
image analysis and photogrammetry.

We have shown that models can be transferred to any
arbitrary view once as few as eight correspondences are
known. These results are based on the work of
Longuet-Higgins [4], who developed a linear algorithm
for structure from motion. The paper in this proceedings
shows that accurate model transfer can be achieved
from features derived by image segmentation.

1.1.3  Image  Registration 

Another algorithm arising from the basic Longuet-Higgins
result is an effective non-iterative algorithm for
image registration and camera calibration.² The application
considered in this work is the problem of determining
camera parameters and the structure of 3D
features from two or more unknown viewpoints.

¹The individuals involved in the collaboration are E. Barrett
and P. Payton of Lockheed Missiles and Space Division,
and M. Brill of SAIC, McLean VA.

²See a more detailed treatment, “Calibration of Cameras
Using the Essential Matrix,” by R. Hartley, in this
proceedings.

The original work of Longuet-Higgins assumes that
the internal camera parameters are given and the problem
is to determine the external pose parameters of the
cameras. Unfortunately, it is often difficult to obtain accurate
estimates of the internal parameters. Existing
camera calibration codes are often iterative, and there
is a non-negligible probability that the process will fail to
converge. Hartley is able to determine the two camera
focal lengths as well as the five relative external parameters.
The determination of focal length is often the most
unstable process in conventional camera modeling programs.

The approach taken by Hartley is non-iterative, which
means that the algorithm can be used as a reliable component
in recognition or site modeling systems. Recent
experiments have demonstrated that, for the case
of stereo pairs, the camera models derived by the algorithm
resulted in image registration accuracy of better
than 0.15 pixels.

1.2  Object-Oriented  Design 

In addition to our research on and applications of object
recognition techniques, we are involved in a number of
projects for the application of object-oriented design to
the development of research and application environments.

1.2.1  CMEE 

As part of the DARPA-sponsored RADIUS project,
it is planned to develop a common software environment
(RCDE) to facilitate the exchange of results and
to provide a platform for the demonstration of algorithms
which are targeted at the intelligence exploitation
of aerial images. Currently, the Cartographic Modeling
Environment (CME), which is under development
at SRI International, is being extended and ported to the
UNIX operating system from the Symbolics Lisp Machine.
GE and SRI have teamed to provide the documentation
and extensions to CME to meet the requirements
of the RCDE.³

1.2.2 IUE

A major DARPA project has been initiated to provide
a standard software environment for carrying out image
understanding research and applications. The Image
Understanding Environment, or IUE, is currently being
specified by a committee of senior IU researchers and is
based on an object-oriented representation of the major
data structures and operations used in IU algorithms.

The goal is that the IUE will become a standard for the
exchange of IU data and also provide common interfaces
so that new algorithms and other support code can be
freely exchanged between research institutions. The current
state of affairs is that code is often duplicated and it
is very difficult to evaluate algorithms developed in one
environment on another.

³See the review of RCDE in this proceedings, “The RADIUS
Common Development Environment”, by J.L. Mundy
et al.

The IUE specification is scheduled to be complete by
February 1992. An integrating contractor will be selected
in 1992 and the first prototype of the IUE will be
available for distribution by the end of 1994. The prototypes
will be evolved by a rapid-prototyping strategy
and will use code developed at IU research institutions
under the specification guidelines and supervised by the
integrating contractor.

The IUE project is described in more detail in this
proceedings.⁴

1.2.3  Geometer 

Most of our applications and research systems have
been built on top of a set of geometry and algebra tools
called GEOMETER. We have integrated GEOMETER
with CME on the Symbolics Lisp Machine to provide
a fairly complete environment for experimenting with
constraint-modeling and object recognition. Currently
we are porting our GEOMETER systems to the C++
language and extending the functions of GEOMETER
to handle computations involving invariants. We are attempting
to keep the class hierarchy of GEOMETER++
consistent with CMEE so that we can reintegrate the two
systems after the porting process is stable.

In the remainder of this report we provide more details
about the work and some highlights of the experiments
and demonstrations over the past eighteen months.

2  Constraint-Modeling 

Our underlying approach to constraint modeling is the
observation that all geometric constraints can be represented
in terms of algebraic equations. A constraint-
model is thus a system of polynomial equations, and specific
model instances correspond to roots of the polynomial
system. In several earlier papers we have provided
details of the representation and have shown results on a
number of solution techniques [9, 8]. Two significant
advances have been made in the solution of constraint
equations since our last report in the IU workshop
proceedings.

2.1  Exploiting  Sparse  Matrices 

We have discovered that most of the equations which result
from defining a set of constraints yield a sparse coupling
between variables and equations. When defining
geometric constraints, the variables are associated with
parameters of the geometric primitives such as points,
surfaces and curves. The equations defined by geometric
relations set constraints on the variables, but in most
cases, a given variable only appears in a small fraction
of the equations.

For example, if two planes are required to be perpendicular,
the corresponding equation is $\mathbf{n}_1 \cdot \mathbf{n}_2 = 0$. If
there is an additional perpendicularity constraint with
respect to a third plane, $\mathbf{n}_1 \cdot \mathbf{n}_3 = 0$, it does not involve
any of the parameters associated with the second plane.
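The perpendicularity example can be sketched in code to show where the sparsity comes from. The row layout below (three variables per plane normal, each Jacobian row stored as a dictionary) is an illustrative assumption and not the representation used in GEOMETER.

```python
def perpendicularity_row(normals, i, j):
    """Sparse Jacobian row of the constraint n_i . n_j = 0.

    Returns {variable_index: partial_derivative}, where variable 3*k + a is
    component a of normal k. The derivative with respect to n_i is n_j, and
    vice versa; every other plane's variables are absent from the row.
    """
    row = {}
    for a in range(3):
        row[3 * i + a] = normals[j][a]
        row[3 * j + a] = normals[i][a]
    return row

def sparse_jacobian(normals, constraints):
    """One sparse row per (i, j) perpendicularity constraint."""
    return [perpendicularity_row(normals, i, j) for i, j in constraints]

def fill_fraction(rows, n_vars):
    """Fraction of the full Jacobian that is actually stored."""
    return sum(len(row) for row in rows) / (len(rows) * n_vars)
```

With ten planes (thirty variables), each perpendicularity row stores only six entries, so the stored fraction shrinks as the model grows while the per-constraint cost stays fixed.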

⁴“The Image Understanding Environment Project”, J.L.
Mundy et al.
Figure 1: The entries of the Jacobian matrix of the set
of constraints for the hangar building. The figure shows
that constraints are sparse.

Figure 1 describes the non-zero elements in the Jacobian
matrix for a typical constraint modeling example.
The top half, which is block-diagonal, contains the equations
enforcing the planarity of the faces for all the components in
Figure 2. We have been able to exploit the sparse nature
of the constraint system by implementing the solver using
sparse matrix techniques. When the matrices are
sparse, the processing time can be reduced from O(N^)
to O(N). This approach has allowed us to solve large
constraint systems with up to a thousand variables.

2.2  Avoiding  the  Constraint  Surface 

The  classical  approach  to  solving  constrained  optimiza¬ 
tion  problems  is  to  expand  the  constraint  surface  into  a 
first  or  second  order  polynomial  and  then  move  along  the 
constraint  surface  in  a  direction  defined  by  the  gradient 
of  the  cost  function  while  staying  on  the  surface.  We 
have  found  that  this  approach  leads  to  slow  convergence 
and  difficulties  with  singularities. 

In  the  constraint-based  modeling  application,  a  much 
better  strategy  is  to  set  as  many  of  the  model  param¬ 
eters  as  possible  to  values  determined  from  image  ob¬ 
servations.  The  process  is  described  in  more  detail  as 
follows. 

The two goals of finding the global minimum of a convex
function, $\nabla f(\mathbf{x}) = 0$, and satisfying the constraints,
$\mathbf{h}(\mathbf{x}) = 0$, are combined to give an over-constrained
problem, whose linear approximations give:

$$
\begin{aligned}
\nabla^2 f(\mathbf{x})\,d\mathbf{x} &= -\nabla f(\mathbf{x}) \\
\sqrt{c}\,\nabla \mathbf{h}(\mathbf{x})\,d\mathbf{x} &= -\sqrt{c}\,\mathbf{h}(\mathbf{x})
\end{aligned}
\qquad (1)
$$

Since the two goals are in general conflicting, only the
least-square-error solution can be found. The constraint
equations are multiplied by a factor $\sqrt{c}$. Each iteration
of (1) has a line search that minimizes the least-square
error:

$$
m(\mathbf{x}) = |\nabla f(\mathbf{x})|^2 + c\,|\mathbf{h}(\mathbf{x})|^2 \qquad (2)
$$


Figure 2: The use of constraints to model an aircraft
hangar. The figure shows the convergence process. The
set of lines extending from the building outline corresponds
to the model when the model vertices are set to the
corresponding image locations. After convergence, the
entire model is indistinguishable from the image boundaries.

Starting with $\sqrt{c} = 0$, system (1) converges to the
unconstrained global minimum first, and so avoids local
minima and singularities on the constraint surface. As
noted before, this convergence is trivial, because the fitting
error $f(\mathbf{x})$ is well approximated by a quadratic, and
so the Hessian $\nabla^2 f$ is almost constant. It takes only
one iteration for affine projection, and a few more for
perspective and more nonlinear projections.
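The weighting scheme can be illustrated on a toy problem. The sketch below assumes a quadratic fitting error and a single linear constraint in two variables, so the linearizations in system (1) are exact and the full step is taken in place of a line search; the objective, the constraint, and the weight schedule are all hypothetical choices made for illustration.

```python
OBSERVED = (1.0, 2.0)  # hypothetical image-derived values of two parameters

def grad_f(x):
    """Gradient of the fitting error f(x) = |x - OBSERVED|^2."""
    return [2.0 * (x[0] - OBSERVED[0]), 2.0 * (x[1] - OBSERVED[1])]

HESSIAN_F = [[2.0, 0.0], [0.0, 2.0]]  # constant because f is quadratic

def h(x):
    """Toy constraint h(x) = 0: the two parameters must be equal."""
    return x[0] - x[1]

GRAD_H = [1.0, -1.0]  # constant because h is linear

def solve_2x2(a, b):
    """Solve the 2x2 system a x = b by Cramer's rule."""
    det = a[0][0] * a[1][1] - a[0][1] * a[1][0]
    if abs(det) < 1e-15:
        raise ValueError("singular system")
    return [(b[0] * a[1][1] - a[0][1] * b[1]) / det,
            (a[0][0] * b[1] - a[1][0] * b[0]) / det]

def weighted_step(x, c):
    """Least-squares dx for the stacked system
    HESSIAN_F dx = -grad_f(x), sqrt(c) GRAD_H dx = -sqrt(c) h(x),
    computed from its normal equations."""
    g, hx = grad_f(x), h(x)
    normal = [[sum(HESSIAN_F[k][i] * HESSIAN_F[k][j] for k in range(2))
               + c * GRAD_H[i] * GRAD_H[j] for j in range(2)]
              for i in range(2)]
    rhs = [-sum(HESSIAN_F[k][i] * g[k] for k in range(2)) - c * GRAD_H[i] * hx
           for i in range(2)]
    return solve_2x2(normal, rhs)

def fit(x, weights=(0.0, 1.0, 100.0, 1e6)):
    """Start unconstrained (c = 0), then raise the constraint weight."""
    for c in weights:
        dx = weighted_step(x, c)
        x = [x[0] + dx[0], x[1] + dx[1]]
    return x
```

With the weight at zero, the first step lands on the observed values regardless of the starting point; as the weight grows, the solution moves toward the nearest point satisfying the constraint, here approximately (1.5, 1.5).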

When the Hessian $\nabla^2 f$ is constant, the best-fit surface
is a linear subspace with zero curvature and no singularity,
which is much simpler than the constraint surface. The
independent variables of the best-fit surface correspond to
depths of the visible vertices, which can be freely varied
as long as their projections agree with the image.

All the vertices can be fully constrained if the model
is fit to more than one view, in which case the best fit
fully determines the shape, much as in photogrammetric stereo.
Even with one view, most of the shape is specified, and
this helps avoid local singularities such as collapsing
locations and orientations.

One can slide on the best-fit surface looking for the
point nearest to the constraint surface. This can be
done by first eliminating variables using the gradient
equations, then using the remaining variables to solve
for the least residual $|h(x)|$ of the constraint equations.
Alternatively, system (1) is solved with $\sqrt{c} \ll 1$, with
elimination done largest pivot first.

The convex fitting function $f(x)$ can be viewed as a
regularizer  in  the  solution  of  dx.  It  makes  the  constraint 
problem  well-posed  by  using  empirical  data  whenever  ad¬ 
ditional  constraints  are  needed  to  pin  down  free  variables 
in  h(x)  =  0. 
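
One iteration of system (1), followed by the line search on m(x) of equation (2), can be sketched as a stacked linear least-squares solve. This is only an illustrative sketch: the function name `constrained_step`, the backtracking line search, and the two-variable toy problem are assumptions made for the example and are not taken from the system described here. In the toy problem the fit determines only the first variable (standing in for an image-constrained coordinate), and the constraint then pins down the free second variable.

```python
import numpy as np

def constrained_step(x, grad_f, hess_f, h, jac_h, c):
    """One iteration of system (1): stack the two linearized goals and
    solve for dx in the least-squares sense, with the constraint rows
    weighted by sqrt(c); then line-search on m(x) of equation (2)."""
    w = np.sqrt(c)
    A = np.vstack([hess_f(x), w * jac_h(x)])
    b = -np.concatenate([grad_f(x), w * h(x)])
    dx, *_ = np.linalg.lstsq(A, b, rcond=None)

    def m(y):
        return np.sum(grad_f(y) ** 2) + c * np.sum(h(y) ** 2)

    # Crude backtracking line search (an assumption; any 1-D minimizer would do).
    t = 1.0
    while m(x + t * dx) > m(x) and t > 1e-8:
        t *= 0.5
    return x + t * dx

# Hypothetical toy problem: f(x) = 0.5 * (x0 - 2)^2 leaves x1 free,
# and the single constraint is h(x) = x0 - x1 = 0.
grad_f = lambda x: np.array([x[0] - 2.0, 0.0])
hess_f = lambda x: np.array([[1.0, 0.0], [0.0, 0.0]])
h = lambda x: np.array([x[0] - x[1]])
jac_h = lambda x: np.array([[1.0, -1.0]])

# First pass with c = 0 reaches the unconstrained best fit; a second pass
# with a small c uses the constraint only to fix the remaining free variable.
x_fit = constrained_step(np.array([5.0, -3.0]), grad_f, hess_f, h, jac_h, c=0.0)
x_con = constrained_step(x_fit, grad_f, hess_f, h, jac_h, c=1e-4)
```

Because the constraint rows carry a small weight, the fit equations dominate the solve and the constraint acts only on the direction the fit leaves undetermined, which is the regularizing behaviour described above.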




3  Geometric  Invariants 

An invariant is a property of a set of geometric forms
which does not change with viewpoint. Our premise is
that invariants offer a sound framework for the representation
of objects leading to efficient recognition algorithms.
For the past few years we have worked jointly
with Oxford University to develop and apply geometric
invariants to the problem of object recognition^. A joint
workshop between DARPA and ESPRIT, “Applications
in Computer Vision,” was held in Reykjavik, Iceland in
April, 1991. The workshop brought together the leading
researchers in invariant theory and applications. A
collection of the papers from the workshop is currently
in preparation and will be published by MIT Press [7].
Below are some of the highlights of our recent results in the
application of invariants.

3.1  Invariant  Indexing  Functions 

We have recently finished a first prototype recognition
system for planar objects which can retrieve object
classes in nearly constant time even under extreme
perspective effects and scene complexity [10]. The majority
of the indexing functions used to date in model based
vision are planar projective invariants. In our current
implementation (an extension of [11]), three algebraic
invariants are used.


Invariant  1:  Five  Coplanar  Lines. 

Given five coplanar homogeneous lines $l_i$, $i \in \{1, \ldots, 5\}$,
two projective invariants are defined:

$$I_1 = \frac{|M_{431}|\,|M_{521}|}{|M_{421}|\,|M_{531}|}, \qquad I_2 = \frac{|M_{421}|\,|M_{532}|}{|M_{432}|\,|M_{521}|} \qquad (3)$$

where $M_{ijk} = (l_i, l_j, l_k)$ and $|M|$ is the determinant of
$M$. Should any triple of lines become concurrent the
first  invariant  is  undefined.  This  singular  case  is  common 
for  polygons  where  alternate  sides  are  parallel.  In  these 
cases  we  can  only  use  the  second  invariant  as  a  shape 
descriptor. 

Using  the  duality  of  points  and  lines  the  invariants 
can  be  defined  for  five  coplanar  points.  An  alternative 
definition  appears  in  [5]. 
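
The invariance of equation (3) can be checked numerically. The sketch below uses five points rather than five lines, relying on the duality just mentioned. The point coordinates, the projective map `T`, and the per-point scale factors are arbitrary values chosen for illustration, and the function names are assumptions.

```python
import numpy as np

def det3(p, i, j, k):
    """|M_ijk|: determinant of the 3x3 matrix whose columns are the
    homogeneous vectors i, j, k (1-based indices, as in equation (3))."""
    return np.linalg.det(np.column_stack([p[i - 1], p[j - 1], p[k - 1]]))

def five_point_invariants(p):
    """Return (I1, I2) for five coplanar points (or lines) given as rows of a 5x3 array."""
    i1 = det3(p, 4, 3, 1) * det3(p, 5, 2, 1) / (det3(p, 4, 2, 1) * det3(p, 5, 3, 1))
    i2 = det3(p, 4, 2, 1) * det3(p, 5, 3, 2) / (det3(p, 4, 3, 2) * det3(p, 5, 2, 1))
    return i1, i2

# Five points in general position (no three collinear), in homogeneous form.
pts = np.array([[0, 0, 1], [1, 0, 1], [0, 1, 1], [1, 1, 1], [2, 3, 1]], dtype=float)

# An arbitrary non-singular projective map and arbitrary homogeneous rescalings.
T = np.array([[1.0, 0.2, 0.3], [-0.1, 0.9, 0.5], [0.05, 0.1, 1.0]])
scales = np.array([1.0, 2.0, 0.5, -1.5, 3.0])
pts_t = (pts @ T.T) * scales[:, None]

before = five_point_invariants(pts)
after = five_point_invariants(pts_t)
```

Each index appears equally often in the numerator and denominator of both ratios, so the determinant of `T` and the individual scale factors cancel, and `before` and `after` agree to rounding error.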


Invariant  2:  A  Conic  and  Two  Lines. 


A conic, C, and two lines not tangent to the conic
define a single invariant given by [2]:

$$I = \frac{(l_1^T C^{-1} l_2)^2}{(l_1^T C^{-1} l_1)\,(l_2^T C^{-1} l_2)} \qquad (4)$$

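where C is the matrix of the conic: a conic takes the
form $ax^2 + bxy + cy^2 + dx + ey + f = 0$, which can be
expressed as the quadratic form:

$$\begin{bmatrix} x & y & 1 \end{bmatrix} \begin{bmatrix} a & b/2 & d/2 \\ b/2 & c & e/2 \\ d/2 & e/2 & f \end{bmatrix} \begin{bmatrix} x \\ y \\ 1 \end{bmatrix} = 0, \qquad (5)$$

or as $x^T C x = 0$. By duality a conic and two points also
yield a single invariant.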


^The individuals from Oxford University involved in the
collaboration are A. Zisserman, D. Forsyth (now at the
University of Iowa) and C. Rothwell.


Invariant  3:  A  Pair  of  Conics. 

Using the conic matrix notation $C_i$ introduced above,
two projectively invariant measures can be
formed for a pair of conics normalized so that $|C_i| = 1$.
These are:

$$I_1 = \mathrm{Trace}[C_1^{-1} C_2] \quad \text{and} \quad I_2 = \mathrm{Trace}[C_2^{-1} C_1] \qquad (6)$$
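
A minimal numerical sketch of the conic-pair invariants follows. It assumes the usual transformation rule for a conic under a point map x' = Tx, namely C' = T^{-T} C T^{-1}, and normalizes each matrix to unit determinant before taking the traces. The two example conics, the map `T`, and the function names are illustrative assumptions.

```python
import numpy as np

def conic_matrix(a, b, c, d, e, f):
    """Symmetric C with [x, y, 1] C [x, y, 1]^T = a x^2 + b x y + c y^2 + d x + e y + f."""
    return np.array([[a, b / 2, d / 2],
                     [b / 2, c, e / 2],
                     [d / 2, e / 2, f]], dtype=float)

def normalize(C):
    """Rescale C so that |C| = 1 (det(kC) = k^3 det(C) for a 3x3 matrix)."""
    return C / np.cbrt(np.linalg.det(C))

def conic_pair_invariants(C1, C2):
    """The two traces of the conic-pair invariant, computed on unit-determinant matrices."""
    C1, C2 = normalize(C1), normalize(C2)
    return (np.trace(np.linalg.inv(C1) @ C2),
            np.trace(np.linalg.inv(C2) @ C1))

def transform_conic(C, T):
    """Image of the conic under the point map x' = T x."""
    Ti = np.linalg.inv(T)
    return Ti.T @ C @ Ti

# Example conics: the unit circle and an ellipse centred at (2, 0).
C1 = conic_matrix(1, 0, 1, 0, 0, -1)
C2 = conic_matrix(0.25, 0, 1, -1, 0, 0)

T = np.array([[1.0, 0.2, 0.3], [-0.1, 0.9, 0.5], [0.05, 0.1, 1.0]])
inv_before = conic_pair_invariants(C1, C2)
# Arbitrary rescalings of the transformed matrices are removed by normalize().
inv_after = conic_pair_invariants(3.7 * transform_conic(C1, T),
                                  -0.2 * transform_conic(C2, T))
```

The unit-determinant normalization is what removes the arbitrary scale of each homogeneous conic matrix; without it the traces would change with the scaling.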


3.2  The  Recognition  System 

The  major  functions  of  the  object  modeling  and  recog¬ 
nition  process  are  summarized  as  follows. 

1.  Feature  extraction.  The  conics  and  lines  needed 
to  form  the  invariants  we  use  are  extracted  from 
image  edge  data. 

2. Model construction. A set of features is associated
with a particular object by providing one or
more  images  of  the  object  by  itself.  The  invariant 
feature  vector  is  computed  for  the  set  of  features  and 
the  model  is  added  to  the  library  according  to  the 
feature  vector  index.  The  model  features  are  also 
stored  in  the  library  to  be  used  during  hypothesis 
verification. 

3.  Hypothesis  generation.  The  invariants  for 
groups  of  features  are  computed.  We  index  the  mea¬ 
sured  invariants  against  invariant  values  in  the  li¬ 
brary  using  a  hash  table,  and  if  they  match,  produce 
a  recognition  hypothesis.  Hypotheses  for  common 
objects  are  then  combined  before  verification. 

4.  Hypothesis  verification.  When  a  potential 
match  is  found  we  confirm  it  by  projecting  edge  data 
from  an  acquisition  image  to  the  test  scene.  Should 
the  projected  and  scene  edge  data  be  sufficiently 
close  the  match  is  confirmed.  The  parameters  of 
this  projection  can  be  computed  from  the  geometry 
of  the  same  features  used  to  generate  the  invariant 
index.  It  is  always  the  case  that  enough  feature 
constraints  are  available  to  compute  the  transfor¬ 
mation  when  the  features  define  an  invariant.  The 
invariant  vector  index  also  uniquely  defines  the  cor¬ 
respondence  between  model  features  and  image  fea¬ 
tures  needed  to  solve  for  the  transformation. 
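
The hypothesis generation step above can be sketched as a hash table keyed on quantized invariant vectors. This is a toy illustration under stated assumptions: the class name `InvariantLibrary`, the bin width, the neighbouring-bin lookup, and the invariant values attached to the two model names are all hypothetical and do not describe the actual system.

```python
import itertools
from collections import defaultdict

class InvariantLibrary:
    """Toy model library indexed by quantized invariant vectors."""

    def __init__(self, bin_width=0.05):
        self.bin_width = bin_width
        self.table = defaultdict(list)

    def _key(self, invariants):
        # Quantize each invariant value to an integer bin index.
        return tuple(int(round(v / self.bin_width)) for v in invariants)

    def add_model(self, name, invariants):
        self.table[self._key(invariants)].append(name)

    def hypotheses(self, invariants):
        """Return model names stored under the measured key or one of its
        neighbouring bins (tolerates values that fall near a bin boundary)."""
        base = self._key(invariants)
        found = []
        for offset in itertools.product((-1, 0, 1), repeat=len(base)):
            key = tuple(b + o for b, o in zip(base, offset))
            found.extend(self.table.get(key, []))
        return found

lib = InvariantLibrary()
lib.add_model("bracket", (0.84, 1.27))   # hypothetical invariant values
lib.add_model("spanner", (2.10, 0.45))   # hypothetical invariant values

candidates = lib.hypotheses((0.86, 1.25))  # measured from a scene
```

The lookup cost depends on the number of neighbouring bins examined and not on the number of stored models, which is why retrieval time can stay nearly constant as the library grows. Each returned name is only a hypothesis and still has to pass verification.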

An  example  of  the  system  performance  is  shown  in  fig¬ 
ure  3.  Here  a  bracket  is  found  in  two  orientations  under 
occlusion  and  severe  perspective  distortion.  The  sys¬ 
tem  currently  has  31  models  in  the  library  and  recog¬ 
nition  time  increases  slowly  with  the  size  of  the  library 
due  to  more  false  hypotheses,  but  the  growth  rate  is  far 
slower  than  the  typical  linear  increase  expected  when 
each  model  has  to  be  tested  individually. 

3.3  Canonical  Frames 

As just illustrated, there is considerable benefit in using
invariants to imaging transformations as indexing
functions for generating recognition hypotheses. Indices
should be local and have some redundancy (i.e. several
per outline), so that if one index is occluded there is
a good chance recognition can proceed on other visible
parts; they should be stable, so that small perturbations in
the curve (due to image noise) do not cause large fluctuations
in index value; and they should have sufficient
discriminatory power over models in the library (so that all
models do not have similar index values).

Figure 3: Recognition using projectively invariant indexing functions.

All  of  these  requirements  are  satisfied  by  a  bitangent 
construction,  where  a  bitangent  is  a  line  tangent  to  a 
curve  at  two  points  on  the  curve.  Provided  the  object 
outline  is  sufficiently  rich  in  structure  there  will  be  sev¬ 
eral  such  constructions  for  each  object  -  and  thus  redun¬ 
dancy  in  the  representation  giving  partial  immunity  to 
occlusion. 

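We use tangency to select 4 distinguished points on
the curve as described in figure 4. A projective transformation
that takes these four points to the corners of the unit square
then maps the curve into the canonical frame.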

Figure 5 demonstrates this process for one concavity
of a spanner. The point is that the object curve, and any
projective view of it, are mapped into the same curve.
Consequently, any (metric) measurements made in this
frame are invariant descriptors and hence may be used
as index functions to recognize the object (figure 5).
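
The mapping into the canonical frame can be sketched as follows: estimate the projective transformation that takes the four distinguished points to the corners of the unit square, then apply it to the curve. The coordinates, the simulated view `T`, and the function names are assumptions for illustration; the extraction of the distinguished points from bitangents is not shown.

```python
import numpy as np

UNIT_SQUARE = np.array([[0.0, 0.0], [1.0, 0.0], [1.0, 1.0], [0.0, 1.0]])

def homography_from_4pts(src, dst):
    """3x3 projective map H (defined up to scale) with H @ src_i ~ dst_i
    for four point pairs, found as the null vector of the 8x9 system."""
    rows = []
    for (x, y), (u, v) in zip(src, dst):
        rows.append([x, y, 1, 0, 0, 0, -u * x, -u * y, -u])
        rows.append([0, 0, 0, x, y, 1, -v * x, -v * y, -v])
    _, _, vt = np.linalg.svd(np.array(rows, dtype=float))
    return vt[-1].reshape(3, 3)

def apply_h(H, pts):
    """Apply H to an (n, 2) array of inhomogeneous points."""
    ph = np.column_stack([pts, np.ones(len(pts))]) @ H.T
    return ph[:, :2] / ph[:, 2:3]

def to_canonical(distinguished, curve):
    """Map a curve into the frame where the four distinguished points
    become the corners of the unit square."""
    H = homography_from_4pts(distinguished, UNIT_SQUARE)
    return apply_h(H, curve)

# Hypothetical distinguished points and curve samples on the object.
quad = np.array([[0.2, 0.1], [1.5, 0.3], [1.2, 1.4], [0.1, 1.0]])
curve = np.array([[0.5, 0.5], [0.8, 0.9], [1.0, 0.4]])

# Simulate a second, projectively distorted view of the same object.
T = np.array([[1.0, 0.2, 0.3], [-0.1, 0.9, 0.5], [0.05, 0.1, 1.0]])
canon_obj = to_canonical(quad, curve)
canon_view = to_canonical(apply_h(T, quad), apply_h(T, curve))
```

Since a projective map of the plane is fixed by four points in general position, the two canonical curves coincide, and measurements taken on them can serve as index values.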

3.4  Rotationally  Symmetric  Surfaces 

3.5  Recognising  rotationally  symmetric 
surfaces  from  their  outlines 

This  section  sketches  techniques  used  to  recognise  rota¬ 
tionally  symmetric  surfaces  from  an  outline.  The  work 
is described in more detail in an internal report [1]. The
outline  of  a  surface  in  an  image  is  given  by  a  system  of 
rays  through  the  camera  focal  point  that  are  tangent  to 
the  surface.  The  points  of  tangency  of  these  rays  with 
the  surface  form  a  space  curve,  called  the  contour  gen¬ 
erator. 

Points  on  the  contour  generator  are  distinguished,  be¬ 
cause  the  plane  tangent  to  the  surface  at  such  points 
passes  through  the  focal  point  (this  is  an  alternative  def¬ 
inition  of  the  contour  generator).  As  a  result,  we  have: 

Lemma: Except where the image outline
cusps^, a plane tangent to the surface at a point
on the contour generator (by definition, such a
plane passes through the focal point) projects
to a line tangent to the surface outline, and
conversely, a line tangent to the outline is the
image of a plane tangent to the surface at the
corresponding point on the contour generator.

^We ignore cusps in the image outline in what follows.

As  a  corollary,  we  have: 

Corollary  1:  A  line  tangent  to  the  outline 
at  two  distinct  points  is  the  image  of  a  plane 
through  the  focal  point  and  tangent  to  the  sur¬ 
face  at  two  distinct  points,  both  on  the  contour 
generator. 

This  yields  useful  relationships  between  outline  proper¬ 
ties  and  surface  properties.  For  example: 

Corollary 2: The intersection of two lines
bitangent to the outline is a point, which is the
image of the intersection of the two bitangent
planes represented by the lines.

The lemma and both corollaries follow immediately from
figure 6.

Generic  surfaces  admit  one-parameter  systems  of  bi¬ 
tangent  planes,  so  we  can  expect  to  observe  and  exploit 
intersections  between  these  planes.  One  case  in  which 
the  bitangent  intersections  are  directly  informative  oc¬ 
curs  when  the  surface  is  rotationally  symmetric.  In  this 
case,  the  envelope  of  the  bitangent  planes  must  also  be 
rotationally  symmetric,  and  be  ruled.  The  rotational 
symmetry  follows  from  the  rotational  symmetry  of  the 
surface,  and  the  fact  that  the  envelope  is  ruled  follows 
by considering a plane section of the surface through its
axis  of  symmetry.  Notice  that,  as  figure  6  shows,  the 
image  outline  generically  has  no  symmetry  in  this  case. 

Because  it  is  rotationally  symmetric  and  ruled,  the  en¬ 
velope  is  either  a  right  circular  cone,  or  a  cylinder  with 
circular  cross-section  (this  is  a  right  circular  cone  whose 
vertex  happens  to  be  at  infinity).  We  shall  draw  no  dis¬ 
tinction  between  vertices  at  infinity  and  more  accessible 
vertices,  and  refer  to  these  envelopes  as  bitangent  cones. 
These  comments  lead  to  the  following 

Key result: The vertices of these bitangent
cones must lie on the axis (by symmetry), and
so are collinear. Assuming the focal point lies
outside the surface, as figure 6 shows, the vertices
of the bitangent cones can be observed in
an image. The vertices appear as the intersection
of a pair of lines bitangent to the outline.

Figure 4: (a) Construction of the four points necessary to define the canonical frame. The first two points (A, D)
are the points of bitangency that mark the entrance to the concavity. Two further distinguished points (B, C) are
obtained from rays cast from the bitangent contact points and tangent to the curve segment within the concavity.
These four points are used to map the curve to the canonical frame. (b) Curve in canonical frame. A projection is
constructed that transforms the four points in (a) to the corners of the unit square. The same projection transforms
the curve into this frame.

Figure 5: (a)-(c) Three views of a spanner with extracted concavity curves and distinguished points marked. Note the
very different appearance due to perspective effects. (d) Canonical frame curves for the three different views of a
spanner. This demonstrates the stability of the method. Of course the same curve would result from a projective
transformation between the object and canonical frame.

Figure 6: A rotationally symmetric object, and the
planes bitangent to the object and passing through the
focal point, are shown. It is clear from the figure that
the intersection of these planes is a line, also passing
through the focal point. Each plane appears as a line
in the image: the intersection of the planes appears as a
point, which is the image of the vertex of the bitangent
cone. Note in particular that the image outline has no
symmetry. This is the generic case.

As  a  result,  if  the  surface  has  four  or  more  bitan¬ 
gent  cones,  the  vertices  yield  a  system  of  four  or  more 
collinear  points,  lying  on  the  axis  of  the  surface.  These 
points project to points that can be measured in the image.
This fact yields two important applications:

• Cross-ratios of the image points, which are projectively
invariant, yield indexing functions for the
surface, which can be determined from the outline
alone. These indexing functions can be used to
recognise the surface.

• The image points can be used to construct the image
of the axis of a rotationally symmetric surface from
its outline.

Since  a  change  in  camera  parameters  simply  changes 
the  details  of  the  projection  of  the  points,  but  does  not 
change  the  fact  that  the  map  is  a  projection,  these  cross- 
ratios  are  invariant  to  changes  in  the  camera  parameters, 
as  well  as  to  camera  position  and  orientation. 
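
The invariance claimed here can be illustrated with a short numerical sketch: four collinear points in space (standing in for cone vertices on the axis) are projected by two different pinhole cameras, and the cross-ratio of the image points is compared. The point positions, camera parameters, and function names are assumptions chosen for the example.

```python
import numpy as np

def cross_ratio(p1, p2, p3, p4):
    """Cross-ratio of four collinear image points, computed from their
    signed positions along the line through p1 and p4."""
    pts = np.array([p1, p2, p3, p4], dtype=float)
    direction = pts[3] - pts[0]
    t = (pts - pts[0]) @ direction / (direction @ direction)
    return ((t[2] - t[0]) * (t[3] - t[1])) / ((t[2] - t[1]) * (t[3] - t[0]))

def project(P, X):
    """Pinhole projection of (n, 3) world points by a 3x4 camera matrix."""
    xh = np.column_stack([X, np.ones(len(X))]) @ P.T
    return xh[:, :2] / xh[:, 2:3]

# Four points on a hypothetical axis: X = X0 + s * d for s = 0, 1, 2.5, 4.
X0 = np.array([0.5, -0.2, 4.0])
d = np.array([0.3, 1.0, 0.4])
s = np.array([0.0, 1.0, 2.5, 4.0])
X = X0 + s[:, None] * d

# Camera 1: unit focal length at the origin.
P1 = np.hstack([np.eye(3), np.zeros((3, 1))])

# Camera 2: different intrinsics, rotated about the y axis and translated.
K = np.array([[800.0, 0.0, 320.0], [0.0, 780.0, 240.0], [0.0, 0.0, 1.0]])
a = 0.3
R = np.array([[np.cos(a), 0.0, np.sin(a)], [0.0, 1.0, 0.0], [-np.sin(a), 0.0, np.cos(a)]])
t = np.array([0.2, 0.1, 1.0])
P2 = K @ np.hstack([R, t[:, None]])

cr1 = cross_ratio(*project(P1, X))
cr2 = cross_ratio(*project(P2, X))
cr_axis = ((s[2] - s[0]) * (s[3] - s[1])) / ((s[2] - s[1]) * (s[3] - s[0]))
```

Both image cross-ratios equal the cross-ratio of the positions along the axis itself, whatever the intrinsic parameters or pose of the camera.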

This  approach  can  be  generalized  in  two  ways.  Firstly, 
there  are  other  sources  of  vertices  than  bitangent  lines. 
Secondly,  the  geometrical  construction  described  works 
for  a  wider  range  of  surfaces  than  the  rotationally  sym¬ 
metric  surfaces. 

At present, there are four further known sources of
vertices, illustrated in figure 7:

• The tangents at a crease in the outline.

• The tangents at an ending in the outline.

• A tangent that passes through an ending in
the outline.

• Inflections of the outline.

In  each  case,  there  is  a  clear  relationship  between  the 
tangent  to  the  outline  and  a  plane  tangent  to  the  sur¬ 
face.  In  each  case,  the  envelope  of  the  system  of  planes 
tangent  to  the  surface  and  having  the  required  property, 
is  a  cone  with  a  vertex  along  the  axis.  These  sources  of 
information  are  demonstrated  in  figure  7. 

All  the  constructions  described  for  vertices  are  pre¬ 
served  under  projectivities  of  three-space:  this  means 
that,  for  example,  given  a  plane  tangent  to  a  surface  at 
two  points,  a  projective  mapping  takes  the  surface  to  a 
new  surface,  and  the  bitangent  plane  to  a  plane  bitan¬ 
gent  to  the  new  surface,  where  the  points  of  tangency 
are  given  by  applying  the  map  to  the  original  points 
of  tangency.  As  a  result,  all  these  constructions  apply 
to surfaces that are projectively equivalent to rotationally
symmetric surfaces, for example, surfaces obtained
by sweeping a given ellipse along an axis, and scaling
it while sweeping it. Figure 8 shows two views of the
base of a lamp, with bitangents marked. These figures
yield values of the cross-ratio within 5% of one another.
Thus, a cross-ratio formed in this way is a useful
indexing function.

Figure 7: The five known cases that produce usable coaxial vertices. Note that although this figure appears to have
a reflectional symmetry, this is not a generic property of the outline of a rotationally symmetric object.


Figure 8: Two views of a lamp base, illustrating bitangent construction.

Abstract

This paper presents an overview of the research in
image understanding (IU) at the University of Illinois
(UI) conducted during 1990-91. During this
period, we have made progress in four areas: integration
in three-dimensional vision, motion analysis,
recovery based scene visualization, and representation
and navigation. Work in each of these
areas is reviewed.

1.  Introduction 

A  major  part  of  our  recent  research  is  in  four  areas  of 
image  understanding.  The  first  area  (Sec.  2)  deals  with 
integration  of  multiple  image  cues  in  performing  image 
interpretation.  These  cues  capture  different  aspects  of 
the  scene  structure,  and  their  integrated  analysis  leads 
to  a  more  robust  inference  about  the  scene  characteris¬ 
tics  than  possible  from  individual  cues.  The  second  area 
(Sec.  3)  is  concerned  with  our  work  on  interpretation  of 
image  sequences  showing  dynamic  scenes.  Here  we  con¬ 
sider  the  problems  of  detecting  feature  correspondences 
and  estimating  the  three-dimensional  (3D)  motion  pa¬ 
rameters  and  the  3D  surface  structure  from  feature  cor¬ 
respondences,  over  a  sequence  of  images  showing  rigid 
or  nonrigid  motion.  Projects  in  the  third  area  (Sec.  4) 
report  work  on  visualization  of  scenes  using  attributes 
recovered during interpretation, or depiction of 3D characteristics
using artificial attributes. The use of image
attributes during 3D recovery for visualization takes advantage
of the most powerful image cues of scene structure
for perceptually effective scene representation. The
fourth area (Sec. 5) is concerned with different components
of an evolving 3D representation and navigation
system which has the goal of autonomously acquiring,
maintaining and using 3D information about the environment.
Representative projects in each of these areas
are summarized in the following sections. To keep the
paper brief, we have minimized discussion of and references
to relevant work done by others. Such discussion
and references are available in the cited and other listed
publications.

...under grant N00014-90-J-1270, and the State of Illinois
Department of Commerce and Community Affairs under grant...

2.  Integration 

Our goal in this area is to perform 3D or other interpretation
of images, such that the interpretation simultaneously
satisfies a range of constraints imposed by
the image structure and the model of the scene. To
do this, we use different computational processes each
of which carries complementary or redundant information
derived from different image cues. Image interpretation
is the result of a cooperative computation that
resolves conflicts and ambiguities arising from the individual
processes. We have presented several examples
of the integration approach in previous IU workshops
[36, 37]. One formalism for unifying our work on integration
is presented in [35, 20]. Here we summarize
some recent work on integration.

2.1.  Integrated  Active  Stereo 

The goal of active stereo is surface reconstruction from
stereo images of large scenes having large depth ranges,
where it is necessary to aim cameras in different directions
to fixate at different objects and to construct the
global surface map of the scene from small patches. The
first stage of this work involved surface reconstruction of
a single object, having no depth discontinuities. It performs
integration of camera vergence, focus, aperture,
stereo and calibration processes [37, 4]. Early work on
the second stage, involving arbitrarily placed and arbitrary
size objects, was reported in [19, 18]. This second
stage has now been completed [22, 24]. In this stage,
a part of the visual field that has not yet been fixated
but has appeared as the peripheral visual field during
a fixation provides coarse (inaccurate) structural information,
to be refined during future fixations. The availability
of coarse peripheral maps makes it possible to select
a new fixation point on a new object.

The process of fixation has three components: moving the optic axes
so they intersect at the fixation point (vergence control),
registering the sharpest image of the fixation point (focus
control), and obtaining the best depth estimate of
the fixation point by optimal combination of the estimates
from focus, vergence, and stereo. Coarse-to-fine
images are acquired and analyzed during the fixation
process, thus representing integration of coarse-to-fine
image acquisition and coarse-to-fine surface reconstruction.
The coarseness of the images results naturally
from optical blurring in the vicinity of the target point
during the current fixation. The depth estimates obtained
during fixation from focus and vergence are used
as independent observations in addition to stereo to reduce
the uncertainty in the depth of the fixation point.
Analytical evaluation of the performances of stereo, vergence,
and focus is used to devise optimal strategies for
their integration. The final depth estimate from stereo,
focus and vergence is obtained in a way such that the
uncertainty of the fused estimate is less than that of
the individual depth estimates.

The various objects
in a scene are represented by the vertices of a graph.
The problem of surface reconstruction for the scene then
amounts to fixating these objects successively and scanning
their surfaces. Assuming that no object is fixated
twice, the order of fixation is determined by finding a
path in the graph. If a cost is assigned to every change of
fixation from one object to another, then the problem
is one of finding the minimal cost path in the graph.
However, since the number of objects and their spatial
relationships are unknown until the scene has been
completely scanned, only a locally minimal cost path is
determined.
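
Two of the ideas above can be sketched in a few lines. The fusion rule shown is standard inverse-variance weighting of independent estimates, which is one common way to obtain a fused uncertainty below that of every input; it is an assumption here and may differ from the strategy derived in [22, 24]. The fixation ordering is a simple greedy rule that always moves to the cheapest unvisited object. It assumes a fully known cost matrix for illustration, whereas the system described above discovers objects as it scans. All numbers are made up.

```python
import numpy as np

def fuse_depths(estimates, variances):
    """Inverse-variance weighted combination of independent depth estimates.
    Returns (fused_depth, fused_variance); the fused variance is smaller
    than every input variance when there are two or more inputs."""
    z = np.asarray(estimates, dtype=float)
    w = 1.0 / np.asarray(variances, dtype=float)
    fused_var = 1.0 / w.sum()
    return fused_var * (w * z).sum(), fused_var

def greedy_fixation_order(cost, start=0):
    """Visit every object once, always moving to the cheapest unvisited one.
    cost[i][j] is the cost of changing fixation from object i to object j."""
    n = len(cost)
    order, visited = [start], {start}
    while len(order) < n:
        here = order[-1]
        nxt = min((j for j in range(n) if j not in visited),
                  key=lambda j: cost[here][j])
        order.append(nxt)
        visited.add(nxt)
    return order

# Hypothetical depth estimates (metres) from stereo, focus and vergence.
depth, var = fuse_depths([2.00, 2.10, 1.95], [0.04, 0.09, 0.01])

# Hypothetical fixation-change costs between four objects.
cost = [[0, 2, 9, 4],
        [2, 0, 3, 7],
        [9, 3, 0, 1],
        [4, 7, 1, 0]]
order = greedy_fixation_order(cost)
```

The greedy path is only locally minimal: each move is the cheapest available at that moment, with no guarantee about the total cost of the path.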

2.2.  Integrated  3D  Mapping  and 
Calibration 

The work just described involves integration of the
camera calibration process with surface reconstruction.
Thus, the camera calibration parameters are treated as
variables (rather than constants given by the encoder
readings), which makes the camera calibration process an
integral, flexible component of surface reconstruction.
As a result, despite using ordinary cameras prone to calibration
errors during movements, the integration could
achieve an average absolute estimated depth error of
less than 0.15% for a large surface having a depth of
approximately 2 meters [4].

The  above  loss  of  calibration  does  not  involve  mono¬ 
tonic  accumulation  of  errors  because  the  camera  move¬ 
ments  are  periodic  with  bounded  errors.  However,  dur¬ 
ing  unrestricted  movement  of  a  mobile  platform  hous¬ 
ing  the  cameras,  the  calibration  errors  may  accumulate 
due  to  factors  such  as  wheel  slippage.  Such  errors  can 
only  be  corrected  using  external  spatial  references.  Two 
such  references  are:  partial  reconstructions  of  scene  sur¬ 
faces  and  recognizable  landmarks.  To  use  these  requires 
that the scene information acquired from different camera
configurations and viewpoints be combined for integrated
estimation of scene structure and camera motion.

We have now addressed this general problem of integrating
piecewise scene descriptions derived from different
viewing configurations, under uncertainties in the
knowledge of viewing parameters [30]. The type of descriptions,
the viewing parameters undergoing change,
and  the  nature  of  error  sources  may  vary  but  many  el¬ 
ements  of  our  treatment  of  the  problem  would  apply. 
Specifically,  we  have  considered  the  problem  of  a  roving 
active  stereo  system  which  acquires  partial  depth  maps 
of  the  scene  from  different  viewpoints.  The  surface  in¬ 
formation  obtained  from  dynamic  cameras  involves  er¬ 
ror  accumulation  during  extended  motion.  Generally, 
the  amount  of  deviation  is  proportional  to  the  distance 
travelled. Therefore, while the information from a single
viewpoint can provide a reliable localized surface map,
the accuracy of the global map decreases continuously.
We  have  studied  the  effect  of  using  a  large  number  of 
the  most  recently  acquired  images  and  the  correspond¬ 
ing  incremental  scene  information  to  update  the  global 
map,  instead  of  using  only  the  most  recent  single  frame. 
This  work  has  obvious  overlap  with  the  work  on  navi¬ 
gation  reviewed  in  Sec.  5. 

2.3.  Integrating  Region,  Border  and 
Component  Gestalt  for  Extracting 
Perceptual  Structure 

This  research  concerns  perceptual  grouping,  or  goal  in¬ 
dependent  detection  of  perceptual  organization  in  im¬ 
ages.  Elsewhere,  we  have  reviewed  our  approach  to  per¬ 
ceptual  grouping  of  dots  that  integrates  multiple  con¬ 
straints,  active  at  different  perceptual  levels  and  having 
different  scopes  in  the  dot  pattern  [37].  We  have  inves¬ 
tigated  two  extensions  to  this  work  [33].  The  first  is  the 
use  of  human  training  to  identify  the  various  percep¬ 
tual  roles  that  a  dot  can  play.  The  second  is  the  use  of 
systematic  feature  selection  techniques  to  identify  the 
most  relevant  features  in  the  computation  of  perceptual 
grouping. Here, we have concentrated on the identification
of interior and border dots. A total of 23 features
are  initially  used  to  characterize  the  geometric  proper¬ 
ties  of  the  Voronoi  polygons  defined  by  the  dots.  These 
features  include  the  moments  of  area  and  the  various 
invariants  computed  from  them,  compactness,  elonga¬ 
tion,  and  eccentricity.  By  performing  feature  selection, 
we  have  shown  that  the  following  six  features  are  the 
most  useful:  (a)  eccentricity  magnitude,  (b)  x  direction 
and  (c)  y  direction  of  the  eccentricity  vector,  (d)  the 
elongation  magnitude,  (e)  the  x  direction  and  (f)  the  y 
direction  of  the  major  axis  of  the  cell.  The  assignment 
of  perceptual  roles  is  accomplished  by  an  initial  classifi¬ 
cation  using  the  training  data  and  the  nearest  neighbor 
method  in  the  feature  space.  This  initial  non-contextual 
classification  is  then  refined  using  a  probabilistic  relax¬ 
ation  technique  that  uses  Gestalt  criteria  such  as  border 
smoothness. 
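
The initial, non-contextual classification step can be sketched as a nearest neighbor lookup in feature space. The sketch uses two-dimensional feature vectors and made-up training data purely for illustration; the six selected features, the human-labelled training set, and the subsequent probabilistic relaxation are not reproduced here, and the function name is an assumption.

```python
import numpy as np

def nearest_neighbor_labels(train_x, train_y, query_x):
    """Label each query feature vector with the label of its nearest
    training vector (Euclidean distance in feature space)."""
    train_x = np.asarray(train_x, dtype=float)
    query_x = np.asarray(query_x, dtype=float)
    # Squared distances between every query and every training vector.
    d2 = ((query_x[:, None, :] - train_x[None, :, :]) ** 2).sum(axis=2)
    return [train_y[i] for i in np.argmin(d2, axis=1)]

# Hypothetical training features for Voronoi cells with labelled roles.
train_x = [[0.1, 0.0], [0.2, 0.1], [0.9, 0.8], [0.8, 0.9]]
train_y = ["interior", "interior", "border", "border"]

roles = nearest_neighbor_labels(train_x, train_y, [[0.15, 0.05], [0.85, 0.85]])
```

In the approach described above, labels obtained this way are only a starting point and are then revised using contextual criteria such as border smoothness.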
3. Motion Analysis

The  long-range  goal  of  our  research  in  this  area  is 
the  understanding  of  dynamic  scenes.  We  have  made 
progress in three major subareas: finding feature correspondences
in image sequences, determining rigid motion
parameters  and  surface  structure  from  the  correspon¬ 
dences,  and  analyzing  nonrigid  motion. 

3.1.  Detecting  Feature  Correspondences 

Detecting  feature  correspondences  is  difficult  due  to  a 
wide  variety  of  three-dimensional  structural  discontinu¬ 
ities  and  occlusions  that  occur  in  real  world  scenes.  Our 
work  on  this  problem  is  concerned  with  matching  point 
features  and  is  divided  into  two  categories,  according  to 
whether  the  correspondences  are  detected  in  two,  long 
range  views,  or  in  a  dense  image  sequence. 

In  the  first  category,  we  have  developed  a  matching 
algorithm  for  obtaining  feature  point  correspondences 
across  images  containing  rigid  objects  undergoing  differ¬ 
ent  motions  [31].  The  algorithm  is  applied  recursively  to 
coarse-to-fine  resolution  images.  At  each  resolution,  a 
point  feature  detector  derived  from  the  Moravec  interest 
operator  is  applied  to  obtain  a  number  of  distinguished 
feature  points.  Then  a  variety  of  matching  constraints 
are applied sequentially, starting with the simplest and following
with more informed ones. First, an intensity-based
matching algorithm is used to obtain unique point
correspondences.  This  is  followed  by  the  application 
of  a  sequence  of  newly  developed  constraints  involving 
rigidity  and  disparity.  The  rigidity  tests  match  geomet¬ 
rical  relationships  among  the  feature  points,  while  the 
disparity  test  ensures  that  no  matched  feature  point  in 
an  image  could  be  rematched  with  a  different  feature 
if  it  were  reassigned  a  disparity  value  associated  with 
another  matched  pair.  The  computational  complexity 
at  each  resolution  level  is  proportional  to  the  number  of 
detected  feature  points  in  the  two  images  at  that  level. 
Experimental  results  with  several  kinds  of  indoor  and 
outdoor  scenes  show  that  the  algorithm  yields  only  cor¬ 
rect  matches  for  scenes  containing  rigid  objects. 

We  have  initiated  a  research  project  in  applying  con¬ 
cept  learning  to  matching  [38].  Much  significant  work 
has  been  done  in  the  field  of  machine  learning  in  terms 
of  producing  excellent  classifiers.  All  of  this  work  can  be 
brought  to  bear  on  the  matching  problem.  By  exploit¬ 
ing  the  information  in  the  template  image,  we  can  derive 
a set of negative examples. Specifically, if there is any
ambiguity  in  matching  to  the  object  image,  it  will  be 
caused  by  similar  points  in  the  object  image  in  terms  of 
the  error  measure  used  for  matching.  Since  there  is  over¬ 
lap  between  the  template  and  object  images,  the  points 
which  would  have  caused  problems  in  direct  matching 
should  exist  in  the  template  image.  Thus,  we  find  points 
similar  to  the  template  point.  We  know  that  all  of  these 
similar  points  are  sources  of  ambiguity  so  we  label  them 
as  negative  examples,  and  the  template  point  is  the  pos¬ 
itive  example.  Then  we  can  pass  this  set  of  examples 
to a machine learning algorithm which will learn a concept
of the template point. Finally, we classify each
point  in  the  search  area  in  the  object  image,  and  se¬ 
lect  the  point  which  classifies  closest  to  the  template 
point  as  the  correct  match.  The  preliminary  empirical 
results  on  a  stereo  pair  of  a  street  were  promising  at  96% 
correct matches of zero-crossings. Tests of the training
data matrix with a neural network concept learner, a
specializer, and a generalizer are scheduled next.

In  the  second  category,  we  have  developed  a  sequen¬ 
tial  hypothesis  testing  approach  to  find  trajectories  of 
feature  points  in  a  dense  image  sequence.  This  is  based 
on  our  earlier  work  on  detection  of  small,  faint  moving 
objects  from  many  frames.  More  details  can  be  found 
in  a  separate  paper  in  the  proceedings  [32]  and  in  [34]. 

3.2.  Rigid  Motion  and  Structure  from 
Two-View  Correspondences 

Our  work  in  this  area  is  concerned  with  estimating  mo¬ 
tion  and  structure  of  a  scene  from  feature  correspon¬ 
dences  over  two  views  showing  long  term  motion. 

We  have  developed  an  integrated  system  for  3D  mo¬ 
tion  analysis  and  object  recognition  with  outdoor  stereo 
images  as  inputs  [41].  The  goals  are  to  obtain  the  3D 
motion  description  and  the  identification  of  the  object 
in  the  input  stereo  images.  We  use  stereo  images  for  mo¬ 
tion  estimation  and  only  the  left  image  of  a  stereo  image 
pair  for  the  processes  related  to  object  recognition.  The 
system  consists  of  four  stages.  These  are  (i)  motion  es¬ 
timation,  (ii)  distinctive  feature  extraction,  (iii)  model 
database,  and  (iv)  object  recognition.  The  motion  es¬ 
timation  is  based  on  3D  point  correspondences  which 
can  be  derived  from  matched  points  on  the  stereo  im¬ 
ages.  In  this  stage,  we  obtain  the  following:  (1)  motion 
parameters,  (2)  3D  centroid  locations  of  the  object  and 
(3)  the  region  of  interest  (region  of  the  object  projected 
on  an  image).  Both  (2)  and  (3)  are  used  in  other  stages 
to  extract  properties  of  the  object  on  the  input  images. 
For  object  recognition,  we  use  two  features  that  can  be 
extracted  consistently  from  a  grey-level  image  to  de¬ 
scribe  an  object.  These  two  features  are  the  region  of 
interest  and  the  wheel-pattern  of  a  vehicle.  The  first 
feature,  region  of  interest,  can  be  obtained  from  motion 
detection  while  the  wheel-pattern  is  extracted  by  a  mod¬ 
ified  Hough  transform  algorithm.  The  model  database 
is  used  to  generate  the  perspective  view  of  each  model 
library  for  comparison  with  the  input  image.  It  requires 
four  parameters  (roll,  azimuth,  elevation  and  distance) 
to  generate  a  perspective  view.  These  four  parameters 
can be derived from the 3D centroid locations and translation
directions obtained in motion estimation. From
the  generated  views,  we  know  the  wheel-pattern  and 
the  region  of  interest  of  each  model  library.  In  addi¬ 
tion,  the  expected  size  of  wheel  can  be  found  and  used 
in  the  wheel  detection  algorithm  to  reduce  the  compu¬ 
tation  time.  With  the  required  features,  namely  wheel- 
pattern  and  region  of  interest,  of  the  input  image  and 
each  model  library,  we  can  generate  the  attribute  list 
sets of the image and the model libraries. A typical attribute
list set has four elements which are (i) region of
interest,  (ii)  number  of  wheels,  (iii)  locations  of  wheels 
and  (iv)  size  of  wheels.  By  comparing  the  attribute 
list  set  of  the  image  with  the  attribute  list  set  of  each 
model  library,  a  confidence  measure  can  be  computed. 
The  vehicle  in  the  input  images  is  identified  to  be  one 
of  the  model  libraries  represented  by  its  corresponding 
perspective  view  with  the  highest  value  in  confidence 
measure.  We  have  used  this  approach  on  sets  of  stereo 
image  pairs.  The  experimental  results  show  that  the 
system  can  provide  the  3D  motion  description  and  the 
identification  of  the  vehicle  in  all  the  stereo  image  pairs 
successfully. 
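
The text states only that motion estimation is based on 3D point correspondences obtained from matched stereo points. One standard way to compute a rigid motion from such correspondences is the SVD-based least-squares fit (often called the Kabsch or Procrustes solution). The sketch below uses it as a stand-in and should not be read as the estimator of [41]; the function name and the (N, 3) array layout are assumptions.

```python
import numpy as np

def rigid_motion_from_3d_points(P, Q):
    """Least-squares R, t such that Q[i] is close to R @ P[i] + t.

    P and Q are (N, 3) arrays of corresponding 3D points before and
    after the motion (an assumed layout for this example).
    """
    P = np.asarray(P, dtype=float)
    Q = np.asarray(Q, dtype=float)
    cp, cq = P.mean(axis=0), Q.mean(axis=0)
    H = (P - cp).T @ (Q - cq)            # 3x3 cross-covariance matrix
    U, _, Vt = np.linalg.svd(H)
    # Force det(R) = +1 so that a reflection is never returned.
    d = np.sign(np.linalg.det(Vt.T @ U.T))
    R = Vt.T @ np.diag([1.0, 1.0, d]) @ U.T
    t = cq - R @ cp
    return R, t

# Synthetic check: rotate about the z axis, translate, then recover.
rng = np.random.default_rng(0)
P_demo = rng.normal(size=(10, 3))
c, s = np.cos(0.3), np.sin(0.3)
R_true = np.array([[c, -s, 0.0], [s, c, 0.0], [0.0, 0.0, 1.0]])
t_true = np.array([1.0, -2.0, 0.5])
Q_demo = P_demo @ R_true.T + t_true
R_est, t_est = rigid_motion_from_3d_points(P_demo, Q_demo)
```

The centroid of the matched 3D points falls out of the same computation (`cp` and `cq`), which is consistent with the centroid locations being listed among the outputs of the motion estimation stage.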

We  have  developed  an  algorithm  which  matches  im¬ 
age  regions  to  non-iteratively  estimate  3D  motion  and 
structure  of  a  moving  piecewise  planar  textured  surface 
from  two  perspective  views  [29].  The  algorithm  has  two 
major  steps.  In  the  first  (coarse)  step,  the  local  pla¬ 
nar  nature  of  the  surface  is  used  to  obtain  polynomial 
expressions  for  image  plane  displacements  of  features. 
With  regions  as  moving  features,  the  image  is  segmented 
using  Hough  transform  such  that  the  regions  in  each  seg¬ 
ment  have  the  same  polynomial  coefficients.  The  values 
of  these  coefficients  and  region  properties  (e.g.,  area)  are 
then  used  to  identify  region  correspondences.  In  the  sec¬ 
ond  step,  the  region  correspondences  are  used  to  com¬ 
pute  the  motion  parameters  and  surface  orientation  in 
closed  form  for  each  planar  patch.  The  second  step  uses 
a  finer  model  of  motion  than  the  first  step.  Further,  we 
have  identified  sufficient  conditions  for  double  or  unique 
solution  of  the  problem  of  motion  and  structure  from 
two  monocular  images.  We  list  several  properties  of 
the  essential  matrix  and  the  plane  motion  matrix,  both 
of which are frequently used in the motion and structure
estimation problem. We have developed a robust
algorithm  and  some  essential  results  for  plane  motion 
estimation  from  two-view  point  correspondences.  We 
present necessary and sufficient conditions for plane motion
solution. We prove that three-view matching does
not  necessarily  make  the  solution  unique.  Further,  we 
have  developed  a  two-view  nonlinear  motion  algorithm 
which  has  the  following  salient  features:  (i)  it  applies 
to  any  surface  undergoing  rigid  motion;  (ii)  it  globally 
minimizes an objective function; (iii) it is fast; and (iv)
it  can  determine  whether  the  motion  is  unique.  This 
work  is  presented  in  [27,  21]. 
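
The specific properties listed in [27, 21] are not reproduced here. As a reminder of what such properties look like, the following Python sketch numerically checks a few standard textbook facts about the essential matrix E = [t]x R (zero determinant, two equal singular values and one zero, the cubic identity, and the epipolar constraint). The example motion and all names are assumptions chosen for illustration.

```python
import numpy as np

def skew(t):
    """Cross-product matrix [t]_x, so that skew(t) @ v == np.cross(t, v)."""
    return np.array([[0.0, -t[2], t[1]],
                     [t[2], 0.0, -t[0]],
                     [-t[1], t[0], 0.0]])

def essential_matrix(R, t):
    """E = [t]_x R for the rigid motion X2 = R @ X1 + t."""
    return skew(t) @ R

# Example motion (assumed): rotation about the y axis plus a translation.
angle = 0.2
R = np.array([[np.cos(angle), 0.0, np.sin(angle)],
              [0.0, 1.0, 0.0],
              [-np.sin(angle), 0.0, np.cos(angle)]])
t = np.array([0.5, -0.1, 0.2])
E = essential_matrix(R, t)

# Singular values should be (|t|, |t|, 0).
singular_values = np.linalg.svd(E, compute_uv=False)

# Cubic identity: 2 E E^T E - trace(E E^T) E = 0.
cubic_residual = 2.0 * E @ E.T @ E - np.trace(E @ E.T) * E

# Epipolar constraint X2^T E X1 = 0 for a point seen before and after.
X1 = np.array([0.3, -0.4, 5.0])
X2 = R @ X1 + t
epipolar_residual = X2 @ E @ X1
```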

3.3.  Rigid  Motion  and  Structure  from  Long 
Image  Sequences 

In  this  area,  we  are  concerned  with  long  sequences  of 
images  of  a  dynamic  scene.  We  are  interested  in  seg¬ 
mentation  of  the  sequence  into  distinctly  moving  ob¬ 
jects,  as  well  as  in  the  estimation  of  their  motion  and 
structure.  We  will  discuss  two  separate  cases,  according 
to  whether  the  successive  images  in  the  sequence  show 
long  range  motion  or  short  range  motion,  i.e.,  whether 
the  image  sequence  is  sparse  or  dense. 

In the first category, we have developed a new model
for vehicle-type motion which assumes that the motion
is  a  rotation  around  an  axis  through  the  vehicle  cen¬ 
ter  followed  by  a  forward  translation  along  the  main 
axis  of  the  vehicle.  It  helps  greatly  to  have  a  motion 
model which includes only a small number of parameters
which can be assumed constant over many image
frames.  For  the  chosen  model,  we  have  established  the 
following  results:  (1)  When  the  rotation  and  the  ampli¬ 
tude  of  translation  are  constant,  this  type  of  motion  is 
equivalent  to  a  constant  camera-centered  motion.  This 
indicates  that  a  constant  motion  in  the  conventional 
camera-centered model, which is commonly considered
artificial,  can  in  fact  be  a  reasonable  model  in  real  life. 
(2)  A  constant  vehicle-type  motion  can  be  interpreted 
as  a  constant  screw  motion.  A  linear  algorithm  for  es¬ 
timating  constant  vehicle-type  motion  has  been  devel¬ 
oped  and  applied  to  a  sequence  of  images  of  an  outdoor 
scene containing a moving truck. The results are far superior
to those from two-view methods [47]. This work
is discussed in a separate paper in these proceedings [66].
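
Result (1) can be checked numerically. The sketch below simulates a vehicle that, at every frame, rotates by a fixed rotation R about an axis through its center and then advances a fixed distance a along its (rotated) main axis, and records the equivalent camera-centered translation T_k in X_{k+1} = R X_k + T_k. The reading of the model (rotation first, then translation along the rotated axis, with the camera frame taken as the fixed frame) and all names are assumptions of this example, not the formulation of [66].

```python
import numpy as np

def rotation_about_axis(axis, angle):
    """Rotation matrix from Rodrigues' formula."""
    k = np.asarray(axis, dtype=float)
    k = k / np.linalg.norm(k)
    K = np.array([[0.0, -k[2], k[1]],
                  [k[2], 0.0, -k[0]],
                  [-k[1], k[0], 0.0]])
    return np.eye(3) + np.sin(angle) * K + (1.0 - np.cos(angle)) * (K @ K)

def camera_centered_translations(R, a, center, heading, steps):
    """Simulate vehicle-type motion; return T_k with X_{k+1} = R X_k + T_k."""
    c = np.asarray(center, dtype=float)
    d = np.asarray(heading, dtype=float)
    translations = []
    for _ in range(steps):
        d_next = R @ d              # main axis rotates with the vehicle
        c_next = c + a * d_next     # then the vehicle moves forward
        # A body point X maps to R (X - c) + c_next = R X + (c_next - R c).
        translations.append(c_next - R @ c)
        c, d = c_next, d_next
    return np.array(translations)

R_demo = rotation_about_axis([0.0, 1.0, 0.0], 0.1)
T_demo = camera_centered_translations(
    R_demo, a=0.5, center=[1.0, 0.0, 4.0], heading=[0.0, 0.0, 1.0], steps=6)
```

Every row of `T_demo` is the same vector, so under these assumptions the constant vehicle-type motion is indeed a constant (R, T) in the camera-centered model.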

Another  algorithm  that  we  have  developed  uses  the 
model  that  the  rotational  velocity  is  constant  and  the 
rotation  center  arbitrary.  It  estimates  10  parameters 
for  motion  and  structure  of  a  rigid  planar  patch  given 
point  correspondences  in  a  monocular  image  sequence 
under  perspective  projection.  The  algorithm  mainly 
consists of two steps. First, the 3D space of $(\omega_x, \omega_y, \omega_z)$
is searched exhaustively but coarsely to minimize an
objective function. For each selected $(\omega_x, \omega_y, \omega_z)$, we
linearly compute all the other parameters. Some of the
$(\omega_x, \omega_y, \omega_z)$ and the corresponding structure values are
used  as  the  initial  guesses  in  the  second  step  to  carry 
out  a  fine  search  to  iteratively  minimize  the  objective 
function  with  respect  to  the  five  variables  for  rotation 
and  structure.  The  solution  corresponding  to  the  global 
minimum  is  used  to  obtain  least  squares  estimates  of  the 
remaining  unknowns  -  translation  and  rotation  center. 
We  have  experimentally  found  that  the  objective  func¬ 
tion  converges  well  so  that  we  do  not  have  to  search  the 
3D  space  densely  in  the  first  step.  Results  are  presented 
for  three  image  sequences,  two  simulated  and  one  real 
[28]. 

In  the  second  category,  we  have  developed  an  algo¬ 
rithm  which,  given  a  dense  temporal  sequence  of  in¬ 
tensity  images  of  multiple  moving  objects,  will  separate 
the  images  into  regions  showing  distinct  objects,  and 
for  those  objects  which  are  rotating,  will  calculate  the 
three-dimensional  structure  and  motion  [13]  using  fac¬ 
torization  into  motion  and  structure.  The  algorithm 
consists  of  two  major  steps:  finding  and  tracking  fea¬ 
ture  points  on  the  objects  in  the  images,  and  deter¬ 
mining  the  subsets  of  trajectories,  motion,  and  struc¬ 
ture  corresponding  to  the  different  objects.  The  feature 
tracking  algorithm  can  track  features  across  frames  in 
which  the  feature  is  occluded  or  the  feature  detector 
response  is  weak.  The  trajectories  are  partitioned  into 
groups corresponding to the different objects by fitting
the trajectories from each group to a hierarchy of increasingly
complex motion models in a coarse-to-fine
manner. Further details of this work are presented in a
separate  paper  in  these  proceedings  [32]  and  in  [34]. 
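
The summary does not say which factorization is used in [13]. To illustrate the general idea of factoring tracked feature trajectories into motion and structure, the sketch below uses the well-known SVD factorization under orthographic projection as a stand-in. The measurement-matrix layout, the function name `factorize`, and the synthetic rotating object are assumptions, and the result is only defined up to an invertible 3x3 transformation.

```python
import numpy as np

def factorize(W):
    """Factor a (2F, P) matrix of tracked image coordinates into
    motion M (2F, 3) and structure S (3, P), up to a 3x3 ambiguity.

    Also returns the singular values of the centered matrix, which
    show whether the trajectories are consistent with one rigid object.
    """
    W_centered = W - W.mean(axis=1, keepdims=True)
    U, s, Vt = np.linalg.svd(W_centered, full_matrices=False)
    root = np.sqrt(s[:3])
    M = U[:, :3] * root            # scale the three leading columns
    S = root[:, None] * Vt[:3]     # scale the three leading rows
    return M, S, s

# Synthetic data: 20 rigid points rotating about the y axis, 8 frames,
# projected orthographically (keep x and y, drop z).
rng = np.random.default_rng(0)
points = rng.normal(size=(3, 20))
rows = []
for f in range(8):
    a = 0.1 * f
    R = np.array([[np.cos(a), 0.0, np.sin(a)],
                  [0.0, 1.0, 0.0],
                  [-np.sin(a), 0.0, np.cos(a)]])
    rows.append((R @ points)[:2])
W = np.vstack(rows)
M, S, s = factorize(W)
```

For trajectories from a single rigid object the fourth singular value is essentially zero, which suggests one way a group of trajectories could be tested against a rigid-motion model; whether [13] uses such a test is not stated in the text.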

3.4.  Nonrigid  Motion 

We  have  worked  on  several  problems  that  involve  in¬ 
terpretation  of  image  sequences  containing  nonrigidly 
moving  objects  such  as  deformable  objects,  articulated 
objects  and  fluids. 

We  are  investigating  the  use  of  computer  vision  tech¬ 
niques  in  the  nondestructive  evaluation  of  structural 
damage.  The  structure  (a  bridge,  for  example)  under 
study  is  loaded  statically  or  dynamically,  and  the  dis¬ 
placements  and  movements  of  its  features  are  observed 
(by  several  CCD  and  TV  cameras).  These  measured 
data are combined with a priori knowledge of structural
geometry to estimate structural parameters such
as  stiffness.  Reliable  feature  matching  and  measure¬ 
ments  to  subpixel  accuracy  are  critical  to  the  success 
of  our  method.  We  are  currently  experimenting  with  a 
small  two-story  frame  model  to  study  the  feasibility  of 
our  approach  [39]. 

In  many  scenarios,  robots  and  human  bodies  can  be 
modeled  as  articulated  objects.  We  have  developed 
three algorithms under the assumption of perspective
projection for the recovery of 3D structure from the motion
of several types of joints of articulated objects, viz.
joints which allow only fixed-axis rotation and joints
which allow only planar rotation [68]. We have then
applied these algorithms in combination to the analysis of
human ambulatory motion. Our numerical
experimental results with synthesized data have indicated
that  the  real  solution  from  the  proposed  algorithm  was 
always unique up to a reflection and the time of convergence
was reasonable. Details of these algorithms are
presented  in  a  separate  paper  in  these  proceedings  [67]. 

We  are  investigating  the  problem  of  interpolating,  un¬ 
der  physical  constraints,  3D  vector  fields  from  sample 
vectors  at  irregular  positions.  This  problem  arises  from 
analysis  of  fluid  motion,  but  our  results  can  also  be  used 
in  such  areas  as  geometric  modeling,  approximation  the¬ 
ory,  and  other  types  of  nonrigid  body  motion.  Our  algo¬ 
rithm  combines  the  generalized  multivariate  quadratic 
interpolation  and  physical  constraints  into  one  step  to 
form  an  over-determined  linear  equation  system  [55]. 
The  least  squares  solution  of  this  system  gives  the  coef¬ 
ficients  of  interpolation.  Since  the  interpolation  is  done 
in  one  step,  it  is  non-iterative  and  efficient  in  compu¬ 
tation.  We  utilize  methods  in  robust  statistics  to  de¬ 
tect  outliers  in  the  sample  data  so  that  the  results  are 
more  stable  in  the  presence  of  gross  errors.  Another 
merit  of  our  scheme  is  that  by  incorporating  physical 
constraints into the linear equation system, the algorithm
takes the characteristics of the vector field into account and
is  much  less  sensitive  to  noise.  The  algorithm  is  applied 
to  both  synthesized  and  empirically  measured  3D  vector 
fields. With the application to 3D fluid flow in mind, we
study the applicability of physical constraints according
to the kinematics of fluid and analyze the sources of
noise  from  the  real  data  acquisition  setup.  A  compari¬ 
son  of  our  algorithm  with  previous  work  under  different 
noise  levels  shows  the  robustness  of  our  algorithm. 
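
The one-step idea of stacking interpolation equations and physical constraints into a single over-determined linear system can be sketched as follows. To stay short, this example fits an affine field v(x) = A x + b rather than the generalized quadratic basis of [55], and it uses incompressibility (zero divergence, i.e. trace(A) = 0) as the physical constraint. Both choices, the constraint weight, and all names are assumptions; the robust outlier detection step is omitted.

```python
import numpy as np

def fit_affine_field(positions, vectors, constraint_weight=10.0):
    """Least-squares fit of v(x) = A x + b with a soft trace(A) = 0 row.

    positions, vectors: (N, 3) arrays of sample locations and sample
    vectors. Unknowns are the 9 entries of A (row-major) followed by b.
    """
    X = np.asarray(positions, dtype=float)
    V = np.asarray(vectors, dtype=float)
    rows, rhs = [], []
    # Interpolation equations: one row per vector component per sample.
    for x, v in zip(X, V):
        for i in range(3):
            row = np.zeros(12)
            row[3 * i:3 * i + 3] = x
            row[9 + i] = 1.0
            rows.append(row)
            rhs.append(v[i])
    # Physical constraint: divergence of an affine field is trace(A).
    row = np.zeros(12)
    row[[0, 4, 8]] = constraint_weight
    rows.append(row)
    rhs.append(0.0)
    coeffs, *_ = np.linalg.lstsq(np.array(rows), np.array(rhs), rcond=None)
    return coeffs[:9].reshape(3, 3), coeffs[9:]

# Synthetic divergence-free field sampled at irregular positions.
rng = np.random.default_rng(1)
A_true = np.array([[0.2, -1.0, 0.0],
                   [1.0, 0.1, 0.3],
                   [0.0, -0.3, -0.3]])
b_true = np.array([0.5, 0.0, -0.2])
X_demo = rng.uniform(-1.0, 1.0, size=(30, 3))
V_demo = X_demo @ A_true.T + b_true
A_fit, b_fit = fit_affine_field(X_demo, V_demo)
```

Because the constraint row is appended to the same system, no iteration between fitting and constraint enforcement is needed, which matches the non-iterative character described above.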

4.  Recovery  Based  3D- Visualization 

We  have  started  work  on  using  the  information  ex¬ 
tracted  during  3D  interpretation  to  perform  image  syn¬ 
thesis  for  the  visualization  of  the  original  scene.  One 
such  effort  is  concerned  with  identification  and  depic¬ 
tion  of  image  attributes  such  that  the  display  commu¬ 
nicates  effectively  the  3D  scene  structure  as  seen  by  an 
observer  in  relative  motion  to  the  scene. 

4.1.  Flight  Images  Sequences 

We  have  addressed  the  problem  of  visualizing  motion 
of an observer relative to a scene. Our work in this
area  has  two  objectives.  First,  it  addresses  the  problem 
of  recovering  motion  and  structure  parameters  from  a 
monocular image sequence. Second, it uses the intermediate
outputs of the recovery procedure to synthesize
an  image  sequence  that  depicts  the  estimated  motion 
and  structure.  A  key  feature  of  the  approach  presented 
is  an  integrated  use  of  multiple  image  attributes  which 
are  shared  by  both  estimation  and  visualization  pro¬ 
cesses.  We  focus  on  flight  image  sequences,  i.e.,  im¬ 
age  sequences  acquired  by  an  observer  moving  smoothly 
over  a  planar,  textured  surface.  The  approach  presented 
allows  the  use  of  image  cues  such  as  regions,  point  fea¬ 
tures,  optical  flow,  texture  gradient  and  vanishing  line. 
The  integration  of  information  in  these  diverse  cues  is 
carried  out  using  optimization.  Visualization  is  done 
using  the  image  attributes  extracted  from  the  image 
sequence as well as artificial attributes. Experiments
have been conducted with a real image sequence digitized
from a commercially available video tape. We
have  produced  a  video  tape  showing  the  visualization 
sequence.  This  work  is  described  in  a  separate  paper  in 
these proceedings [14] and in [15].

4.2.  Heart  Images  Sequence 

In  another  project  on  3D  estimation  and  visualization, 
we are investigating techniques of estimating the motion
and deformations of the left ventricle from 3D data
based on the hierarchical decomposition of the motion
into  two  parts:  global  movement  and  local  movement 
[59].  The  global  movement  is  further  decomposed  into 
global  rigid  motion  and  global  nonrigid  motion  while  the 
local  movement  is  decomposed  into  local  rigid  motion 
and  local  deformation.  Such  a  decomposition  enables  us 
to  devise  a  coarse-to-fine  modeling  of  the  left  ventricle 
dynamics  according  to  a  priori  knowledge  of  the  heart 
motion  patterns.  We  then  formulate  the  motion  and 
deformation  analysis  of  the  left  ventricle  as  a  series  of 
estimation  processes  which  correspond  to  these  decom¬ 
positions.  Our  algorithms  are  based  on  the  3D  data 
points derived from biplane angiogram sequences. We
first estimate the global motion by finding the center-of-contraction
and the principal axes of the heart. We then
fit  the  3D  data  to  superquadrics.  The  parameters  of 
the  modeling  primitive  are  directly  related  to  the  global 
nonrigid  motions,  such  as  expansion  or  contraction  and 
twisting  deformation.  The  local  movement  of  the  left 
ventricle  is  estimated  via  a  tensor  based  approach  using 
the  localized  surfaces  obtained  through  spherical  har¬ 
monic  interpolation  of  the  residual  distances  between 
the  data  points  and  the  fitted  global  superquadric  sur¬ 
face.  The  estimation  results  of  the  left  ventricle  motion 
and  deformation  are  then  visualized  through  scientific 
visualization  techniques  which  allow  the  estimated  mo¬ 
tions  to  be  perceived  interactively.  We  have  produced  a 
video  tape  containing  some  of  the  visualization  results. 
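
For readers unfamiliar with superquadrics, the sketch below evaluates the standard inside-outside function of a superquadric in its canonical pose, which is the quantity a fitting procedure would typically drive toward 1 at the data points. It omits pose and the global deformations (such as twisting) mentioned above, and it is not the fitting procedure of [59]; the function name and parameter names (a1, a2, a3 for size, e1, e2 for shape) are assumptions following common usage.

```python
import numpy as np

def superquadric_inside_outside(points, a1, a2, a3, e1, e2):
    """Inside-outside function F of a canonical superquadric.

    F < 1 inside the surface, F == 1 on it, F > 1 outside.
    points: (N, 3) array of 3D points in the superquadric's own frame.
    """
    p = np.atleast_2d(np.asarray(points, dtype=float))
    x = np.abs(p[:, 0]) / a1
    y = np.abs(p[:, 1]) / a2
    z = np.abs(p[:, 2]) / a3
    return (x ** (2.0 / e2) + y ** (2.0 / e2)) ** (e2 / e1) + z ** (2.0 / e1)

# With e1 = e2 = 1 the superquadric is an ellipsoid with semi-axes a1, a2, a3.
demo_points = np.array([[2.0, 0.0, 0.0],   # on the surface
                        [0.0, 0.0, 0.0],   # at the center
                        [4.0, 0.0, 0.0]])  # outside
F_demo = superquadric_inside_outside(demo_points, 2.0, 1.0, 1.0, 1.0, 1.0)
```

The residuals between the data points and the fitted surface are what the local, spherical-harmonic stage described above would then model.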

5. Representation and Navigation

There  are  two  goals  of  our  work  in  this  area.  First,  we 
are  interested  in  efficient  representation  of  the  surface 
shape  information  acquired  as  described  in  the  previous 
sections.  Second,  we  are  interested  in  using  the  scene 
representation  for  path  planning.  Our  work  in  both 
these  areas  has  a  common  theme:  the  use  of  potential 
field  as  the  underlying  model. 

5.1.  Efficient  Shape  Representation  using  a 
Potential  Field  Model 

We  have  worked  on  efficient  derivation  of  the  medial  axis 
transform  and  the  generalized  cylinder  representations 
of  a  two-dimensional  region  [17].  Instead  of  using  the 
shortest  distance  to  the  region  border,  a  potential  field 
model  is  used  for  computational  efficiency.  The  region 
border is assumed to be charged and the valleys of the
resulting potential field are used to obtain preliminary
estimates of the axes for the two representations. The
potential valleys are found by following the force field, thus
avoiding  two-dimensional  search.  First,  the  approach  is 
presented  to  obtain  the  representation  of  a  polygonal  re¬ 
gion.  Closed  form  computation  of  the  potential  field  is 
described  using  the  equations  of  the  border  segments. 
The  simple  Newtonian  potential  is  shown  to  be  inad¬ 
equate  for  deriving  accurate  representation.  A  higher 
order potential is defined which decays faster with distance
than the inverse of distance. It is shown that as
the  potential  order  becomes  arbitrarily  large,  the  axes 
approach  those  computed  using  the  shortest  distance  to 
the  border.  An  algorithm  is  presented  to  efficiently  com¬ 
pute  the  medial  axis.  This  algorithm  is  then  modified 
to  obtain  the  generalized  cylinder  axis  of  a  polygonal 
region.  These  algorithms  for  polygonal  regions  are  used 
to  perform  a  multiresolution,  coarse-to-fine  computa¬ 
tion  of  the  medial  axis  and  generalized  cylinder  axes  of 
arbitrary  shaped  regions. 
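
The limiting behavior of the higher order potential can be illustrated numerically. The sketch below replaces the closed-form potential of charged border segments used in [17] with a crude sum over point charges sampled along the border, and maps the potential back to a length so it can be compared with the shortest distance. The sampling density, the function names, and the test point are assumptions of this example.

```python
import numpy as np

def border_potential(point, border_samples, order=1):
    """Sum of 1 / r**order over point charges sampled on the border."""
    diffs = np.asarray(border_samples, dtype=float) - np.asarray(point, dtype=float)
    r = np.linalg.norm(diffs, axis=1)
    return np.sum(1.0 / r ** order)

def effective_distance(point, border_samples, order):
    """Potential converted to a length: (sum of r**-order) ** (-1 / order)."""
    return border_potential(point, border_samples, order) ** (-1.0 / order)

# Border of the unit square, sampled every 0.01 along each side.
t = np.linspace(0.0, 1.0, 101)
zeros, ones = np.zeros_like(t), np.ones_like(t)
border = np.vstack([
    np.column_stack([t, zeros]), np.column_stack([t, ones]),
    np.column_stack([zeros, t]), np.column_stack([ones, t]),
])

query = np.array([0.3, 0.5])
shortest = np.min(np.linalg.norm(border - query, axis=1))
low_order = effective_distance(query, border, order=1)
high_order = effective_distance(query, border, order=60)
```

With order 1 the value is dominated by the total amount of charge and bears little relation to the shortest distance, while for a large order the nearest part of the border dominates and the value approaches the shortest distance, in line with the limit stated above.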

5.2.  Path  Planning  using  a  Potential  Field 
Model 

We  have  focused  on  the  problem  of  path  planning,  i.e., 
deriving an efficient and collision free trajectory to move
an object from a given source location/orientation to a
given destination location/orientation through an environment
whose occupancy map is given by the locations
and  shapes  of  the  various  objects.  We  have  surveyed  the 
state  of  the  art  approaches  to  autonomous  path  plan¬ 
ning  among  obstacles  in  [10].  The  problem  is  divided 
into  two  stages.  First,  a  candidate  topological  path  is 
selected.  Second,  the  candidate  path  is  cost-optimized 
to derive the final path and orientations of the moving
object.  In  earlier  work,  we  had  used  a  simple  poten¬ 
tial  field  based  model  [37,  5]  to  represent  the  free  space. 
We  have  now  begun  work  with  a  more  realistic,  New¬ 
tonian  potential  function  which  also  improves  the  com¬ 
putational  efficiency.  We  have  developed  a  Newtonian 
potential  field  based  approach  to  path  planning  in  which 
the  design  of  an  individual  path  segment  is  treated  as 
a  local  problem  in  that  the  location  and  orientation  of 
the  object  along  the  segment  are  determined  only  by 
the  nearby  obstacles  [26,  25].  In  particular,  those  parts 
of  free  space  closely  surrounded  by  obstacles  where  the 
object  has  to  tightly  maneuver  through  the  obstacles 
are  viewed  as  important  components  of  the  topological 
structure  of  the  free  space.  These  parts  comprise  bot¬ 
tlenecks  for  object  motion.  Between  these  bottleneck 
regions,  the  object  motion  is  much  more  free.  Thus, 
the  top  level  planning  determines  an  order  in  which  the 
bottlenecks  must  be  traversed  for  the  object  to  move 
from  source  to  destination.  This  decomposes  the  path 
planning  problem  into  two  sets  of  subproblems.  Ma¬ 
neuvering  through  each  bottleneck  comprises  one  set  of 
subproblems.  Traversal  of  free  space  between  the  bot¬ 
tlenecks  defines  the  second  set  of  subproblems.  The 
solutions  of  the  second  set  of  subproblems  define  path 
segments  that  link  the  solution  paths  which  are  obtained 
by  solving  the  subproblems  of  the  first  set,  thus  yield¬ 
ing  a  global  solution.  For  the  2D  problems,  free  space 
bottlenecks  are  defined  by  the  minimal  distance  links 
among  polygonal  obstacles,  and  object  shape  is  repre¬ 
sented  by  its  skeleton.  Both  of  these  representations  can 
be  computed  easily.  The  likelihood  of  collision  between 
the  object  to  be  moved  and  the  obstacles  is  measured 
through  the  Newtonian  potential  model  wherein  each 
object/obstacle  region  border  is  assumed  to  be  charged. 
In  the  local  problem  for  a  given  bottleneck,  the  exact 
location  and  orientation  of  the  object,  as  the  succes¬ 
sive  skeleton  points  enter  the  bottleneck,  are  determined 
so  that  the  force  and  torque  experienced  by  the  object 
are  minimized.  Such  repulsive  forces  and  torques  be¬ 
tween  polygonal  regions  can  be  computed  analytically. 
Global  planning  is  used  to  link  the  solutions  of  the  local 
problems at adjacent bottlenecks into a global solution.
Once  the  object  exits  a  bottleneck,  it  is  pulled  by  the 
next  minimal  distance  link  which  is  achieved  by  forcing 
the  object  to  reduce  its  distance  to  the  link.  As  the  ob¬ 
ject  is  being  pulled,  the  location  and  orientation  of  the 
object  are  changed  to  minimize  the  force  and  torque 
experienced.  Thus,  the  object  is  forced  to  follow  the 
path of minimal likelihood of collision between bottlenecks.
Because the distance to the next link decreases
monotonically,  a  solution  is  guaranteed  to  be  found  if 
one  exists.  Due  to  such  mixed  use  of  geometry  and  po¬ 
tential  field  constraints,  the  local  minima  of  potential 
fields do not present a problem.

1 Abstract

In  this  paper  we  present  a  framework  for  research  into 
the development of an Active Observer. The components
of such an observer are the low and intermediate
visual processing modules. Some of these modules
have  been  adapted  from  the  community  and  some  have 
been  investigated  in  the  GRASP  laboratory,  most  no¬ 
tably  modules  for  the  understanding  of  surface  reflec¬ 
tions  via  color  and  multiple  views  and  for  the  segmen¬ 
tation  of  three  dimensional  images  into  first  or  second 
order  surfaces  via  superquadric/parametric  volumetric 
models.  However  the  key  problem  in  Active  Observer 
research  is  the  control  structure  of  its  behavior  based 
on  the  task  and  the  situation.  This  control  structure  is 
modeled  by  a  formalism  called  Discrete  Events  Dynamic 
Systems  (DEDS). 

2  Introduction 

We  are  interested  in  the  development  of  an  Active  Ob¬ 
server.  An  Active  Observer  is  an  agent  which  has  capa¬ 
bilities  to  observe  scenes,  objects,  situations  and  deliver 
the  observed  information  to  human,  manipulatory,  and 
mobile  agents.  Naturally  there  are  more  questions  than 
answers.  We  shall  list  a  few  which  are  of  particular  in¬ 
terest  to  us.  What  are  the  components/modules  that 
such  an  observer  must  have?  How  are  these  components 
interconnected,  i.e.  what  is  the  architecture  of  such  an 
agent?  Some  of  the  modules  correspond  to  certain  vi¬ 
sual  cues.  We  take  as  a  given  that  our  observer  has 
several  such  cues.  In  that  case,  the  subsequent  ques¬ 
tion  is  how  are  the  results  from  these  cues  integrated? 
When  are  they  invoked?  How  is  the  selection  process 
conducted/guided?  Which  cue  is  employed  and  when? 
Finally,  what  kind  of  information/messages  is  delivered 
by  the  observer  to  other  agents? 

Towards  this  end,  for  the  last  two  years  we  have  con¬ 
centrated  on  the  development  of  theoretical  and  experi¬ 
mental  understanding  of  some  of  the  cues/components, 
some cues’ integration and selection, and control strategies
for observation capability. In particular, in cue development
we have tried to understand surface reflections
by color and multiple views. An important finding
of  this  work,  which  will  be  described  in  detail  in  Section 
2,  is  that  multiple  view  points  provide  useful  information 
for  discriminating  between  specular  and  Lambertian  re¬ 
flections  both  from  dielectrics  and  from  metals.  In  Sec¬ 
tion  3,  we  shall  describe  a  system  for  the  segmentation 
of a three dimensional scene into components that can
be  modeled  by  superquadric  parametric  fit.  This  system 
uses,  in  cooperation,  surface  segmentation,  contour  seg¬ 
mentation  and  gross  volumetric  segmentation  in  order  to 
arrive  at  the  proper  result.  The  scenes  are  of  moderate 
complexity  (up  to  10  parts),  but  no  other  assumptions 
are  made  about  objects  or  their  parts.  This  work  points 
to  the  common  fact  that  one  module  or  cue  or  approach 
cannot  handle  the  perceptual  variety  of  the  data  that 
the  real  world,  even  in  moderate  complexity,  represents. 
Multiple  cues  are  necessary  and  hence  a  great  deal  of 
thought  has  to  go  into  the  integration  policy  and  control 
structure.  In  Section  4,  we  present  a  formal  model  of 
an  observer  agent.  This  model  is  based  on  the  theory  of 
Discrete  Event  Dynamic  Systems  (DEDS),  which  allows 
us  to  unequivocally  predict  the  observation  capabilities 
of  an  observer.  In  order  for  this  to  occur,  the  observer 
must know the discrete events of the task. So far this
is  done  by  the  designer.  Finally,  in  Section  5  we  show 
the  recent  development  of  a  CCD  chip  (the  Retina)  with 
space  variant  resolution.  Details  are  described  in  this 
section. 

3 Understanding of Reflection
Properties Using Color and Multiple
Views 

Recently  there  has  been  a  growing  interest  in  the  detec¬ 
tion  of  specularity  in  both  basic  and  applied  computer 
vision research. In general, the detection of specularities
from a single gray-level image is a physically underconstrained
problem, and more information needs to be
collected  in  physically  sensible  ways  to  solve  the  prob¬ 
lem.  Successful  development  of  an  algorithm  for  image 
data  collection  and  interpretation  necessarily  depends 
on  physical  models  that  describe  how  surfaces  appear 
according  to  the  illumination  and  reflectance  properties 
and  sensor  characteristics.  Recently  the  computer  vi¬ 
sion  field  has  increasingly  incorporated  methodologies 
derived from physical principles of image formation and
sensing [7]. So far there have been three types of approaches
to solving the problem of specularity detection
through the collection of more images: (1) with different
light directions, (2) with different sensor polarization
angles,  and  (3)  with  different  color  sensors. 

The  photometric-stereo-type  approaches  consider  the 
specular  and  Lambertian  reflectance  properties  for  ob¬ 
taining  object  shape  using  more  than  two  light  directions 
[4]  [9]  [11].  Since  the  direction  and  the  degree  of  the 
collimation  of  the  illumination  need  to  be  strictly  con¬ 
trolled,  application  of  the  approach  is  restricted  to  dark¬ 
room  environments.  The  polarization  method  analyzes 
the  polarization  of  reflected  light  and  detects  specular- 
ities  from  dielectrics  and  metals  [12].  The  polarization 
approach  places  some  restrictions  on  the  incident  illumi¬ 
nation  direction  with  respect  to  surface  orientation. 

The  dichromatic  model  [10]  proposed  by  Shafer  has 
been  the  key  model  to  the  recent  specularity  detection 
algorithms  using  color  [8]  [5]  [6]  [3].  The  basic  limita¬ 
tion  of  the  color  algorithms  is  that  objects  must  be  only 
colored  dielectrics  to  use  the  dichromatic  model.  For 
color  image  segmentation,  it  is  usually  assumed  that  ob¬ 
ject  surface  reflectance  is  spatially  piecewise  uniform  in 
color  and  that  scene  illumination  is  singly  colored.  We 
have  previously  developed  a  color  image  segmentation 
algorithm  for  the  separation  of  diffuse,  as  well  as  sharp, 
specularities  and  inter-reflections  from  Lambertian  re¬ 
flections  [3]. 

Our  recent  research  has  focused  on  the  development 
of  some  specularity  detection  or  separation  methods  that 
only  require  modification  of  sensors  but  not  any  modifi¬ 
cation  of  environments.  In  other  words,  they  are  meth¬ 
ods  that  are  active  in  modifying  sensors  but  passive  in 
modifying  environments.  There  are  two  kinds  of  modifi¬ 
cation  of  environments:  relocation  and  re-orientation  of 
objects by robot manipulation, and illumination change.
The  prime  example  of  the  illumination  change  is  the 
light  switching  for  the  photometric-stereo-type  meth¬ 
ods.  Since  illumination  lighting  needs  to  be  strictly  con¬ 
trolled,  the  photometric-stereo-type  approaches  are  ap¬ 
plicable  only  for  inspection  in  dark  rooms. 

Strict  illumination  control  is  not  always  possible  in 
investigating surface reflection properties in many general
environments. Examples include outdoor inspection,
indoor or outdoor navigation, and exploratory environments.
Even for indoor inspection, a well controlled
dark  room  is  not  always  available. 

For  general  environments  without  strict  illumination 
control,  only  sensors  are  controllable,  and  color  and  po¬ 
larization  can  be  the  possible  cues.  Another  possibility 
is  to  move  the  observer,  which  has  not  been  used  for  in¬ 
vestigating  reflection  properties  in  computer  vision.  The 
idea  of  moving  the  observer  was  directly  motivated  by 
the concept of active vision [2]. For low-level vision problems
of shape or structure, it has been demonstrated that
many ill-posed problems become well-posed if more information
is collected by active sensors [1]. Although
the  paradigms  for  shape  or  structure  based  on  feature 
correspondence  cannot  be  directly  applied  to  the  study 
of  reflectance  properties,  the  idea  of  a  moving  observer 
motivated the investigation of new principles by physical
modeling in obtaining more information.

In  this  paper,  we  suggest  the  use  of  multiple  views 
for  the  detection  of  specularity  by  introducing  two  algo¬ 
rithms.  The  first  algorithm,  called  spectral  differencing, 
uses  color  information  from  a  small  number  of  multiple 
views. The second algorithm is called view sampling.
Using many views of gray-level images collected in wide
angle, view sampling reconstructs object structure
and  detects  specularities.  An  important  principle  used 
for  the  algorithms  is  the  Lambertian  consistency,  which 
is  the  well-known  fact  that  the  Lambertian  reflection 
does not change its brightness and spectral content
depending on viewing directions, but the specular reflection
or the mixture of Lambertian and specular reflections
can change.

A  problem  associated  with  the  use  of  multiple  views 
with  color  is  what  kind  of  extra  spectral  information  can 
be  obtained  by  moving  a  color  camera  without  consid¬ 
ering  object  geometry.  If  there  is  any,  it  may  alleviate 
the  limiting  assumptions  imposed  on  the  object  and  illu¬ 
mination  domain  for  the  color  segmentation  approaches, 
and  provide  higher  confidence  in  detecting  specularities. 

The  spectral  differencing  algorithm  is  based  on  the 
observation  that  any  presence  of  specular  reflections  can 
be  inferred  by  the  difference  in  the  distribution  of  pixel 
colors  between  two  color  images.  According  to  the  Lam¬ 
bertian  consistency,  the  color  distribution  of  pixels  from 
only  Lambertian  reflections  should  be  consistent  regard¬ 
less  of  view  points.  On  the  other  hand,  specularities 
or  the  mixture  of  specular  and  Lambertian  reflections 
can  change  the  distribution  of  pixel  colors  between  two 
views. 
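
The comparison of color distributions described above can be sketched as follows. This is a minimal illustration, not the algorithm as implemented by the authors: the function name `spectral_difference`, the Euclidean distance in RGB space, and the `threshold` value are all assumptions made here for the sake of the example. A pixel in one view is marked when no pixel anywhere in the other view has a similar color, so no segmentation or geometric registration is needed.

```python
import numpy as np

def spectral_difference(view_a, view_b, threshold=0.1):
    """Mark pixels of view_a whose color has no close match anywhere in view_b.

    view_a, view_b: float arrays of shape (H, W, 3) with values in [0, 1].
    threshold: assumed color-distance tolerance (hypothetical value).
    Returns a boolean (H, W) mask; True marks a candidate specularity.
    """
    colors_a = view_a.reshape(-1, 3)
    # Only the set of distinct colors in view_b matters, not where they occur.
    colors_b = np.unique(view_b.reshape(-1, 3), axis=0)
    # Brute-force distance from every color in A to every distinct color in B
    # (adequate for small images; a real system would need a faster search).
    dists = np.linalg.norm(colors_a[:, None, :] - colors_b[None, :, :], axis=2)
    return (dists.min(axis=1) > threshold).reshape(view_a.shape[:2])

# Toy scene: a matte red surface; view 0 also contains one white highlight.
matte = np.array([0.8, 0.1, 0.1])
view1 = np.tile(matte, (4, 4, 1))
view0 = view1.copy()
view0[1, 2] = [1.0, 1.0, 1.0]
new_in_view0 = spectral_difference(view0, view1)
new_in_view1 = spectral_difference(view1, view0)
```

Running the comparison in both directions mirrors the two difference images discussed in the text: a highlight that appears in only one view shows up in only one of the two masks.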

The  spectral  differencing  algorithm  does  not  require 
any  assistance  from  image  segmentation  and  geometri¬ 
cal  manipulation.  Since  the  algorithm  does  not  rely  on 
the  segmentation  and  the  dichromatic  model,  it  is  appli¬ 
cable  to  dielectric  objects  with  nonuniform  reflectance 
and metals under multiply colored illumination. Figures
1 and 2 show two dielectric objects with variation
in reflectance and a metallic object with neutral
reflectance color. Two fluorescent light tubes and a tungsten
light bulb are used for illumination and there are
inter-reflections from the walls. MSD(0 ← 1) shows the
regions of new color distribution in view 0 compared to
view 1, and MSD(1 ← 0) the regions of new color distribution
in view 1 compared to view 0. Under multiply
colored  and  extended  illumination,  it  can  be  seen  that 
most  of  the  specularities  are  detected  by  the  spectral 
differencing. 

Another approach we introduce is to obtain reflection
properties  using  only  multiple  views  without  any  color 
information.  With  densely  sampled  views  in  wide  an¬ 
gle  and  with  known  viewing  directions,  the  view  sam¬ 
pling  algorithm  reconstructs  object  structure  as  well  as 
detects  specularities  from  Lambertian  reflections.  The 
view  sampling  algorithm  is  applicable  to  dielectrics  and 
metals. 

If object structure is reconstructed assuming the Lambertian
consistency for both Lambertian and specular
reflections, the structure reconstructed from the specular
reflections  would  not  in  general  represent  the  real  object 



Figure  1:  Spectral  differencing 


Figure 2: Spectral differencing


surface,  while  the  one  reconstructed  from  the  Lamber¬ 
tian  reflections  does.  By  examining  the  differently  recon¬ 
structed  object  structures  from  specular  and  Lambertian 
reflections,  we  can  identify  the  reflection  types  and  the 
real object structure.

We  adopted  an  algorithm  for  computerized  tomogra¬ 
phy  through  photometric  modeling  for  the  reconstruc¬ 
tion  of  object  structure.  Figure  3  shows  the  camera  con¬ 
trol  scheme  and  Figure  4  (a)  shows  4  out  of  30  view  sam¬ 
ples  of  a  gray  dielectric  object  from  different  view  points. 
Figure  4  (c)  and  (d)  show  the  reconstructed  structures 
at  the  cross  sections  1  and  2  illustrated  in  Figure  4  (b), 
respectively.  As  shown  in  Figure  4  (c)  and  (d),  the  struc¬ 
ture  reconstructed  from  specularities  at  the  cross  section 
2  is  different  from  the  real  object  surface  reconstructed 
by  Lambertian  reflections. 

The  future  direction  of  our  studies  is  the  integration  of 
many  cues  in  the  light  of  active  vision  [2].  Active  vision 
involves  not  only  the  modeling  of  physical  sensing  and 
data  processing  for  vision  modules  (local  model),  but 
also  the  control  of  the  modules  (global  model).  Global 
models  characterize  the  overall  performance  and  make 
predictions  on  how  the  individual  modules  will  interact, 
which  in  turn  determines  how  intermediate  results  are 
combined.  It  is  the  global  model  that  analyzes  and  com¬ 
bines  the  information  from  many  visual  cues  to  assign 
stable  descriptors.  For  more  stable  descriptions  of  re¬ 
flection  properties  in  more  general  environments,  it  is 
desirable  to  extract  extra  information  from  a  synergistic 
combination  of  multiple  cues.  The  spectral  differencing 
algorithm  demonstrates  the  synergy  from  the  combina¬ 
tion  of  color  and  multiple  views.  There  are  also  poten¬ 
tials  for  extra  information  from  the  combination  of  color, 
polarization  and  multiple  views. 


Figure  4:  View  sampling 
4  Surface  and  Volumetric  Segmentation
of  Complex  3-D  Objects  Using 
Parametric  Shape  Models 

The  problem  of  part  definition,  description,  and  decom¬ 
position  is  central  to  shape  recognition  systems.  In  this 
paper  we  present  an  integrated  framework  for  segment¬ 
ing  dense  range  data  of  complex  3-D  scenes  into  their 
constituent  parts  in  terms  of  surface  (bi-quadrics)  and 
volumetric  (superquadrics)  primitives,  without  a  priori 
domain  knowledge  or  stored  models.  Our  objective  is 
to  recover  a  structured  description  of  complex  3-D  ob¬ 
jects,  guided  entirely  by  the  geometric  properties  of  the 
shape  models.  The  resulting  decomposition  into  parts 
is  very  useful  for  the  high-level  processes,  which  can  at¬ 
tach  domain  specific  labels  to  the  parts,  and  reason  at  a 
level  where  the  visual  input  is  structured  in  terms  of  ge¬ 
ometric  primitives,  rather  than  cope  with  the  difficulties 
of  low-level  vision  and  a  huge  amount  of  unstructured 
data. 

Since  the  shapes  have  to  be  recovered  from  raw  data, 
it  is  not  possible  to  invoke  complex  models  (models  with 
hundreds  of  degrees  of  freedom)  straight  away.  It  is, 
however,  feasible  and  perceptually  less  ambiguous  to  use 
simpler  but  powerful  models  that  can  capture  the  local 
and  global  properties  of  the  object  shapes,  and  provide  a 
first  approximation  to  the  more  complex  models.  With 
computability,  simplicity,  and  the  utility  of  the  shape 
representation  as  our  major  concerns,  we  use  bi-quadrics 
and  superquadrics  as  our  surface  and  volumetric  models 
respectively.  We  develop  SUPERSEG  (SUPERquadric 
SEGmentation),  a  control  structure  to  effectively  carry 
out the decomposition of complex objects in range images,
and address the numerous issues encountered in a
data-driven  bottom-up  approach  [13;  14;  15]. 

The SUPERSEG system (Figure 5) has five major components:
namely,  the  bi-quadric  surface  segmentation  module;  the 
module  for  extracting  surface  properties  and  adjacency 
relationships;  the  superquadric  model  recovery  module; 
the  residual  generation  and  analysis  module;  and  the 
control  module  for  superquadric-based  segmentation. 

4.1  Surface  Segmentation:  Bi-quadric  Models 

The  surface  segmentation  is  performed  by  a  novel  local- 
to-global  iterative  regression  approach  of  searching  for 
the  best  piecewise  description  of  the  data  in  terms  of 
bi-quadric models [16; 17]. The model-recovery module
consists of independently extrapolating all the seed-regions
and fitting the model using the least-squares regression
method. The region-growing is controlled by
a compatibility-constraint, whose value depends on the
noise due to sensor and quantization, as well as the allowed
tolerance between the shapes of the model and underlying
data. Seed-regions are placed in a grid-pattern
all over the image, and allowed to grow until they are either
completely grown or rejected by the model-selection
procedure (which maximizes a linear benefit-cost function).
Instead of first growing all the regions and then
invoking the model-selection procedure (Recover-then-select),
the model-recovery and model-selection processes
are dynamically combined (Recover-and-select) to


Figure  5:  The  SUPERSEG  system:  A  framework  for 
surface  and  volumetric  segmentation. 

achieve  a  computationally  feasible  and  robust  method 
capable  of  rejecting  outliers  and  determining  its  domain 
of  applicability. 
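
The least-squares fitting step and the compatibility test that controls region growing can be sketched as below. This is only an illustration under stated assumptions: the parameterization z = a x^2 + b y^2 + c xy + d x + e y + f is one common form of a bi-quadric (second-order) patch, and the names `fit_biquadric` and `compatible` and the single scalar `tolerance` are hypothetical. The model-selection procedure with its benefit-cost function is not shown.

```python
import numpy as np

def fit_biquadric(x, y, z):
    """Least-squares fit of z = a*x^2 + b*y^2 + c*x*y + d*x + e*y + f.

    x, y, z: 1-D float arrays of range samples in the current region.
    Returns (coefficients [a, b, c, d, e, f], per-point residuals).
    """
    design = np.column_stack([x**2, y**2, x * y, x, y, np.ones_like(x)])
    coeffs, *_ = np.linalg.lstsq(design, z, rcond=None)
    return coeffs, z - design @ coeffs

def compatible(residuals, tolerance):
    """Points whose fit error stays within the assumed tolerance.

    In a region-growing loop, only neighbouring points that pass this
    test would be added to the region before the model is refitted.
    """
    return np.abs(residuals) <= tolerance

# Synthetic patch sampled on a small grid (noise-free for clarity).
gx, gy = np.meshgrid(np.linspace(-1.0, 1.0, 7), np.linspace(-1.0, 1.0, 7))
x, y = gx.ravel(), gy.ravel()
true_coeffs = np.array([0.5, -0.2, 0.1, 1.0, -1.0, 2.0])
z = true_coeffs @ np.vstack([x**2, y**2, x * y, x, y, np.ones_like(x)])
coeffs, residuals = fit_biquadric(x, y, z)
accepted = compatible(residuals, tolerance=1e-6)
```

In practice the tolerance would be set from the sensor and quantization noise, as the text describes, rather than fixed by hand.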

4.1.1  Refining  Surface  Segmentation  & 
Extracting  Surface  properties 
The  bi-quadric  segmentation  achieved  by  the  above 
procedure  needs  refinement  before  it  can  be  used  as  an 
intermediate  segmentation  by  superquadric-based  vol¬ 
ume  segmentation.  Also,  the  coefficients  of  the  second- 
order  surfaces  have  information  about  orientation  and 
surface-type  (convex  or  concave)  inherent  in  them.  The 
orientation  information  is  tremendously  useful  in  align¬ 
ing  the  major  axis  of  cylindrical  superquadric  models. 
Further, due to the compatibility-constraint, regions intersecting
to form surface normal discontinuities (C1)
overlap  in  the  vicinity  of  the  discontinuity,  thereby  local¬ 
izing  it.  We  developed  a  systematic  method  for  tracing 
the  biquadric  intersection  curve,  which  is  used  to  refine 
the  segmentation  as  well  as  to  localize  the  discontinuities 
(edges)  and  to  characterize  them  as  convex  or  concave. 
In  addition,  a  surface  adjacency  graph  (SAG)  is  con¬ 
structed  with  surface  patches  as  nodes  and  discontinuity- 
type  as  edges  between  them.  The  information  extracted 
from  the  bi-quadric  patches  is  used  to  generate  and  test 
hypotheses  by  the  volumetric  segmentation  module. 

4.2  Superquadrics:  Volumetric  Part-Models 

Superquadric  models  are  convex  part-models  (except  the 
bent  models)  that  can  be  recovered  for  a  given  set  of 
3-D points by minimizing a function based on the modified
implicit inside-outside superquadric function [18; 19;
[Figure 5 block labels: Evaluation & Integration (residual analysis for part hypotheses; extrapolation (growth) of part-models); contour/surface residuals; over/underestimated regions; superquadric model evaluation; control module.]

[Figure 6 legend: concave edge; convex edge.]


Figure 6: The NIST object. Top: The range image and its bi-quadric surface segmentation. Center: the C1 (surface
normal) edges marked at the overlapping parts of the surfaces. Following a procedure similar to the intersection
cleaning, the edges are marked as convex or concave and a surface adjacency graph (SAG) is constructed. Bottom:
The three iterations of the global-to-local procedure to extract the part-structure.


15].  This  formulation  enforces  a  minimum  volume  con¬ 
straint  as  well  as  a  surface  constraint,  but  is  incapable 
of  decomposing  the  data  set  if  no  appropriate  convex 
model can be found in the model vocabulary. Thus, the
superquadric  model  recovery  module  is  adequate  only 
for  recovering  an  optimal  model  (if  oriented  correctly) 
given  a  data  set,  but  not  for  segmenting  it.  To  decide 
whether  a  recovered  model  is  adequate  for  the  given  data 
set,  we  have  developed  an  exhaustive  set  of  criteria  com¬ 
prised  of  qualitative  and  quantitative  measures.  Quan¬ 
titative  measures  are  the  normalized  global  deviation  of 
the  model  from  data.  The  deviation  can  be  the  inside- 
outside  function  value,  or  can  be  measured  along  the 
direction  of  the  viewpoint  (Z-residuals  for  a  range  scan¬ 
ner),  or  along  the  direction  of  the  minimum  distance  of 
a point from the model (Euclidean distance). The qualitative
measures are the ‘local’ residuals characterized
by  the  clusters  of  3-D  points  that  are  either  inside  the 
model,  or  on  the  model,  or  outside  the  model.  Both 
qualitative  and  quantitative  measures  are  necessary  for 
complete  evaluation  of  a  recovered  model. 
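
The grouping of points into inside, on, and outside the model can be illustrated with the standard superquadric inside-outside function in the model's own coordinate frame. This sketch uses the unmodified textbook form, not the modified function of [18; 19; 15]; the symbols `a1`, `a2`, `a3` (axis lengths), `e1`, `e2` (shape exponents), and the tolerance `tol` are assumptions introduced here.

```python
import numpy as np

def inside_outside(points, a1, a2, a3, e1, e2):
    """Standard superquadric inside-outside function F.

    points: (N, 3) array in the superquadric's own coordinate frame.
    F < 1 inside the model, F == 1 on its surface, F > 1 outside.
    """
    x, y, z = np.abs(points).T / np.array([[a1], [a2], [a3]])
    xy_term = (x ** (2.0 / e2) + y ** (2.0 / e2)) ** (e2 / e1)
    return xy_term + z ** (2.0 / e1)

def classify(points, a1, a2, a3, e1, e2, tol=1e-3):
    """Label each point 'inside', 'on', or 'outside' the model."""
    f = inside_outside(points, a1, a2, a3, e1, e2)
    return np.where(f < 1.0 - tol, "inside",
                    np.where(f > 1.0 + tol, "outside", "on"))

# With a1 = a2 = a3 = 1 and e1 = e2 = 1 the model is the unit sphere.
samples = np.array([[0.0, 0.0, 0.0], [1.0, 0.0, 0.0], [2.0, 0.0, 0.0]])
values = inside_outside(samples, 1.0, 1.0, 1.0, 1.0, 1.0)
labels = classify(samples, 1.0, 1.0, 1.0, 1.0, 1.0)
```

Clusters of points sharing the 'inside' or 'outside' label would correspond to the over- and underestimated regions that drive the residual analysis.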

4.3  Volumetric  Segmentation:  The  Control
Strategy 

In  view  of  the  fact  that  volumetric  models  don’t  have 
good  surface  support  (as  opposed  to  bi-quadric  models), 
they  cannot  be  recovered  by  following  exclusively  the  ex¬ 
trapolation  method  (local-to-global)  used  by  bi-quadrics. 
In  order  to  obtain  an  optimal  piecewise-convex  volumet¬ 
ric  segmentation,  it  is  necessary  to  proceed  global-to- 
local,  where  data  is  decomposed  only  if  the  global  model 
is inadequate. This allows controlled residual-driven decomposition
of 3-D data, as well as the introduction of objective
evaluation criteria for an acceptable description.
However,  the  global-to-local  method  can  be  aided  by  the 
bi-quadric  segmentation  in  forming  hypotheses  about 
convex combinations of surfaces. Such hypotheses are not
true in general (an L shape, for example), but can significantly
reduce the computational overhead if true for a
particular part. Previous researchers have assumed that
a 1-to-1 mapping exists between surface patches and superquadric
models, which is also not true in general. But
it  does  provide  a  planarity  check  for  the  patches,  as  well 
as  the  orientation  and  shape  of  the  individual  patches  in 
3-space. 

Thus,  a  strategy  that  combines  the  bi-quadric  infor¬ 
mation  with  the  global-to-local  residual-driven  method 
is  most  effective  in  recursively  segmenting  the  scene  to 
derive the part-structure [13]. A set of acceptance criteria
based on the quantitative and qualitative measures
provides the objective evaluation of intermediate descriptions,
and decides whether to terminate the procedure,
or  selectively  refine  the  segmentation,  or  generate  nega¬ 
tive  volume  description.  The  control  module  generates 
hypotheses  about  superquadric  models  at  clusters  of  un¬ 
derestimated  data  and  performs  controlled  extrapolation 
of part-models by shrinking the global model. The recursive
splitting of data results in a hierarchical part-structure
comprising global and local models. The
results of complete processing of the range image of a
machined object (from NIST) are shown in Figure 6.


We  have  tested  the  SUPERSEG  system  on  real  range 
images  of  scenes  of  varying  complexity,  including  objects 
with occluding parts, and scenes where surface segmentation
is not sufficient to guide the volumetric segmentation.
Some of the applications of our approach include
data  reduction,  3-D  object  recognition,  geometric  mod¬ 
eling,  automatic  model  generation,  object  manipulation, 
qualitative  vision,  and  active  vision. 

5  A  Framework  for  Visual  Observation 

In  this  work  we  establish  a  framework  for  the  general 
problem of observation, which may be applied to different
kinds of visual tasks. We define "intelligent"
high-level  control  mechanisms  for  the  observer  in  order 
to  achieve  efficiency  in  recognizing  different  processes 
within  a  specific  dynamic  system.  The  intelligent  ob¬ 
server  is  able  to  recognize  the  visual  tasks,  understands 
the  meaning  of  the  scene  evolution  and  successfully  re¬ 
ports  on  the  current  visual  state.  It  is  obvious  that  there 
is  a  need  for  high-level  interpretation  of  actions  within 
the  environment  and  to  have  guarantees  for  observation 
capabilities  and  stability  within  the  viewing  mechanism. 
The framework is a predictable one that satisfies the following
general requirements:

•  Recognizes  visual  tasks  and  events. 

•  Repositions itself adaptively and intelligently.

•  Operates  in  real  time. 

•  Asserts  and  reports  on  distinct  and  discrete  visual 
states. 

•  Utilizes  the  continuous  parametric  evolution  of  the 
visual  system. 

•  Accommodates  visual  uncertainties. 

We  concentrate  on  observing  a  manipulation  process 
in  order  to  illustrate  the  ideas  and  motive  behind  our 
framework.  The  process  of  observing  a  robot  hand  ma¬ 
nipulating  an  object  is  very  crucial  for  many  robotic  and 
manufacturing  tasks.  It  is  important  to  know  in  an  au¬ 
tomated  manufacturing  environment  whether  the  robot 
hand  is  doing  the  correct  sequence  of  operations  on  an 
object (or more than one object). It may be that
the workspace of the robotic manipulator cannot be accessed
by humans, as in the case of some space applications
or some areas within a nuclear plant, for example.
In such a case, having another robot "look" at the process
is a very good option. Thus, the observation process
can be thought of as a stage in a closed-loop fully automated
system where there are robots that perform the
required manipulation task and some other robots that
observe them and correct their actions when something
goes wrong. Typical manipulation processes include
grasping,  pushing,  pulling,  lifting,  squeezing,  screwing 
and  unscrewing.  In  this  work,  we  address  the  problem 
of  observing  a  single  hand  manipulating  a  single  object 
and  recognizing  what  the  hand  is  doing.  No  feedback 
will  be  supplied  to  the  manipulating  robot  to  correct  its 
actions.  We  divide  the  problem  into  three  major  com¬ 
ponents.  First,  we  identify  a  high-level  framework  for 
the  visual  states.  Next,  we  define  the  events  that  cause 
Figure 7: A Model for a Grasping Task

state  transitions.  Finally,  we  utilize  visual  uncertainties 
to  assert  the  state  of  the  system. 

5.1  State  Space  Modeling 

We  use  a  discrete  event  dynamic  system  as  a  high-level 
structuring  skeleton  to  model  the  visual  manipulation 
system.  Discrete  event  dynamic  systems  (DEDS)  are  dy¬ 
namic  systems  (typically  asynchronous)  in  which  state 
transitions are triggered by the occurrence of discrete
events  in  the  system.  Our  formulation  uses  the  knowl¬ 
edge  about  the  system  and  the  different  actions  in  or¬ 
der  to  solve  the  observer  problem  in  an  efficient,  sta¬ 
ble  and  practical  way.  The  model  incorporates  differ¬ 
ent  hand/object  relationships  and  the  possible  errors  in 
the  manipulation  actions.  It  also  uses  different  tracking 
mechanisms  so  that  the  observer  can  keep  track  of  the 
workspace  of  the  manipulating  robot.  A  framework  is 
developed  for  the  hand/object  interaction  over  time  and 
a  stabilizing  observer  is  constructed.  The  construction 
process  utilizes  a  task-dependent  coarse  quantization  of 
the  manipulation  actions  in  order  to  attain  an  active, 
adaptive  and  goal-directed  sensing  mechanism.  An  ex¬ 
ample  of  a  DEDS  automaton  for  a  simple  grasping  task 
is  shown  in  Figure  7. 
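
As an illustration of how a DEDS automaton structures the observer, a grasping task might be encoded as a table of event-triggered state transitions. The states and events below are hypothetical and are not taken from Figure 7; the sketch also ignores uncertainty, which the framework handles at a later stage.

```python
# Hypothetical states and events for a grasping task. The automaton in
# Figure 7 may use different states, events, and error transitions.
TRANSITIONS = {
    ("hand_far", "hand_approaches"): "hand_near_object",
    ("hand_near_object", "hand_retreats"): "hand_far",
    ("hand_near_object", "contact"): "grasping",
    ("grasping", "contact_lost"): "hand_near_object",
    ("grasping", "object_moves_with_hand"): "object_held",
    ("object_held", "object_dropped"): "error",
}

def observe(events, state="hand_far"):
    """Return the list of states visited while consuming the events.

    The system is asynchronous: the state changes only when an event
    that is defined for the current state occurs; any other event
    leaves the state unchanged.
    """
    trace = [state]
    for event in events:
        state = TRANSITIONS.get((state, event), state)
        trace.append(state)
    return trace

trace = observe(["hand_approaches", "contact", "object_moves_with_hand"])
```

In such a scheme the low-level vision modules would only have to decide which event, if any, has occurred, and the transition table would then determine the reported visual state.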

5.2  Event  Identification 

Low-level  modules  are  developed  for  recognizing  the 
“events”  that  cause  state  transitions  within  the  dynamic 
manipulation  system.  To  be  able  to  observe  how  the 
events  evolve  over  time,  we  must  be  able  to  identify  how 
the  hand  moves  and  how  the  hand/object  physical  re¬ 
lationship  evolves  over  time.  We  use  a  mix  of  2-D  and 


3-D  modules  to  recover  a  set  of  parameters  that  define 
the  continuous  parametric  evolution  of  the  scene  under 
observation.  Three  dimensional  evolution  of  the  hand 
motion  is  recovered  by  tracking  a  set  of  features  and 
two-dimensional  cues  to  the  number  of  objects  and  their 
relative  location;  two  dimensional  motion  with  respect 
to  the  manipulating  hand  is  recovered  in  real-time.  The 
recovered  events  are  then  used  to  assert  state  transitions 
within  the  DEDS  automata.  We  also  recover  uncertain¬ 
ties  associated  with  the  visual  event  recovery  and  utilize 
them  for  navigating  the  observer  automata. 

5.3  Utilizing  Uncertainties 

This  work  examines  closely  the  possibilities  for  errors, 
mistakes  and  uncertainties  in  the  visual  manipulation 
system,  observer  construction  process  and  event  identifi¬ 
cation  mechanisms.  We  divide  the  problem  into  a  num¬ 
ber  of  major  levels  for  developing  uncertainty  models  in 
the  observation  process.  The  propagation  of  uncertainty 
is  shown  in  Figure  8. 

The  sensor  level  models  deal  with  the  problems  in 
mapping  3-D  features  to  pixel  coordinates  and  the  errors 
incurred  in  that  process.  We  identify  these  uncertainties 
and  suggest  a  framework  for  modeling  them.  The  next 
level  is  the  extraction  strategy  level,  in  which  we  develop 
models  for  the  possibility  of  errors  in  the  low-level  image 
processing modules used for identifying features that are
to  be  used  in  computing  the  2-D  evolution  of  the  scene 
under  consideration.  In  the  following  level,  we  utilize  the 
geometric  and  mechanical  properties  of  the  hand  and/or 
object to reject unrealistic estimates for 2-D movements
that  might  have  been  obtained  from  the  first  two  lev¬ 
els.  We  transform  the  2-D  uncertainty  models  into  3- 
D  uncertainty  models  for  the  structure  and  motion  of 
the  entire  scene.  The  next  level  uses  the  equations  that 
Figure 9: Experimental Setting


govern the 2-D to 3-D relationship to perform the conversion.
We then reject the improbable 3-D uncertainty
models  for  motion  and  structure  estimates  by  using  the 
existing  information  about  the  geometric  and  mechanical 
properties  of  the  moving  components  in  the  scene.  The 
highest  level  is  the  DEDS  formulation  with  uncertainties, 
in which state transitions and event identification are asserted
according to the 3-D models of uncertainty that
were  developed  in  the  previous  levels,  and  error  recovery 
is  performed  according  to  the  ordering  of  the  recovered 
distributions. 

5.4  Conclusions 

The  approach  used  can  be  considered  as  a  framework  for 
a  variety  of  visual  tasks,  as  it  lends  itself  to  be  a  prac¬ 
tical  and  feasible  solution  that  uses  existing  information 
in  a  robust  and  modular  fashion.  The  work  examines 
closely  the  possibilities  for  errors  and  uncertainties  in 
the manipulation system, observer construction process
and  event  identification  mechanisms.  Ambiguities  are  al¬ 
lowed  to  develop  and  are  resolved  after  finite  time;  recov¬ 
ery  mechanisms  are  devised  too.  Details  of  the  observer 
system  can  be  found  in  [20;  21;  22;  23].  Theoretical  and 
experimental  aspects  of  the  work  support  adopting  the 
framework  as  a  new  basis  for  performing  task-oriented 
recognition, inspection and observation of visual phenomena.
The experimental setup of the observer and manipulating
robots is shown in Figure 9.

6  Spatio-Variant  Sensing

Traditional imaging for robotics vision has relied almost
exclusively on common commercial imagers, notably
television format sensors. Their advantages are
clear: the cameras are inexpensive and readily available,
and the sampling of the data is on a "natural" cartesian
(x,y) grid. These sensors have placed enormous demands,
however, on processing architectures. The problem
is not only that image analysis is an ill-defined task


in  the  real  world,  but  that  we  have  only  very  expensive 
machines  that  can  begin  to  process  the  data. 

Over the last seven years an international team, led by
Van der Spiegel at the University of Pennsylvania, Sandini
at DIST in Italy, and Claeys at IMEC in Belgium,
designed, built, and tested a new imaging chip called the
Retina [24]. The new camera serves as the foundation
to  a  new  approach  to  robotics  vision.  We  shift  the  focus 
at  the  systems  level  from  gathering  better  data  and  de¬ 
signing  machines  to  analyze  it  to  gathering  data  for  the 
computing  resources  that  exist.  The  result  is  a  prototype 
sensor  that  reduces  the  computational  complexity  of  the 
problem  by  three  orders  of  magnitude  and,  if  scaled  to 
commercial cameras, by six orders [25].

The  Retina  attempts  to  model  the  gross  characteris¬ 
tics  of  the  primate  visual  system  in  a  mathematically 
elegant  way.  The  computational  savings  arise  from  the 
same  mechanism  the  eye  uses,  namely,  to  maintain  one 
area  of  high  resolution  on  the  focal  plane  and  to  drop 
the  resolution  elsewhere.  The  mathematical  expression 
of  this  is  a  log-polar  mapping.  That  mapping  trans¬ 
forms  a  polar  data  space,  where  a  point  P  has  the  polar 
coordinates  (r,  theta),  by  taking  the  logarithm  of  the  ex¬ 
pression  for  the  point: 

$$P = r e^{i\theta} \;\longrightarrow\; P' = \ln(r) + i\theta = u + iv$$

This  mapping  has  the  useful  property  of  separating  ro¬ 
tations  (changes  in  theta)  from  magnifications  (changes 
in r). If the sensor has a uniform sampling grid in u
(ln(r)), then the spatial grid in r will exponentially grow as
distance  from  the  center  grows.  This  models  the  growth 
of  the  receptive  fields  in  primate  retinas. 
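
The separation of rotations from magnifications can be checked numerically. The following is a minimal sketch, assuming a point given in cartesian coordinates relative to the sensor center; the function name `log_polar` and the sample values are illustrative only.

```python
import cmath
import math

def log_polar(x, y):
    """Map a cartesian point (not the origin) to (u, v) = (ln r, theta)."""
    w = cmath.log(complex(x, y))  # principal value: ln(r) + i*theta
    return w.real, w.imag

# A sample point, the same point magnified by 3, and rotated by 0.5 rad.
p = complex(1.0, 1.0)
scale, angle = 3.0, 0.5
u0, v0 = log_polar(p.real, p.imag)
p_scaled = scale * p
p_rotated = p * cmath.exp(1j * angle)
u_s, v_s = log_polar(p_scaled.real, p_scaled.imag)
u_r, v_r = log_polar(p_rotated.real, p_rotated.imag)

# Magnification shifts only u (by ln(scale)); rotation shifts only v.
shift_from_scale = (u_s - u0, v_s - v0)
shift_from_rotation = (u_r - u0, v_r - v0)
```

Because the angle is taken as a principal value, a rotation that carries a point across the branch cut appears as a wrap-around in v rather than a plain shift; the sample values above avoid that case.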

The Retina layout in Figure 10 implements this mapping
by sampling in (r, theta) at points matching a uniform
(u, v) grid. The sensor clearly has rotational symmetry
and exponentially decreasing resolution. The circular
section contains only 1920 pixels (30 circles of 64
pixels/circle);  at  the  center  is  a  dense  rectangular  grid  of 
102  additional  photosites  [26].  The  cells  grow  fast:  the 
outermost  circle  is  over  ten  times  as  wide  as  the  inner¬ 
most.  This  leads  directly  to  the  small  pixel  count. 
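
The pixel count and the growth of the cells follow from sampling uniformly in u. The sketch below uses the ring and pixel counts given in the text; the inner radius and the exact outer-to-inner ratio of 10 are assumed values chosen only to illustrate the geometric growth, and are not the dimensions of the actual chip.

```python
import math

RINGS, PIXELS_PER_RING, CENTER_PHOTOSITES = 30, 64, 102  # from the text

def ring_radii(inner_radius, outer_to_inner_ratio):
    """Radii of rings spaced uniformly in u = ln(r)."""
    step = math.log(outer_to_inner_ratio) / (RINGS - 1)  # constant step in u
    return [inner_radius * math.exp(k * step) for k in range(RINGS)]

# Assumed geometry: unit inner radius, outermost ring ten times larger.
radii = ring_radii(inner_radius=1.0, outer_to_inner_ratio=10.0)

# Each pixel spans 1/64 of its ring, so its width grows in proportion to r.
pixel_widths = [2.0 * math.pi * r / PIXELS_PER_RING for r in radii]
growth_per_ring = radii[1] / radii[0]

circular_pixels = RINGS * PIXELS_PER_RING
total_photosites = circular_pixels + CENTER_PHOTOSITES
```

A constant step in u gives a constant ratio between successive ring radii, which is why a tenfold change in cell width is covered by only 30 rings.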

The  chip,  with  its  custom  driving  electronics,  is  now 
working  at  the  GRASP  laboratory  [27]  and  is  producing 
good  pictures  as  shown  in  Figure  11. 

Clearly visible in the data space is the large magnification
of the inner circles. The outer section provides
much poorer data, with pixels widely spaced and averaging
the incident light over a larger area. Still, these pixels
do provide useful information.

The  nature  of  the  information  has  changed,  however. 
No  longer  do  we  get  high  quality  data  across  the  focal 
plane. Indeed, we assume from the start that we do not
try to build a model of the world in one step. Instead,
we use the periphery to guide where we
point the camera. Implicit here is the idea of an active
observer. The Retina, just sitting on a bench waiting
for an object to enter its high-resolution spot, is useless.
We must actively build the world by moving the camera,
using the periphery to suggest candidates for attention.

Figure 10: The Retina CCD Imager

Figure 11: Picture of a mouse from the camera, centered
between the buttons (to the left) and ball. The picture on
the left is in the mapped plane: the vertical axis is v (v,
the angle of the point, increases moving down the axis) and
the horizontal is u (u, the log of the radial distance of the
point, increases to the right). The triangle at the upper
left of the image is data remapped back onto a cartesian grid.

The cost of using this sensor might be considered high.
The new data space will require rewriting or adapting
all our tools for the cartesian plane; this is the primary
cost outside the hardware development. The advantages,
however, suggest profit. The Retina has some one hundred
times fewer pixels than a standard television camera,
which drastically reduces the computational burden
of analysis, bringing it within the abilities of modern
machines. The gains also include the rich mathematical
structure of the mapping. That structure simplifies
pattern matching by making rotations and magnifications
linear shifts in the data space, and speeds time-to-impact
measurements by looking only at a radial flow.
Some distortions introduced by the mapping, such as
translational variance (linear translations become
curves in the data space), also disappear in an
active observer, where for example attention and tracking
automatically compensate for linear motion.
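The time-to-impact claim follows from a short derivation. For a pinhole camera approaching a point along the optical axis, the image radius is r = fX/Z, so u = ln(r) changes at the rate du/dt = -(dZ/dt)/Z = 1/tau, where tau is the time to impact. The sketch below checks this on a synthetic point; the scene values and function names are hypothetical and chosen only for illustration.

```python
import math

def log_radius(X, Z, focal=1.0):
    """u = ln r for a point at lateral offset X and depth Z (pinhole model)."""
    return math.log(focal * X / Z)

def time_to_impact(u_prev, u_curr, dt):
    """Estimate tau = 1 / (du/dt) from two samples of u taken dt apart."""
    return dt / (u_curr - u_prev)

# Hypothetical scene: point 0.5 units off-axis, 10 units away, closing at 1 unit/s.
Z0, V, X, dt = 10.0, 1.0, 0.5, 0.1
u_a = log_radius(X, Z0)
u_b = log_radius(X, Z0 - V * dt)
tau = time_to_impact(u_a, u_b, dt)
print(tau)   # close to the true value of about 9.95 s at the interval midpoint
```

The estimate depends only on the shift along the u axis; neither the angle v nor the unknown offset X enters, which is why only the radial flow needs to be examined.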

Since the sensor began working this summer, our focus
at the GRASP laboratory has been redeveloping traditional
image processing tools. Our work has looked at
edge detection in the new data space, detecting lines using
a Hough algorithm, calculating the centroid of an object,
and measuring time-to-impact. Each of these areas
requires an analysis of its mathematical basis under
the log mapping and coding the results on real images.
All algorithms must further be computationally simple
to work in a real-time environment.
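As an example of how a familiar tool changes under the log mapping, consider the centroid. Each sample in the (u, v) grid covers a Cartesian area proportional to r^2 = exp(2u), so samples must be weighted by that factor before averaging. The following sketch makes that assumption explicit; the grid sizes and the synthetic disc are illustrative and are not the GRASP implementation.

```python
import math

def centroid_from_log_polar(samples, du, dv):
    """samples: iterable of (u, v, weight). Returns the Cartesian centroid."""
    m = mx = my = 0.0
    for u, v, w in samples:
        area = math.exp(2.0 * u) * du * dv    # dx dy = r^2 du dv
        r = math.exp(u)
        m += w * area
        mx += w * area * r * math.cos(v)
        my += w * area * r * math.sin(v)
    return mx / m, my / m

# Synthetic check: a unit disc centered at (3, 1), sampled on a fine (u, v) grid.
n_u, n_v = 200, 256
u_min, u_max = math.log(0.05), math.log(10.0)
du, dv = (u_max - u_min) / n_u, 2 * math.pi / n_v
samples = []
for i in range(n_u):
    u = u_min + (i + 0.5) * du
    for j in range(n_v):
        v = (j + 0.5) * dv
        x, y = math.exp(u) * math.cos(v), math.exp(u) * math.sin(v)
        inside = (x - 3.0) ** 2 + (y - 1.0) ** 2 <= 1.0
        samples.append((u, v, 1.0 if inside else 0.0))

cx, cy = centroid_from_log_polar(samples, du, dv)
print(cx, cy)
```

Averaging the (u, v) coordinates directly, without the area weight, would bias the result toward the densely sampled center of the sensor.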

This  integration  of  sensor  and  computer  is  now  the 
fundamental  area  of  research  involving  the  Retina  at 
Penn.  That  the  Retina  works  proves  the  concept  of 
the  hardware,  of  designing  custom  imaging  sensors  for 
robots.  The  integration  itself  will  prove  the  concept  of 
the  system.  The  Retina  is  the  basic  building  block  for  a 
real-time  interactive  observer. 

7  Conclusions  and  future  plans 

The development of an Active Observer is underway
at the GRASP laboratory. Although future emphasis
will be placed on the control structure of such an observer,
its integration policies, and communication issues
with other observers and agents in general, there is still
a need for further study, development, and improvement
of the component technologies. For example, in the
case of understanding surface reflectance, we still have
not completed the theoretical underpinning of transparency.
With the problem of segmentation, while the
cooperation between surface and volumetric fittings is
necessary and helps in resolving ambiguities, the
first and second order primitives are clearly not sufficient
for modeling a broad class of real-life objects. Higher
order models will have to be invoked, but only selectively
and locally, after the lower order fits have failed. If
this order of fitting is violated, instabilities in
the fitting procedures can be expected. Finally, there is
the question of the control mechanism of the Active Observer.
As shown above, we have employed the Discrete
Event Dynamic System (DEDS) model. DEDS is a suitable
formalism for modeling continuous processes of observation
as well as events occurring at discrete intervals. As a result,
this model allows us to predict the observation capability
as defined by the control theory community. The
assumption here, however, is that the task of observation
is specified a priori in terms of the discrete events. While in
the original theory the transitions from one state/event
to another were discrete, we have extended the theory
to transitions with uncertainties. The next task should
be to loosen the requirement for explicit knowledge of
the desired observable events. It should be possible to
generate these events from rules of physics, geometry,
and other conventions governing the interactions of objects
and agents. In conclusion, we are on our way to completing
an Active Observer whose control structure
allows us to predict observation capabilities. The
components developed here allow the Active Observer to
handle moderately complex scenes of shapes/materials,
their spatial arrangements, and their illuminations.
Real-time processing is crucial, hence
our efforts in special purpose CCD chips and related
hardware. The open questions are many, but we wish
to concentrate on the intercommunication of several observers
and other agents, such as manipulatory, mobile,
and human agents. Ultimately, the final issue is this:
who tells what, how much, and to whom.
Abstract

Our research focuses on navigation in outdoor
terrain. We are particularly concerned
with  viewpoint  determination  in  the  absence 
of  easily  and  uniquely  identifiable  landmarks. 
Progress  has  been  made  on  methods  for  ex¬ 
tracting  topographic  features,  on  the  develop¬ 
ment  of  effective  strategies  for  accurate  view¬ 
point  determination,  and  in  characterizing  lo¬ 
calization  errors.  These  results  are  relevant  to 
problems  in  mobile  robotics,  navigation  aids, 
and  training. 

1  Introduction 

Vision-based  navigation  is  a  difficult  and  challenging 
problem,  particularly  in  large-scale  outdoor  domains. 
The  Universities  of  Utah  and  Minnesota  are  jointly  con¬ 
ducting  research  on  navigation  in  outdoor  terrain.  The 
current  focus  is  on  localization  tasks  in  which  a  viewpoint 
must  be  determined  given  a  map  of  the  area  and  imagery 
available at the viewpoint [Heinrichs et al., 1989].

Two  general  approaches  to  localization  are  possible. 
Signal-based methods (e.g., [Ernst and Flinchbaugh,
1989, Andreas et al., 1978, Baird and Abramson, 1984])
correlate expected imagery (or other data) generated
from terrain models with that actually coming from
sensors. Signal-based approaches work best when there are
good  a  priori  estimates  of  current  location,  reducing  the 
combinatorics  associated  with  correlation  over  viewing 
position  and  direction.  Even  then,  it  can  be  difficult  to 
generate  a  sufficiently  accurate  expected  view  to  com¬ 
pare  against  actual  data  using  any  efficiently  computed 
matching  metric.  Signal-based  methods  have  been  most 
successful  in  applications  such  as  cruise  missile  guidance 
where  active  sensing  is  used  to  gather  range  “images” 
which  can  also  easily  be  synthesized  from  elevation  maps. 
Feature-based  methods  first  extract  salient  features  from 
the  map  and  the  available  imagery  and  then  match  these 
features rather than the raw data itself. They are
particularly appropriate when definable landmarks are present
and where the available sensing modalities are such that
photometrically accurate synthetic views cannot be
produced. Feature-based approaches to object recognition
have recently been receiving substantial interest (e.g.,
[Grimson, 1990]), since a careful choice of features can
significantly reduce sensitivity to image properties not
relevant to the task at hand.

Our work is principally directed at feature-based
methods for localization. Three sub-problems need to
be solved [Thompson, 1990, Thompson et al., 1990].
First, salient features relevant to localization must be
extracted from imagery. This is in principle a relatively
standard low-level vision task, though the image
understanding community has done only limited
work to date on extracting such features from
ground-level images. Second, equivalent features need to be
extracted from digital elevation models or other available
maps. The cartographic community has done related work,
but not specifically in support of localization. Finally,
correspondences must be established between map
and image features and these correspondences then
used to determine the viewpoint. The closest related
work to this is in photogrammetry (e.g., [Sanso, 1973,
Thompson, 1958]) and alignment approaches to object
recognition (e.g., [Huttenlocher and Ullman, 1987,
Grimson, 1990]). Neither of these sources, however,
provides much help in developing methods for establishing
feature correspondences in terrain data while avoiding
prohibitive combinatorial difficulties.

Results  obtained  over  the  last  year  include: 

•  Feature selection and grouping strategies have been
developed which reduce the combinatorics of feature
matching while at the same time simplifying
viewpoint determinations [Smith et al., 1991,
Heinrichs et al., 1992, Pick et al., 1992].

•  We  have  shown  that  the  choice  of  the  particular 
landmarks  used  for  localization  can  have  a  dramatic 
effect  on  the  precision  with  which  localization  can 
be  accomplished  [Sutherland,  1992,  Sutherland  and 
Thompson,  1992]. 

•  A method has been developed for extracting relevant
features from elevation data which combines
local and global effects to maximize saliency.
Preliminary results have been obtained in extracting
image features which are specifically relevant to the
localization task in outdoor terrain [Savitt et al.,
1992].
2  Strategies for Localization

Landmark-based  localization  involves  a  combinatorial 
matching  problem  in  which  features  from  a  map  and  fea¬ 
tures  from  views  must  be  put  into  correspondence.  For 
localization to succeed, these correspondences must be
correct  and  must  yield  an  accurate  estimate  of  the  view¬ 
point.  In  the  absence  of  unique  cultural  features,  solv¬ 
ing  the  correspondence  problem  is  difficult  because  in¬ 
dividually  identifiable  topographic  features  are  not  typ¬ 
ically  present  in  views  of  open  terrain.  For  example, 
while  a  number  of  hills  may  be  visible  from  a  particular 
viewpoint,  the  views  seldom  provide  enough  information 
about  the  detailed  shape  of  each  hill  to  find  the  single 
corresponding  pattern  in  a  map. 

Two  important  consequences  follow  from  this  analysis: 

•  Correspondences  must  be  established  in  a  manner 
that  minimizes  ambiguity. 

•  Since  ambiguity  cannot  be  eliminated,  it  must  be 
possible  to  generate  and  then  evaluate  multiple  hy¬ 
potheses  about  the  location  of  the  viewpoint. 

These  objectives  can  be  achieved  by  exploiting  the 
power  of  a  number  of  simple  matching  strategies.  Many 
of  these  strategies  will  seem  obvious,  but  few  have  previ¬ 
ously  been  explicitly  incorporated  into  models  of  local¬ 
ization.  Matching  should  be  initiated  by  starting  with 
features  in  the  view  and  using  these  to  drive  a  search 
for  the  corresponding  map  features.  This  is  because  fea¬ 
tures  in  the  view  are  far  more  likely  to  be  relevant  than 
features  on  the  map,  since  most  map  features  will  not  be 
visible  from  any  single  viewpoint.  Information  about  the 
immediate  surroundings,  often  obtainable  with  reason¬ 
able  accuracy,  can  significantly  constrain  possible  view¬ 
points.  Such  constraints  can  be  incorporated  into  the 
path  planning  process  to  generate  movement  which  fa¬ 
cilitates  localization.  The  manner  in  which  multiple  hy¬ 
potheses  are  compared  is  critical.  In  particular,  it  is  most 
important to note expectations which are not met.
Finally, correspondence is aided when individual features
are  assembled  into  configurations  of  multiple  features. 
These  configurations  should  have  both  viewpoint  invari¬ 
ant  properties  that  help  in  searching  over  the  map  and 
viewpoint  dependent  properties  that  constrain  possible 
viewpoints. 

3  Sensitivity  in  Viewpoint 
Determination 

Assuming that correct correspondences between visible
landmarks and map features have been established,
the viewpoint must still be determined. In principle,
this involves straightforward trigonometric computations.
However, effective localization also requires an
understanding  of  the  errors  that  can  occur.  Often,  dis¬ 
tance  estimates  to  landmarks  will  not  be  available.  Mea¬ 
surements  of  absolute  bearing  to  a  single  landmark  or  the 
visual  angle  between  landmarks  will  always  have  some 
amount  of  uncertainty  associated  with  them.  This  uncer¬ 
tainty  propagates  through  the  localization  computation 
and  translates  into  uncertainty  about  the  actual  view¬ 
point. 
Figure 1: Regions of viewpoint uncertainty as a function
of landmark configurations.

We have completed an error analysis of localization
based on relative bearings to landmarks all within a field
of view < 180°. The nature of the error was found to
vary substantially with the particular configuration of
landmarks used. Figure 1 provides four examples. In
each case, Ob marks the actual observation point. A, B,
and C indicate the location of three landmarks. In all
four cases, the observation point is approximately the
same distance from the landmarks and the landmarks
are approximately the same distance from each other.
The dark line circumscribes the region of uncertainty
associated with an error of ±30% in the estimation of the
visual angle between landmarks. That is, any location
within the dark lines is consistent with visual angles that
vary no more than 30% from the actual angles observed
at Ob.*
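A region of this kind can be approximated by brute force. The sketch below scans a grid of candidate viewpoints and keeps those whose two visual angles (A to B, and B to C) lie within ±30% of the angles seen from Ob. The landmark coordinates, grid extent, and step are hypothetical values chosen for illustration; they are not the configurations of Figure 1, and this is not the analysis method of the cited papers.

```python
import math

def signed_angle(p, a, b):
    """Signed visual angle at viewpoint p from landmark a to landmark b, in [-pi, pi)."""
    ta = math.atan2(a[1] - p[1], a[0] - p[0])
    tb = math.atan2(b[1] - p[1], b[0] - p[0])
    return (tb - ta + math.pi) % (2 * math.pi) - math.pi

def uncertainty_area(viewpoint, landmarks, tol=0.30, extent=30.0, step=0.5):
    """Area of grid cells whose visual angles are within +/- tol of the true ones."""
    a, b, c = landmarks
    true_ab = signed_angle(viewpoint, a, b)
    true_bc = signed_angle(viewpoint, b, c)
    n = int(round(2 * extent / step))
    cells = 0
    for i in range(n):
        for j in range(n):
            p = (-extent + (i + 0.5) * step, -extent + (j + 0.5) * step)
            ab = signed_angle(p, a, b)
            bc = signed_angle(p, b, c)
            if (abs(ab - true_ab) <= tol * abs(true_ab)
                    and abs(bc - true_bc) <= tol * abs(true_bc)):
                cells += 1
    return cells * step * step

ob = (0.0, -10.0)                                  # hypothetical observation point
line = [(-5.0, 0.0), (0.0, 0.0), (5.0, 0.0)]       # collinear landmarks
bent = [(-5.0, 0.0), (0.0, 4.0), (5.0, 0.0)]       # middle landmark displaced
area_line = uncertainty_area(ob, line)
area_bent = uncertainty_area(ob, bent)
print(area_line, area_bent)
```

Signed angles are used so that the left-to-right order of the landmarks is respected; with unsigned angles the mirror-image viewpoint on the far side of the landmarks would also be accepted.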

Figure 1a shows the situation when observations are
made to the side of an equally spaced, linear configuration
of landmarks. Substantial localization uncertainty
results. Figures 1b and 1c are similar, except that landmarks
are no longer arrayed along a straight line. Figure 1d
shows another linear configuration of landmarks,
this time viewed from a position offset to the side. Not
only is the region larger than in Figure 1a, but the area
of uncertainty is skewed away from the “true” position.

When there is a choice of landmarks available on which
to base a localization determination, it is important to
choose the configurations that lead to the least uncertainty.
Furthermore, the general nature of the uncertainty
regions can be predicted based on simple properties
of the configuration of landmarks.

* An error bound of ±30% in visual angle is realistic when
dealing with non-point features such as rounded hills. Even
when the bearing to landmarks can be better localized, the
same qualitative differences in uncertainty regions hold.

Figure 2: Ridgelines and valleys using local slope properties.

4  Extraction  of  Map  and  Image 
Features 

Vision-based  navigation  requires  image  understanding 
techniques  to  extract  landmark  features  from  views  of 
the  environment  and  map  understanding  methods  to  lo¬ 
cate  corresponding  features  in  cartographic  data.  While 
existing  segmentation  algorithms  can  be  modified  to  deal 
with  the  first  of  these  tasks,  new  approaches  are  required 
for  map  understanding. 

The  cartographic  community  has  developed  a  num¬ 
ber  of  methods  for  efficiently  encoding  elevation  data. 
Less attention has been paid to the extraction of landmarks
specifically useful for navigation. We know that
peaks, ridgelines, saddle points, and valleys are important
in localization. Each of these has a well specified
definition in terms of differential operators applied to
the elevation surface [Haralick et al., 1983].
Implementations of these operators work fairly well for detecting
point features such as peaks and saddle points in digital
elevation models. They are less effective in extracting
line features such as ridgelines and valleys. The reason
for this is that the operators are computed over a local,
symmetric neighborhood. The saliency of ridge and valley
topography to navigation, however, is a function of
both local and global effects. For example, the importance
of a ridgeline has to do with the sharpness of the
ridge, its height above the surrounding terrain, and its
length. A local differential operator can only measure
the sharpness.
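To make the idea of a local differential operator concrete, the sketch below classifies a single grid cell from finite-difference estimates of the gradient and Hessian in its 3x3 neighborhood. It is a simplified illustration in the spirit of such operators, assuming unit grid spacing and an exact-zero gradient test; it is not the operator set of [Haralick et al., 1983], and real elevation data would need a tolerance tuned to the noise.

```python
def classify_point(z, r, c, eps=1e-9):
    """Label cell (r, c) of elevation grid z as 'peak', 'pit', 'saddle' or 'other'."""
    # Central-difference gradient (unit grid spacing assumed).
    zx = (z[r][c + 1] - z[r][c - 1]) / 2.0
    zy = (z[r + 1][c] - z[r - 1][c]) / 2.0
    if abs(zx) > eps or abs(zy) > eps:
        return "other"                       # not a critical point
    # Second differences give the Hessian entries.
    zxx = z[r][c + 1] - 2 * z[r][c] + z[r][c - 1]
    zyy = z[r + 1][c] - 2 * z[r][c] + z[r - 1][c]
    zxy = (z[r + 1][c + 1] - z[r + 1][c - 1]
           - z[r - 1][c + 1] + z[r - 1][c - 1]) / 4.0
    det = zxx * zyy - zxy * zxy              # product of the Hessian eigenvalues
    if det < -eps:
        return "saddle"
    if det > eps:
        return "peak" if zxx < 0 else "pit"
    return "other"

def surface(f, n=5):
    """Sample f(x, y) on an n x n grid centered on the origin."""
    h = n // 2
    return [[f(c - h, r - h) for c in range(n)] for r in range(n)]

hill = surface(lambda x, y: -(x * x + y * y))
saddle = surface(lambda x, y: x * x - y * y)
ramp = surface(lambda x, y: 2 * x + y)
print(classify_point(hill, 2, 2), classify_point(saddle, 2, 2), classify_point(ramp, 2, 2))
```

Everything the classifier sees lies within one cell of the point, which illustrates the limitation noted above: it can respond to local sharpness but knows nothing about a ridge's height above the surrounding terrain or its length.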

We have developed a novel method of extracting valley
features that employs a “reverse engineering” heuristic
based on the processes that created the valley in the
first place. A hydrologic simulation is done to determine
the flow that would result if the local topography were
covered with a uniform density of fluid. We have found
that high flow rates in the simulation are associated with
valleys that are of a magnitude sufficient to make them
priority candidates as landmarks. Ridgelines are found
using a variation of this method that is less closely related
to the terrain forming geological processes but is
nevertheless effective in finding salient features. In effect,
the fluid flow is now “uphill”, with appropriate modifications
to deal with the fact that ridgelines do not vary
monotonically in elevation along their length.
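The text does not give the details of the hydrologic simulation. As a stand-in, the sketch below uses a standard steepest-descent flow accumulation (often called D8): each cell starts with one unit of fluid and passes everything it collects to its steepest lower neighbor. This is an assumption made for illustration and should not be read as the authors' algorithm; the synthetic elevation grid is likewise hypothetical.

```python
import math

def flow_accumulation(z):
    """Each cell sends its accumulated fluid to its steepest lower neighbor."""
    rows, cols = len(z), len(z[0])
    acc = [[1.0] * cols for _ in range(rows)]    # one unit of fluid per cell
    cells = sorted(((z[r][c], r, c) for r in range(rows) for c in range(cols)),
                   reverse=True)                 # process from high to low
    for h, r, c in cells:
        best, target = 0.0, None
        for dr in (-1, 0, 1):
            for dc in (-1, 0, 1):
                rr, cc = r + dr, c + dc
                if (dr or dc) and 0 <= rr < rows and 0 <= cc < cols:
                    slope = (h - z[rr][cc]) / math.hypot(dr, dc)
                    if slope > best:
                        best, target = slope, (rr, cc)
        if target is not None:                   # local minima keep their fluid
            acc[target[0]][target[1]] += acc[r][c]
    return acc

# Synthetic V-shaped valley running down the middle column of a 5 x 5 grid.
dem = [[10.0 * abs(c - 2) + (4 - r) for c in range(5)] for r in range(5)]
acc = flow_accumulation(dem)
print([row[2] for row in acc])   # accumulated flow along the valley floor
```

Flow along the valley floor grows with the upstream area it drains, so thresholding the accumulation picks out valleys by their global extent rather than by local curvature alone. The result is also connected by construction, since every cell on a flow path hands its fluid to a neighbor.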

Figure  2  shows  valleys  (coded  in  white)  and  ridges 
(coded  in  black),  extracted  from  a  digital  elevation  model 
using  local  differential  detectors.  Figure  3  shows  the 
corresponding  features  obtained  with  the  hydrologic  flow 
method.  (Not  shown  is  the  fact  that  this  method  gives 
a  measure  of  importance  to  each  point  on  the  ridge  and 
valley  features.)  The  features  in  Figure  3  are  connected, 
as they should be, and as shown in [Savitt et al., 1992]
better  correspond  to  significant  topographic  structures. 

5  Relevance 

The  work  outlined  above  has  relevance  to  a  wide  variety 
of  applications.  When  large  numbers  of  potential  land¬ 
marks  are  available  to  a  navigation  system,  our  work 
on  error  analysis  and  sensitivity  provides  criteria  for  se¬ 
lecting  those  landmarks  which  will  yield  the  highest  ex¬ 
pected  precision.  Such  information  is  useful  in  both  mis¬ 
sion  planning  situations  and  for  automated  navigation 
aids.  The  extraction  of  salient  topographic  features  from 
digital  elevation  data  may  be  of  use  in  both  navigation 
aids  and  training.  The  problem  solving  strategies  which 
make  possible  efficient  feature-based  solutions  to  navi¬ 
gation  problems  will  be  important  to  future  automated 
systems  which  use  such  a  feature-based  approach.  More 
immediately,  it  may  be  possible  to  use  an  understand¬ 
ing  of  these  strategies,  determined  from  a  computational 
analysis  of  the  problem,  to  improve  education  in  map 
usage.  Finally,  this  research  contributes  to  basic  science 
in  image  understanding.  Localization  provides  a  well- 
defined  task  with  clear  criteria  for  success  enabling  the 
effective  evaluation  of  the  associated  image  understand¬ 
ing  techniques.  In  addition,  a  formal  analysis  of  vision- 
based  navigation  problems  reveals  similarities  with  other 
computer  vision  tasks.  As  an  example,  our  work  on  er¬ 
ror  and  sensitivity  has  relevance  to  pose  estimation  in 
object  recognition. 
Abstract

We  continue  to  explore  the  issues  involved  in 
building  integrated  planning,  sensing,  and  con¬ 
trol  systems  for  applications  in  service  robotics. 

The project is applicable to materials handling
robotics, ground or aerial information gathering
robotics, and other areas. The robotics applications
serve as a practical focus and impetus
for  combining  individual  efforts  in  image  under¬ 
standing,  tactile  sensing,  planning,  and  control. 

One  common  effort  is  focussed  on  a  small  mo¬ 
bile  platform  mounting  a  pair  of  stereo  cameras, 
a  low  resolution  laser  rangefinder,  and  a  me¬ 
chanical  force  sensor.  The  goal  here  is  model- 
based  navigation  through  corridors  and  rooms, 
and  recognition  and  position  estimation  of  ob¬ 
jects  to  be  acquired.  The  algorithms  and  sys¬ 
tems  being  developed  for  this  purpose  are  spe¬ 
cialized  versions  of  more  general  results  coming 
from  a  number  of  research  directions.  Those  re¬ 
ported  on  in  this  overview  are  the  following. 

1.  Vision:  Major  progress  was  made  in:  2D 
curve  and  3D  surface  object  modeling,  ro¬ 
bust  low  computational  cost  recognition, 
and  analysis  based  on  implicit  polynomi¬ 
als  of  degree  higher  than  2;  robust  low 
computational  cost  estimation  of  optical 
flow  for  motion  detection  and  object  track¬ 
ing;  highly  complex  curve  shape  modeling 
and  recognition  based  on  reaction-diffusion 
equations. 

2.  Mechanical Sensing: 3D objects can now
be recognized by measuring planar surface
curves at different elevations with a continuous
contact probe, and inferring the surface
from the curves.

3.  Planning and System Integration: We
have developed efficient decision-theoretic
methods for generating multistep plans
that involve information gathering and the
control of sensing. These methods are being
applied to mobile robotics problems involving
conflicting tasks and requiring relatively
short response times.

A  low  computational  cost  algorithm  find¬ 
ing  paths  through  a  region  populated  by 
curved  complex  shaped  objects  has  been 
developed  by  modeling  objects  as  implicit 
polynomial  surfaces. 

Other  work  in  new  areas  in  vision  and  in  planning  and 
system  integration  is  in  progress,  and  will  be  presented 
in  the  next  reporting  period. 

1  Vision 

1.1  Implicit  Polynomial  Curves  of  Degree 
Higher  Than  Two 

Implicit  higher  degree  polynomials  in  x,y,z  (or  in  x,y 
for  curves  in  images)  have  considerable  global  and 
semiglobal representation power for objects in 3D space.
(Spheres,  cylinders,  cones  and  planes  are  special  cases  of 
such  polynomials  restricted  to  second  degree.)  Hence, 
they  have  great  potential  for  object  recognition  and 
position estimation. In [13, 14, 15, 16], Taubin presented
a beautifully organized, easily accessible introduction
to these polynomials and some of their properties,
and  developed  low  computational  cost  algorithms  that 
give  excellent  curve  fits  to  data  in  the  plane  and  non- 
planar  data  in  3D,  and  surface  fits  to  range  data  in 
3D.  Though  the  polynomial  curve  and  surface  fits  to 
the  data  are  stable  in  the  vicinity  of  the  data  and  fit 
the  data  remarkably  well,  in  practice  we  find  that  the 
coefficients  of  these  higher  degree  polynomials  may  be 
sensitive  to  small  changes  in  the  data.  This  poses  a 
problem  since  we  would  like  to  compare  curves  and  sur¬ 
faces  based  on  the  polynomial  coefficients  only.  Since 
the  last  lU  Workshop,  we  have  worked  on  and  solved 
three  problems  to  overcome  this  difficulty  and  enable  the 
use  of  these  polynomials  in  real  world  robust  systems  [4, 
12]:  1)  Characterization  and  fitting  algorithms  for  the 
subset  of  these  algebraic  curves  and  surfaces  that  is 
bounded  and  exists  largely  in  the  vicinity  of  the  data;  2) 
A  Mahalanobis  distance  for  comparing  the  coefficients 
of  two  polynomials  to  reliably  determine  whether  the 
curves or surfaces that they represent are close over a
specified  region;  3)  Geometric  Invariants  for  determin¬ 
ing  whether  one  implicit  polynomial  curve  or  surface  is 
a rotation and translation of another, or whether one
implicit  polynomial  curve  is  an  affine  transformation  of 
another.  Unlike  the  invariants  from  classical  invariance 
theory, ours are functions of all the coefficients for a
polynomial of arbitrary degree, and require modest computation.
In addition to handling objects with easily detectable
features such as vertices, high curvature points,
and straight lines, the polynomials and tools discussed
in this paper are ideally suited to smooth curves and
smooth  curved  surfaces  which  do  not  have  detectable 
features.  The  motivation  for  this  work  was  for  a  system 
that  could  handle  very  general  object  shapes.  This  work 
is  presented  in  this  workshop  proceedings  in  [5]. 
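For readers unfamiliar with implicit polynomials, the sketch below represents a curve as the zero set of a polynomial f(x, y), stored as a map from exponent pairs to coefficients, and evaluates a common first-order estimate of the distance from a point to the curve, |f| / ||grad f||. The data structure, function names, and example curves are illustrative assumptions; the fitting algorithms, Mahalanobis distance, and invariants described above are not reproduced here.

```python
import math

# An implicit polynomial in x, y as {(i, j): coeff}, meaning sum coeff * x**i * y**j.
circle = {(2, 0): 1.0, (0, 2): 1.0, (0, 0): -1.0}     # x^2 + y^2 - 1 = 0 (degree 2)
quartic = {(4, 0): 1.0, (0, 4): 1.0, (0, 0): -1.0}    # x^4 + y^4 - 1 = 0 (degree 4)

def evaluate(poly, x, y):
    """Value of the polynomial at (x, y); zero exactly on the curve."""
    return sum(a * x ** i * y ** j for (i, j), a in poly.items())

def gradient(poly, x, y):
    """Partial derivatives (df/dx, df/dy) at (x, y)."""
    fx = sum(a * i * x ** (i - 1) * y ** j for (i, j), a in poly.items() if i > 0)
    fy = sum(a * j * x ** i * y ** (j - 1) for (i, j), a in poly.items() if j > 0)
    return fx, fy

def approx_distance(poly, x, y):
    """First-order distance estimate |f| / ||grad f||; undefined where grad f = 0."""
    fx, fy = gradient(poly, x, y)
    return abs(evaluate(poly, x, y)) / math.hypot(fx, fy)

print(evaluate(circle, 1.0, 0.0), approx_distance(circle, 1.1, 0.0))
print(evaluate(quartic, 1.0, 0.0), approx_distance(quartic, 1.1, 0.0))
```

The same coefficient map describes a circle or a higher degree curve, which hints at why comparing shapes through their coefficients is attractive, and why the sensitivity of those coefficients to small data changes matters.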

1.2  Practical  Optical  Flow  for  Motion 
Detection  and  Tracking 

Many current optical flow algorithms are not suited for
practical implementations such as tracking because they
require massively parallel supercomputers, specialized
hardware, or up to several hours on a scientific
workstation. One particular reason for this is the
quadratic nature of the search algorithms used in these
problems.

Recent  work  has  developed  two  modifications  to  these 
types  of  algorithms  which  can  convert  quadratic-time  op¬ 
tical  flow  algorithms  into  (at  worst)  linear-time  ones. 
The  first  uses  a  variable  image  sampling  rate  which 
trades  space  for  time  and  yields  an  algorithm  that  is 
at  worst  linear,  and  at  best  constant,  in  the  speed  of  the 
moving  objects  in  the  image.  This  technique  finds  the 
fastest  motion  in  an  image  and  is  ideal  for  tracking,  since 
the  fastest  moving  objects  in  a  robot’s  environment  are 
generally  the  most  interesting. 

The  second  modification  extends  this  approach  to 
create  a  multiple-speed  optical  flow  field  by  trans¬ 
forming  two-dimensional  searches  over  space  into  one¬ 
dimensional  searches  in  time.  This  space-time  inversion 
has  the  effect  of  searching  for  faster  moving  objects  in 
each  image  before  searching  for  slower  moving  ones,  with 
additional effort being exerted to search for slower objects
only when desired. A system of velocity masking
allows a tradeoff of angular resolution (but not magnitude
resolution) in the optical flow field for an algorithm
that is only linear, rather than quadratic, in the range
of velocities present.
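One way to read the space-time inversion is sketched below in one spatial dimension. Instead of testing many pixel displacements between two consecutive frames, the search keeps the displacement fixed at one pixel and varies the delay between frames, so a match at delay n implies a speed of 1/n pixels per frame. This is an interpretation offered for illustration, with synthetic data and hypothetical function names; it is not the authors' implementation.

```python
import math

def frame(t, v, n=64):
    """Synthetic 1D frame: a smooth bump translating at v pixels per frame."""
    return [math.exp(-0.5 * ((x - 20.0 - v * t) / 3.0) ** 2) for x in range(n)]

def patch_error(f_now, f_past, x, d, half=3):
    """Sum of squared differences between a patch at x now and at x - d earlier."""
    return sum((f_now[x + k] - f_past[x - d + k]) ** 2 for k in range(-half, half + 1))

def velocity_by_time_search(frames, x, max_delay):
    """Search delays 1..max_delay with a fixed one-pixel shift; return shift / delay."""
    now = frames[-1]
    best = (patch_error(now, frames[-2], x, 0), 0.0)    # stationary hypothesis
    for delay in range(1, max_delay + 1):
        past = frames[-1 - delay]
        for d in (-1, 1):
            err = patch_error(now, past, x, d)
            if err < best[0]:
                best = (err, d / delay)
    return best[1]

frames_half = [frame(t, 0.5) for t in range(6)]
frames_quarter = [frame(t, 0.25) for t in range(6)]
v_half = velocity_by_time_search(frames_half, x=22, max_delay=4)
v_quarter = velocity_by_time_search(frames_quarter, x=21, max_delay=4)
print(v_half, v_quarter)
```

The number of patch comparisons per pixel grows linearly with the number of delays examined, whereas an exhaustive spatial search over displacements up to S pixels in two dimensions grows with S squared. Short delays, which correspond to fast motion, are examined first, matching the ordering described above.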

These  algorithms  yield  real-time  performance  on  stan¬ 
dard  serial  hardware,  and  are  suitable  as  input  to  higher 
level  routines  such  as  tracking  modules  and  depth  deter¬ 
mination  based  on  relative  motion. 

1.3  Exploring  the  Shape  Manifold:  the  Role  of 
Conservation  Laws 

People  have  the  remarkable  ability  to  recognize  objects 
from  their  two-dimensional  projections.  This  is  often 
despite  the  absence  of  other  visual  cues  like  texture, 
color,  stereo,  motion,  etc.  Although  shape  plays  a  sig¬ 
nificant  role  in  recognition,  viable  theories  for  represent¬ 
ing  and  describing  shape  have  been  elusive.  We  believe 
that  the  key  to  robust  object  recognition  is  to  explic¬ 
itly  capture  the  inherent  connectivity  that  exists  among 
similar shapes. For example, shapes generated as an object
is being occluded, stretched, bent, or dented are all
similar; this similarity relationship should be explicitly
represented in the space of shapes. We propose an abstract
mathematical framework for shape which leads to
the computation of a robust representation suitable for
object recognition. Our approach relies on axioms motivated
by properties of objects and their projections.
The  relationship  between  an  object  and  its  neighbors  in 
this  abstract  space  is  described  by  slight  deformations. 
As  such,  we  study  deformations  of  objects  and  as  a  first 
step,  those  that  change  a  shape’s  boundary  solely  as  a 
function  of  its  local  geometry. 

We have shown that arbitrary deformations of curves
as a function of their local geometry are to a first degree
captured by combinations of deformations along the normal
that are either constant or depend on curvature [6].
It  is  in  the  interaction  between  constant  and  curvature 
deformations  that  such  issues  as  boundary/region,  lo- 
cal/global,  and  process/parts  are  resolved.  Our  study 
of  the  evolution  of  curves  under  such  deformations  and 
their  local  behavior  [8]  has  led  to  an  intriguing  result. 
Namely,  this  evolution  satisfies  a  viscous  conservation 
law  which  depicts  a  two  dimensional  space,  our  reaction- 
diffusion  space.  Since  these  deformations  are  nonlinear 
they  lead  to  singularities.  A  notion  of  entropy  for  shape 
is  developed  which  limits  the  singularities  of  shape  to 
shocks.  The  formation  of  shocks  and  their  classification 
in the reaction-diffusion space is the basis of our representation
for shape. As a special case, the locus of shocks
gives the medial axis, but one which is in addition “colored”
with type and significance. As another application, a notion
of scale for approximating shape naturally emerges
from our framework (based on entropy), leading to a two-dimensional
entropy scale space. Interestingly, certain
morphological  operations  as  well  as  Gaussian  smooth¬ 
ing  methods  are  special  cases  in  this  space  [7].  It  is  in 
this  space  that  the  idea  of  a  shape  similarity  metric  be¬ 
comes  viable.  Our  computational  results  also  suggest 
a language for describing shape in which parts, protrusions,
and bends are the basic elements, and the perceptual
reality of which is illustrated via qualitative experiments.
Finally, from a numerical perspective, our simulations
are robust under low resolution (128x128) and
high  noise,  and  are  potentially  suitable  for  applications 
involving  object  recognition  and  classification. 
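
As a hedged sketch (not necessarily the exact formulation of [6]), the combination of a constant deformation and a curvature-dependent deformation along the normal can be written as a single evolution equation. The symbols below are assumptions introduced for illustration: C(s,t) is the boundary curve, N its unit normal, κ its curvature, and β0, β1 are non-negative weights.

```latex
\frac{\partial \mathcal{C}}{\partial t}(s,t)
  = \bigl(\beta_0 - \beta_1\,\kappa(s,t)\bigr)\,\vec{N}(s,t),
\qquad
\mathcal{C}(s,0) = \mathcal{C}_0(s)
```

Under this reading, the relative size of β0 (the constant, or reaction, term) and β1 (the curvature, or diffusion, term) would give one axis of the reaction-diffusion space, and the time t (amount of deformation) the other.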

2  Continuous-Contact  Tactile 
(Force/Torque)  Sensing 

The  current  focus  of  this  research  has  been  directed  at 
expanding  the  two  dimensional  algorithm  described  in 
last  year’s  report  to  enable  a  robot  to  track  and  to  rec¬ 
ognize  three  dimensional  objects.  Initially,  the  objects 
considered  are  spheres,  cylinders,  cubes,  cones  and  vari¬ 
ations  of  these. 

The  three  dimensional  object  tracking  algorithm  is 
meant  to  be  as  general  as  possible.  It  does  not  de¬ 
pend  on  an  object’s  orientation,  shape  or  location  in 
the  workspace;  i.e.  there  is  no  prior  knowledge  of  the 
object’s  shape  or  surface  contour.  In  our  initial  experi¬ 
ments,  all  objects  are  tracked  using  the  two  dimensional 
tracking  algorithm.  The  robot  moves  along  an  object’s 
surface  in  a  plane  parallel  to  the  earth’s  surface,  and 
planar  slices  are  taken  at  increasing  levels  in  the  vertical 
Figure 1: This figure illustrates a portion of the reaction-diffusion
space for the shape of a fighter jet. The horizontal
axis is a measure of reaction (left) versus diffusion (right) and
the  vertical  axis  is  time  (amount  of  deformation).  Note  how 
shocks  form  and  disappear  as  one  traverses  this  space. 


direction.  The  location  of  the  object  and  the  change  in 
shape  of  each  planar  slice  is  identified  from  the  informa¬ 
tion  gathered  during  the  tracking.  Information  about 
the  change  in  the  surface  contour  in  the  vertical  (or  any) 
direction can be extrapolated from explicit knowledge of
the  Cartesian  position  of  each  planar  slice.  Given  this 
information,  the  recognition  program  is  able  to  distin¬ 
guish  between  simple  objects,  such  as  spheres,  cylinders, 
cones,  and  boxes. 

Simple  three  dimensional  objects  have  been  used  to 
test  the  tracking  algorithm.  For  example,  planar  slices 
from  a  box  were  identified  as  quadrilaterals  and  the  data 
points  from  a  cylinder  were  recognized  as  circles.  Hence, 
simple  objects  can  be  recognized  by  combining  shape 
information  from  the  planar  data  slices. 
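
The idea of combining per-slice shape information can be illustrated with a minimal sketch. It is not the recognition program described above: the data layout (height, shape label, radius), the evenly spaced slices, the tolerance, and the function names are all assumptions made for illustration.

```python
import math

def second_differences(values):
    """Discrete second derivative of an evenly spaced sequence."""
    return [a - 2 * b + c for a, b, c in zip(values, values[1:], values[2:])]

def classify_object(slices, tol=0.02):
    """Guess the object type from a stack of planar slices.

    slices: list of (height, shape, size) tuples. `shape` is the label given
    to one planar slice ("circle" or "quadrilateral") and `size` is the circle
    radius (ignored for quadrilaterals). Heights are assumed evenly spaced.
    This layout is hypothetical, not taken from the source.
    """
    shapes = {shape for _, shape, _ in slices}
    if shapes == {"quadrilateral"}:
        return "box"
    if shapes != {"circle"} or len(slices) < 3:
        return "unknown"
    radii = [size for _, _, size in sorted(slices)]
    scale = max(radii)
    if max(radii) - min(radii) <= tol * scale:
        return "cylinder"  # radius is constant with height
    if all(abs(d) <= tol * scale for d in second_differences(radii)):
        return "cone"      # radius is linear in height
    d2 = second_differences([r * r for r in radii])
    if all(d < 0 for d in d2) and max(d2) - min(d2) <= tol * scale * scale:
        return "sphere"    # r^2 = R^2 - (z - z0)^2 is quadratic in height
    return "unknown"

# Synthetic slices through the upper half of a sphere of radius 5.
sphere = [(z, "circle", math.sqrt(25 - z * z)) for z in range(5)]
print(classify_object(sphere))
```

The test on second differences is one simple way to separate constant, linear, and quadratic radius profiles; real tactile data would need a noise-tolerant fit instead.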

Complex  objects  also  have  been  employed  in  the  ex¬ 
periments.  A  standard  telephone  receiver  has  been 
placed at different orientations in the workspace, and surface
contour data has been collected using a modified
dual-drive tracking control algorithm. This data will be
used in an implicit polynomial model-based object recognizer.
This work is detailed in [10, 11].

3  Planning,  System  Integration  and 
Navigation 

3.1  Navigating  a  Robot  between  Curved 
Irregular Obstacles

In  this  work  an  attempt  is  made  to  plan  smooth  collision- 
free  paths  for  mobile  robots  between  obstacles  described 
by  polynomials.  Although  the  idea  of  using  algebraic 
techniques  to  describe  obstacles  is  not  new,  previous 
work  is  theoretical  in  nature  and  does  not  reflect  on 
how  to  describe  arbitrary,  non-polygonal  obstacles.  Also, 
previous algebraic-based methods take at least exponential
time to find a collision-free path. This work overcomes
these problems by utilizing two powerful
mathematical tools to solve the problem of navigating
a mobile robot between irregularly shaped obstacles.
The first is the description of obstacles by polynomials,
which enables a very compact description of
fairly  complicated  obstacles,  as  well  as  a  rule  for  very 
quickly  deciding  whether  the  robot  touches  an  obstacle 
or  not.  The  second  tool  is  regularization  to  choose  a 
smooth  solution  from  a  multitude  of  possible  solutions. 
In  our  case  regularization  will  assist  in  choosing  a  short 
and  smooth  path  among  the  paths  constituting  a  solu¬ 
tion,  e.g.  connecting  the  starting  and  goal  positions  and 
avoiding  the  obstacles.  It  is  known  that  the  shortest 
path  problem  is  NP-complete;  regularization  promises  a 
reasonably short path. A marked advantage of the work
presented here is that it allows us to describe these seemingly
different requirements of a path - its smoothness
and non-intersection with obstacles - in the same language,
which is a simple algebraic expression in the path
points,  thus  enabling  a  unified  treatment  of  these  two 
requirements.  The  algorithm  has  a  running  time  that  is 
linear  in  the  number  of  obstacles. 
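
The following is a minimal sketch of how smoothness and obstacle avoidance can share one algebraic cost, assuming obstacles given as implicit polynomials f(x, y) with f <= 0 inside. The circular obstacle, the penalty form, the weights, and the simple descent loop are illustrative assumptions, not the algorithm of this work.

```python
def circle_obstacle(cx, cy, r):
    """Implicit polynomial f(x, y) = (x-cx)^2 + (y-cy)^2 - r^2; f <= 0 inside."""
    return lambda x, y: (x - cx) ** 2 + (y - cy) ** 2 - r ** 2

def touches(point, obstacles):
    """Collision rule: one polynomial evaluation per obstacle."""
    x, y = point
    return any(f(x, y) <= 0.0 for f in obstacles)

def path_cost(path, obstacles, smooth_weight=1.0, obstacle_weight=50.0):
    cost = 0.0
    # Regularization term: squared second differences penalize sharp turns.
    for (x0, y0), (x1, y1), (x2, y2) in zip(path, path[1:], path[2:]):
        cost += smooth_weight * ((x0 - 2 * x1 + x2) ** 2 + (y0 - 2 * y1 + y2) ** 2)
    # Length term: squared first differences favor short paths.
    for (x0, y0), (x1, y1) in zip(path, path[1:]):
        cost += (x1 - x0) ** 2 + (y1 - y0) ** 2
    # Obstacle term: nonzero only where a path point lies inside an obstacle.
    for x, y in path:
        for f in obstacles:
            cost += obstacle_weight * max(0.0, -f(x, y)) ** 2
    return cost

def relax(path, obstacles, steps=200, h=1e-4, rate=0.05):
    """Coordinate-wise descent that only accepts moves lowering the cost."""
    path = [list(p) for p in path]
    best = path_cost(path, obstacles)
    for _ in range(steps):
        for i in range(1, len(path) - 1):  # start and goal stay fixed
            for k in (0, 1):
                old = path[i][k]
                path[i][k] = old + h
                grad = (path_cost(path, obstacles) - best) / h
                path[i][k] = old - rate * grad
                new = path_cost(path, obstacles)
                if new < best:
                    best = new
                else:
                    path[i][k] = old
    return [tuple(p) for p in path], best

obstacles = [circle_obstacle(0.5, 0.0, 0.2)]
# Slightly bowed initial path from (0, 0) to (1, 0).
start = [(i / 10.0, 0.02 * i * (10 - i) / 25.0) for i in range(11)]
smoothed, cost = relax(start, obstacles)
```

In this sketch each cost evaluation visits every obstacle once per path point, so its work grows linearly with the number of obstacles.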

3.2  Planning  and  System  Integration 

We  have  adopted  Bayesian  decision  theory  as  the  basis 
for communication between modules performing various
sensing and interpretation tasks and those engaged in
decision making. Modules performing sensing and interpretation
are prompted for information concerning spatial
or object properties and return hypotheses and associated
distributional information (e.g., an object of type
T is located in region R with probability better than π).

For  instance,  in  the  process  of  locating  a  particular 
sought-after object, a decision making module might ask
a  module  responsible  for  interpreting  range  data  from 
stereo  for  information  about  the  location  of  three  planar 
patches  in  a  particular  spatial  configuration.  The  stereo 
module  would  return  a  set  of  hypotheses,  where  each  hy¬ 
pothesis  consists  of  a  3-D  description  for  the  three  planar 
patches,  and  a  measure  of  confidence  for  each  hypothesis. 
The decision making module would use the information
returned by the stereo module to confirm (or disconfirm)
one of its own hypotheses; it might commit to a
particular hypothesis (e.g., reporting on the location of
the sought-after object) or it might engage in further information
gathering (e.g., moving the robot to obtain a
different view and then asking some module, not necessarily
the stereo module, for additional information
[12]). If, for example, low-error object identification is
required,  the  decision  module  might  use  several  stereo 
image  pairs  taken  from  different  positions  to  generate  a 
small  number  of  candidate  positions  for  the  sought  af¬ 
ter  object,  and  then  employ  a  more  time  consuming  test 
involving  tactile  recognition  to  eliminate  spurious  can¬ 
didates.  The  interpretation  of  range  data  from  stereo 
constitutes  a  relatively  quick-and-dirty  estimate,  while 
the  tactile  sensing  provides  high-accuracy  information 
even  in  low-  or  no-light  situations. 

In  our  framework,  decision  modules  routinely  perform 
value-of-information  calculations  to  decide  whether  to 
perform  additional  information  gathering  tasks.  In  all 
of  the  applications  we  are  looking  at,  the  robot  is  inter¬ 
acting  with  other  processes  outside  of  its  control,  and 
so  it  is  constantly  pressed  for  time.  Information  costs 
in terms of time and opportunities lost while obtaining
that  information. 
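
A value-of-information calculation of the kind mentioned above can be sketched as follows. The hypotheses, sensor likelihoods, utilities, and cost figure are invented for illustration and the function names are hypothetical; only the general Bayesian decision-theoretic pattern is intended.

```python
def expected_utility(belief, utility):
    """Best expected utility over actions: max_a sum_h P(h) * U(a, h)."""
    return max(sum(belief[h] * u[h] for h in belief) for u in utility.values())

def value_of_information(belief, likelihood, utility, cost):
    """Expected gain from one more observation, net of its cost.

    belief:     {hypothesis: prior probability}
    likelihood: {hypothesis: {observation: P(observation | hypothesis)}}
    utility:    {action: {hypothesis: utility}}
    cost:       time and lost opportunity, expressed in utility units
    """
    observations = next(iter(likelihood.values())).keys()
    eu_after = 0.0
    for obs in observations:
        joint = {h: belief[h] * likelihood[h][obs] for h in belief}
        p_obs = sum(joint.values())
        if p_obs == 0.0:
            continue  # this observation cannot occur under the current belief
        post = {h: p / p_obs for h, p in joint.items()}
        eu_after += p_obs * expected_utility(post, utility)
    return eu_after - expected_utility(belief, utility) - cost

belief = {"present": 0.5, "absent": 0.5}
likelihood = {
    "present": {"hit": 0.9, "miss": 0.1},
    "absent": {"hit": 0.2, "miss": 0.8},
}
utility = {
    "report": {"present": 1.0, "absent": -1.0},
    "wait": {"present": 0.0, "absent": 0.0},
}
voi = value_of_information(belief, likelihood, utility, cost=0.1)
```

A decision module following this pattern would gather more information only when the returned value is positive, which is how the cost of time enters the decision.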

In  [2],  we  describe  our  first  mobile  robotics  applica¬ 
tion  involving  a  robot  given  the  task  of  tracking  and 
periodically  reporting  on  the  position  of  a  mobile  target. 
Performance  is  measured  in  terms  of  the  accuracy  of  the 
reports.  The  robot  is  given  a  representation  (map)  of  its 
environment,  but  has  to  estimate  its  location  in  that  en¬ 
vironment  by  continually  observing  its  surroundings  and 
matching  against  the  information  in  the  map.  Since  its 
primary  sensor  for  obtaining  observations  of  its  environ¬ 
ment  is  an  array  of  sonar  sensors,  the  robot  can  easily  get 
lost  when  moving  in  large  open  areas.  The  robot  tracks 
the target using multiscale optical flow calculations [1].

At  each  point  in  time,  the  robot  has  to  make  trade¬ 
offs  that  involve  losing  registration  with  the  map  (get¬ 
ting  lost)  and  losing  contact  with  the  target.  In  both 
cases, losing registration and losing contact, the accuracy
of the robot’s reports on the location of the target
suffers. Determining an optimal or near-optimal course of
action  requires  careful  consideration  of  the  robot’s  sur¬ 
roundings  and  perceptual  capabilities.  By  employing  a 
general  graph-based  representation  for  Bayesian  decision 
models,  we  are  able  to  carefully  analyze  the  complexity 
of  the  decision  models  required.  The  decision  model  that 
we  provide  for  this  problem  is  quite  general  and,  at  least 
in  theory,  provides  an  optimal  decision  procedure.  In 
practice,  we  were  forced  to  make  concessions  to  the  com¬ 
binatorics  inherent  in  the  problem.  In  [9],  we  describe 
those  concessions  in  some  detail,  justifying  the  tradeoffs 
that  we  made  in  terms  of  the  potential  loss  of  preci¬ 
sion  versus  the  potential  reduction  in  decision-making 
response  time. 

The  tracking  problem  described  above  was  designed  as 
a  (relatively)  simple  exercise  involving  decision  making 
and  active  perception  in  time-stressed  situations.  With 
the  exception  of  the  techniques  for  multiscale  optical 
flow,  the  task  does  not  involve  any  image  understand¬ 
ing.  We  are  continually  reviewing  our  image  understand¬ 
ing  and  tactile  sensing  research  in  an  effort  to  find  and 
transfer  techniques  to  our  service  robotics  efforts.  In 
particular,  we  are  seeking  to  develop  interpretation  and 
parameter  estimation  routines  that  run  on  the  compu¬ 
tational  resources  available  in  our  mobile  robotics  test 
environment.  We  are  developing  fast,  robust  procedures 
for  estimating  the  parameters  of  surface  patches  from 
stereo  pairs,  tracing  surface  contours  using  tactile  sens¬ 
ing,  and  extracting  depth  information  from  optical  flow 
fields. 

We  have  recently  begun  work  on  a  new  application 
involving a mobile robot engaged in reconnaissance operations.
In these proceedings [3], we investigate some of
the representational issues that arise in the reconnaissance
task as well as more general issues that arise in
quantifying (gathering statistics for) Bayesian decision
models and designing efficient inference algorithms for
problems  in  robotics. 
Abstract


The Department of Defense Unmanned Ground Vehicle
Program is a multi-year effort involving the Services,
OSD and DARPA. The objective of the program is to
first  field  a  teleoperated  unmanned  ground  vehicle 
system,  followed  by  preplanned  product  improvements 
leading  to  self-navigating  systems  performing 
reconnaissance,  surveillance,  target  acquisition  and 
designation  missions.  The  system  capabilities 
envisaged  are  enabled,  in  large  part,  by  image 
understanding,  planning,  control  and  robotics  research 
performed  over  the  last  ten  years.  The  program  will 
continue  to  push  the  state  of  research,  while 
significantly  advancing  the  state  of  practice. 

1.0  Background 

In  recent  years,  Congress  “has  been  concerned  about  the 
direction  and  composition  of  the  many  diverse  robotics 
projects  undertaken  by  the  armed  services  and  defense 
agencies.”*  Congress  therefore  directed  the  establishment 
of  a  DoD  robotics  master  plan  in  1989.  The  diversity  of 
robotics projects that were described in the 1989 plan led to
a Congressional request in 1990 to consolidate all of the
ground  robotics  vehicle  projects  “under  OSD  policy  and 
program  direction.” 

In  response  to  this  Congressional  mandate,  previously 
separate  ground  vehicle  related  robotics  efforts  were 
consolidated  in  a  single  program  element,  under  the 
direction  of  the  Tactical  Warfare  Programs  (TWP)  office  of 
the  Director  for  Defense  Research  and  Engineering 
(DDR&E).  Since  1990,  the  TWP  office  has  been 
responsible  for  reporting  on  the  activities  of  this  program 


1. Report 101-132 from the Senate Committee on Appropriations
on the Department of Defense Appropriations Bill, 1990. Quotations
in this section are from Report 101-132.


element  to  Congress,  providing  direction,  allocating 
appropriated  funds  to  projects,  and  carefully  monitoring 
the  progress  of  all  DoD  Unmanned  Ground  Vehicle 
activities.  The  Services  and  the  Defense  Advanced 
Research  Projects  Agency  (DARPA)  are  responsible  for  the 
conduct  and  daily  management  of  the  projects. 

1.1  Recent  Planning 

Following  the  consolidation  of  DoD  robotics  activities,  a 
number  of  planning  activities  took  place  involving  OSD, 
the  Services  and  DARPA.  The  overall  rationale  for  the 
robotics  program  was  developed,  based  on  analysis  of 
existing  and  emerging  requirements,  coupled  with  an 
assessment  of  emerging  image  understanding,  control,  and 
planning  technologies.  The  various  projects  underway  or 
proposed  by  the  Services  were  reviewed.  Several  on-going 
programs  were  terminated,  and  those  that  were  selected  for 
continuance were restructured to produce “a more
focused  and  cost-effective  robotics  program,”  as  directed 
by  Congress. 

Prior  to  the  consolidation  of  robotics  efforts,  all  robotics 
and  unmanned  ground  vehicle  programs  in  DoD  were 
principally  technology  driven,  as  opposed  to  operating 
under  the  mandate  of  user  requirements.  Since 
consolidation,  the  U.S.  Army  has  approved  a  requirement 
for  an  unmanned  ground  vehicle,  named  “CALEB,”  to 
conduct  reconnaissance,  surveillance  and  target  acquisition 
(RSTA)  in  support  of  ground  infantry  forces. 

National  economic  pressures,  coupled  with  world-wide 
events  leading  to  the  dismantling  of  the  Warsaw  Pact,  and 
events  leading  to  the  secession  of  virtually  all  republics 
from  the  USSR  are  forcing  the  DoD  to  reduce  overall  force 
structure.  However,  as  the  events  of  Desert  Shield/Storm 
have  shown,  the  United  States  still  must  maintain  the 
capability to move large, technologically sophisticated
forces  to  counter  the  actions  of  hostile  nations. 

1.2 Impact of Desert Storm

Many  lessons  concerning  the  relative  importance  and 
likelihood  of  future  robotic  systems  were  demonstrated  in 
Desert Storm. Among them:

• For the first time, Unmanned Air Vehicles (UAVs) were
widely  used  in  combat. 

•  Land  forces  confronted  the  immediate  threat  of  chem¬ 
ical  weapons. 

•  A  hurried  request  for  remotely  operated  mine-clearing 
tanks,  and  shallow  watercraft  was  made. 

•  High  technology  weapons  demonstrated  the  effective¬ 
ness  of  autonomous  guidance. 

• Desert Storm revealed the political significance of individual
weapons systems in regional conflicts, with the
SCUD/PATRIOT events being the most dramatic example.

• Desert Storm set a standard of minimal friendly casualties
against which the results of future conflicts will be
measured.

1.3  UGV/TUGV  Program  Structure 

The  technology  associated  with  unmanned  systems*  is 
maturing  faster  than  the  concepts  of  employment  are  being 
developed.  The  structure  of  the  DoD  Unmanned  Ground 
Vehicle  Program  reflects  this  reality,  in  a  coordinated 
evaluation  and  development  program  with  the  objective  of 
first fielding an unmanned system by 1998.

The  TUGV  is  the  principal  effort  of  the  current  UGV 
advanced  development  program.  The  TUGV  program  is 
being  planned  and  managed  with  an  awareness  that  it 
represents  an  initial  step  in  the  evolution  and  fielding  of 
UGVs  for  combat  applications  and  that  its  success  or 
failure  may  have  far-reaching  consequences. 

Three  principal  foci  encompass  the  DoD  UGV  program. 
First,  several  Surrogate  Teleoperated  Vehicles  (STVs)  will 
be  developed  and  used  to  support  Early  User  Test  and 
Evaluation  (EUTE)  of  UGV  concepts.  Second,  a  full  scale 
Engineering  and  Manufacturing  Development  (EMD) 
program  will  develop  the  first  fielded  Teleoperated 
Unmanned  Ground  Vehicle.  Third,  the  Unmanned  Ground 


1.  The  term  UGV  is  used  in  a  general  sense  to  include  a  range  of 
applications.  The  term  Tactical  Unmanned  Ground  Vehicle 
(TUGV)  will  refer  to  a  specific  project,  which  is  developing  one 
class  of  UGVs. 


Vehicles  Technology  Enhancement  and  Exploitation 
(UGVTEE)  program  will  focus  on  maturing  those  robotics 
technologies  of  particular  interest  to  UGV  systems.  The 
UGVTEE  program  is  a  demonstration-directed  effort, 
including  Demo-I  and  Demo-II,  whose  principal  aims  are 
to  mature  and  transition  near-term  technology,  and  to 
develop  semi-autonomous  navigation  technology 
respectively. 

2.0  The  Surrogate  Teleoperated  Vehicle 
(STV)  Program 

The  STV  program,  managed  by  the  Joint  Unmanned 
Ground  Vehicles  Office  (JUGVO)  at  the  U.S.  Army  Missile 
Command  (MICOM)  in  Huntsville,  Alabama,  will  develop 
14  Surrogate  Teleoperated  Vehicles  (STVs).  These  will  be 
used to conduct Early User Test and Evaluation (EUTE), by
placing  six  STVs  in  a  USMC  infantry  brigade,  and  six  in  a 
U.S.  Army  brigade  for  a  period  of  one  year,  starting  in 
1992. 

FIGURE  1.  The  Surrogate  Teleoperated  Vehicle 


As  shown  in  Figure  1,  the  STV  is  a  six-wheel-drive,  fully 
amphibious  platform.  It  contains  all  automotive  and 
navigational components, including sensors and control for
teleoperated driving under day, night, and adverse
environmental conditions. The platform is powered by a
hybrid 25 horsepower (hp) diesel engine and a 3 hp electric
motor,  for  silent  locomotion  when  required.  Automotively, 
the  STV  will  be  able  to  traverse  roads  at  35  miles  per  hour 
(mph)  and  travel  off-road  at  25  mph.  Its  remote  driving 
speeds  will  depend:  1)  on  the  skill  of  the  operator  in 
teleoperated  mode  or,  2)  on  the  sophistication  of  the 
software  as  semi-autonomous  capabilities  are  added.  By 
placing  these  teleoperated  vehicles  into  the  hands  of 
soldiers  and  marines,  the  JUGVO  will  acquire  direct  access 
to  employment  concepts  created  by  users  in  tactical 
environments. 
3.0 TUGV Engineering and
Manufacturing Development

In  the  Engineering  and  Manufacturing  phase,  the  selected 
system contractor(s) will be responsible for fabricating a
production-ready  TUGV.  The  Government  will  conduct 
Developmental  Test  and  Evaluation  and  Initial  Operational 
Test  and  Evaluation  of  the  contractor-provided  TUGV 
prototypes.  These  tests  will  determine  readiness  for 
production  of  a  first  generation  TUGV.  Milestone  III  is 
planned  for  the  end  of  1997. 


4.0  UGV  Technology  Enhancement  and 
Exploitation 

UGVTEE  consists  of  technology  base  efforts  supporting 
current and future UGV projects, and involves participation
from  academe,  industry,  DoD,  DoE  and  NASA 
laboratories.  The  UGVTEE  program  is  directed  to  exploit 
robotics  advances  and  mature  those  technologies  that  are 
critical  to  the  robotization  of  UGV  systems.  The  near  term 
focus  of  this  program  is  on  providing  the  mission 
capabilities  and  technological  enhancement  required  for 
the  TUGV.  This  part  of  the  program  (Demo-I)  will 
conclude  in  FY1992.  The  long  term  focus  is  on  image 
understanding,  planning  and  control  technologies  that  will 
enhance  operational  capability  and  survivability.  This  part 
of the program (Demo-II) was initiated in FY1991,
with the main focus of developing autonomous navigation
under battlefield conditions. Technology development will
include support for reconnaissance, surveillance, and target
acquisition (RSTA) functions while the UGV is moving,
and  distributed  artificial  intelligence  supporting  automated 
communication  with  other  vehicles,  and  work  load 
partitioning  between  vehicles  to  accomplish  mission 
objectives. 


5.0 UGVTEE Demo-I

The  principal  purpose  of  Demo-I  is  to  mature  critical 
system  component  technologies  for  first  generation 
teleoperated  UGVs  and  demonstrate  their  readiness  for 
acquisition  programs.  Based  on  the  results  of  Demo-I, 
selected  technologies  will  be  integrated  into  the  basic  STV 
for  the  development  of  a  complete  TUGV  prototype.  The 
emphasis  is  on  reducing  operator  work  load  while 
enhancing  performance  of  the  RSTA  mission. 


6.0 UGVTEE Demo-II

The  purpose  of  Demo-II  is  to  develop  and  mature  those 
navigation  technologies  that  are  critical  to  evolving  UGVs 
from labor-intensive teleoperated systems requiring fibre-optic
cables for communication to supervised autonomous
systems utilizing low-bandwidth non-line-of-sight
communication.  The  objective  of  the  program  will  be  to 
demonstrate  four  semi-autonomous  cooperating  unmanned 
ground  vehicles  performing  navigation,  reconnaissance, 
surveillance,  target  acquisition  and  target  designation. 

As shown in Figure 2, Demo-II is a four-phase, five-year
program  with  three  interim  demonstrations  directed  at 
transitioning  research  results  onto  the  surrogate  vehicles.  In 
each  of  the  sections  describing  the  interim  demonstrations 
that  follow,  the  objectives  and  technological  approach  for 
each  will  be  discussed. 


FIGURE  2.  Demo-II  Overview 
6.1 Demo-II Technologies

Realization  of  the  Demo-II  objectives  will  require 
moderate  to  substantial  increases  in  capabilities  from 
current  state-of-the-art  in  Image  Understanding, 
Navigation,  Planning,  Control,  and  Distributed  Artificial 
Intelligence. The recommendations for approving the
Demo-II program were based on research results
developed under support by
the DARPA Image Understanding¹, Planning², and
Robotics Science³ Programs, and others.⁴

Of principal importance in the consolidation of the DoD
program have been the research results demonstrated on the
Carnegie  Mellon  University  Navlab,  shown  in  Figure  3. 
Navlab  is  a  mobile  research  laboratory,  designed  for 
facilitating  research  in  navigation,  control,  and  image 
understanding. 
FIGURE 3. CMU Navlab


6.1.1  Stereo  Vision. 

Emerging requirements for all military systems dictate that
principles of stealth, low observability, and low emission
be observed. Consequently, a significant objective of the
Demo-II program is to develop passive stereo vision based
techniques  for  acquiring  the  local  environment  models 
required  by  unmanned  ground  vehicles  for  navigation  and 
obstacle  avoidance.  While  much  progress  has  been  made  in 
stereo  vision,  differences  in  technical  approach  and 
hardware  requirements  demand  that  the  current  state  of 
research  be  stressed,  that  metrics  be  established,  and  that 
the  alternative  approaches  be  quantifiably  measured  against 
these  metrics.  The  Demo-II  program  is  specifically  funding 
research  in  real-time  stereo  vision  and  stereo  analysis  to 
overcome  the  existing  limitations  in  analysis  of  stereo 
vision. 


1. Principal Image Understanding Research provided by Carnegie
Mellon University [Kanade, Thorpe, Whittaker], University
of Massachusetts [Hanson, Riseman, Weems], University of
Maryland [Rosenfeld, Davis], University of Pennsylvania
[Bajcsy], University of Rochester [Ballard, Brown], Advanced
Decision Systems [Morgan], General Electric Corporation
[Mundy], SRI [Bolles, Hannah, Strat], Hughes Aerospace Company
[Tseng, Nash], Massachusetts Institute of Technology [Poggio,
Lozano-Perez], and others.

2. Principal Planning Research provided by University of Maryland
[Davis], Carnegie Mellon University [Thorpe, Whittaker],
Massachusetts Institute of Technology [Brooks], Advanced Decision
Systems [Soldo], University of Michigan [Jain, Durfee],
Yale University [Dean], Stanford University [Cannon,
Latombe], and others.

3. Principal Robotics Science Research provided by University of
Pennsylvania [Bajcsy], Stanford University [Cannon], Massachusetts
Institute of Technology [Brooks, Lozano-Perez], and others.

4.  My  sincerest  apologies  to  those  not  listed  due  to  minimal 
space. 


Current  research  indicates  that  it  is  now  becoming  feasible 
to  use  range  images  obtained  with  passive  stereo  vision  to 
detect obstacles in support of semi-autonomous, cross-country
navigation. Within the Demo-II program, we are
focusing on methods to characterize the performance of
stereo vision quantitatively, and means to develop design
methodologies that relate task requirements to stereo system
design parameters.

Our  objective,  as  shown  in  Figure  4,  is  to  develop  an 
engineering  approach  to  designing  passive  stereo  vision 
systems  that  provide  sufficient  detail  to  enable  successful 


FIGURE 4. Stereo Vision


small  vehicle  navigation  under  overhanging  tree  branches, 
and  to  detect  obstacles  such  as  rocks  of  a  size  that  is 
relevant  to  the  current  navigation  task  -  far  enough  ahead  of 
a  vehicle  to  ensure  safety  at  computationally  and  visually 
supportable  speeds.  Such  an  engineering  approach  includes 
calculating  the  vehicle  stopping  distance  as  a  function  of 
initial  velocity,  calculating  the  minimum  look-ahead 
distance at which sensors must detect obstacles, and
determining detectability criteria as a function of look-ahead
distance,  obstacle  size  and  sensor  noise  characteristics. 
Finally, the approach leads to specification of sensor
system parameters, such as image resolution and
computational speed, that are required to meet the
detectability criteria for the desired vehicle speed.
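
The chain of calculations described above (stopping distance, look-ahead distance, detectability) can be sketched with textbook kinematics. The constant-deceleration model, the latency and margin terms, and all numeric values below are assumptions for illustration, not Demo-II design figures.

```python
import math

def stopping_distance(speed, latency, decel):
    """Distance covered during sensing/processing latency plus braking.

    speed in m/s, latency in s, decel in m/s^2 (assumed constant).
    """
    return speed * latency + speed * speed / (2.0 * decel)

def min_lookahead(speed, latency, decel, margin=0.0):
    """Minimum range at which an obstacle must be detected to stop in time."""
    return stopping_distance(speed, latency, decel) + margin

def pixels_on_obstacle(size, distance, fov_deg, n_pixels):
    """Approximate number of pixels spanned by an obstacle of a given size."""
    pixel_angle = math.radians(fov_deg) / n_pixels      # angle per pixel
    obstacle_angle = 2.0 * math.atan(size / (2.0 * distance))
    return obstacle_angle / pixel_angle

# Hypothetical vehicle: 10 m/s, 0.5 s latency, 5 m/s^2 braking.
lookahead = min_lookahead(10.0, 0.5, 5.0)               # 15.0 m
# A 0.3 m rock at that range seen by a 40 degree, 512-pixel-wide camera.
pixels = pixels_on_obstacle(0.3, lookahead, 40.0, 512)
```

Working backwards from a required pixel count at the look-ahead distance is one way such a calculation could yield an image resolution requirement for a desired speed.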

Multiple  technical  approaches  are  currently  under 
investigation [Hannah 89, Matthies 91, Nishihara 90]. Each
emerging  approach  will  be  evaluated  against  the 
engineering  criteria  as  discussed,  and  further  characterized 
as  to  actual  performance  in  a  system  configuration  that  is 
supported  by  the  overall  Demo-II  effort. 

6.1.2  Image  Understanding 

The  four  vehicles  participating  in  Demo-II  will  be 
equipped  with  two  sensor  packages,  one  for  navigation 
(NAV), and one to support the vehicle’s primary mission of
reconnaissance, surveillance and target acquisition
(RSTA). Image Understanding plays an important, if not
dominant, role in each of these primary systems. IU
provides  indicators  in  the  NAV  system,  aiding  the 
determination  of  the  class  of  environment  in  which  the 
vehicle  is  operating,  and  providing  identification  of  road 
signs,  vehicles,  and  terrain  features  supporting  planners  for 
rules  of  the  road,  obstacle  interaction,  and  vision  based 
position  updates  respectively. 

Each  vehicle  will  be  expected  to  navigate  in  on-road  and 
off-road environments utilizing the NAV subsystem, while
simultaneously  surveying  areas  of  interest  for  threat  vehi¬ 
cles  or  targets  with  the  RSTA  subsystem.  Having  deter¬ 
mined  the  existence  of  a  target,  the  RSTA  subsystem  will 
be  designed  to  facilitate  target  segmentation,  tracking  while 
on  the  move,  and  designation  with  a  laser  target  designator. 
Additionally,  target  classification  and  identification  will  be 
provided,  to  the  extent  technologically  feasible,  to  indicate 
to  the  operator  a  suggestion  of  the  type  of  target  confront¬ 
ing  the  vehicle. 

6.1.3  Navigation 

Navigation  research,  coupled  closely  with  results  from  the 
image  understanding  community,  has  formed  the  technical 
basis  for  the  investment  in  the  Demo-II  efforts.  For 
example, Figure 5 illustrates work on YARF, or Yet
Another  Road  Follower  [Kluge  90].  YARF  explicitly 
models  as  many  aspects  as  possible  for  driving  on 
structured  roads.  YARF  has  individual  knowledge  sources 
that  know  how  to  model  and  track  specific  features,  such  as 
road  edge  markings  (white  stripes);  road  center  lines 
(yellow  stripes);  and  shoulders.  YARF  also  uses  an  explicit 
geometry  model  of  the  road,  consisting  of  location  of 
vehicle  on  road;  location  of  stripes;  type  of  stripes  (e.g., 
broken  or  solid);  and  maximum  and  current  road  curvature. 


FIGURE  5. Yet  Another  Road  Follower  (YARF) 


6.1.4  Ladars 

The driving speed of the vehicle will be limited by the
obstacle detection capability, the reliability, and the speed
of the range sensors. Laser radar (LADAR) is an active range
sensor that can be used for both obstacle detection and
landmark identification [Besl 88, Hebert 88]. In general,
LADAR technology seems to be relatively mature and
reliable. Since LADAR employs active illumination,
performance at night should be as good as during the day, if
not better (since the sun is a noise source during the day).
The data density, the percentage of pixels for which a range
estimate is available, should be quite high for moderate
scenes (approximately 90%). Signal-to-noise considerations
cause the standard deviation of range measurements
to be approximately a quadratic function of the true range.
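
As a rough illustration of the quadratic noise growth just mentioned, the sketch below scales a reference standard deviation by the square of the range ratio. The function name and the reference values are assumptions chosen for the example; the text gives no calibration constants.

```python
def range_sigma(true_range_m: float, sigma_ref_m: float, ref_range_m: float) -> float:
    """Standard deviation of a range reading, assuming it grows as range squared.

    sigma_ref_m is the (hypothetical) standard deviation observed at ref_range_m.
    """
    if true_range_m < 0 or ref_range_m <= 0:
        raise ValueError("ranges must be non-negative and the reference positive")
    return sigma_ref_m * (true_range_m / ref_range_m) ** 2
```

Under this model, doubling the range quadruples the expected error, which is one reason obstacle detection distance bounds the safe driving speed.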

The field of view (FOV) of the ERIM laser scanner used in
the ALV program is 80 degrees horizontally and 30 degrees
vertically, with a spatial resolution of 256 pixels
horizontally (about 0.3 degrees/pixel) and 64 pixels vertically
(about 0.5 degrees/pixel). The ERIM scanner has a maximum
range of 20 meters to the first ambiguity interval and a
range resolution quantized to about 8 centimeters. It
produced an image in 0.5 seconds.

A scanner made by Odetics provides 128 x 128 pixels in a
60 x 60 degree field of view, giving an angular resolution of
about 0.5 degrees. It has an ambiguity interval of 9.4 meters
and a quantization interval of 1.8 centimeters.
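
The quoted figures for both scanners can be checked with a little arithmetic. The sketch below is illustrative only: the class and method names are invented here, the 256-level and 512-level range words are assumptions that happen to reproduce the quoted 8 cm and 1.8 cm steps, and the modulo wrap is the usual behavior of a range sensor with an ambiguity interval rather than a documented property of these devices.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class RangeScanner:
    name: str
    fov_h_deg: float
    fov_v_deg: float
    cols: int
    rows: int
    ambiguity_interval_m: float

    def deg_per_pixel(self) -> tuple:
        """Horizontal and vertical angular resolution in degrees per pixel."""
        return (self.fov_h_deg / self.cols, self.fov_v_deg / self.rows)

    def quantization_m(self, levels: int) -> float:
        """Range step if one ambiguity interval is split into `levels` codes."""
        return self.ambiguity_interval_m / levels

    def reported_range(self, true_range_m: float) -> float:
        """True range wrapped into a single ambiguity interval."""
        return true_range_m % self.ambiguity_interval_m


ERIM = RangeScanner("ERIM", 80.0, 30.0, 256, 64, 20.0)
ODETICS = RangeScanner("Odetics", 60.0, 60.0, 128, 128, 9.4)
```

With these numbers, an object 25 meters from the ERIM scanner would be reported at 5 meters, so ranges beyond the first ambiguity interval need to be disambiguated by other means.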

6.1.5  Mission Level Planning

Mission planning utilizes digital terrain data, DTED level
II, augmented with Interim Tactical Data (ITD) containing
soils, roads, drainage, foliage and water characteristics. The
planning system, illustrated in Figure 6, has the following
components:

•  Mission Specification

•  Mission Analysis

•  Mission Planning

•  Mission Execution

•  Plan Evaluation & Revision

Mission  specification  is  supported  by  2-D  graphics  input 
procedures,  augmented  with  3-D  views  of  selected  maps, 
pre-computed  map  analyses,  intelligence  data,  and  current 
weather.  Mission  analysis  is  performed  by  map  analysis 
tools,  which  generate  visibility  maps,  threat  cost  overlays 
and mobility estimates for the mission area. Mission
planning itself utilizes online mission data and the products
of map analysis to generate selected waypoints, refined
constraints and internal plan specifications for each of the
SSVs participating in the mission.

A unique feature of the planning system is the mission
preview that is provided to the operator. Similar in concept to
SIMNET, the operator may view any portion of the
spatio-temporal projection of the mission in a 3-D view, and is
provided with a viewport facility that enables an aerial
perspective of the area of operations during mission preview,
as well as during mission execution. The preview facility
also provides for after-action reporting and mission review,
following capture of vehicle position data, target
nomination, and map annotation identification reports by the
vehicles during mission execution.


FIGURE 6. Mission Planning System


During mission execution, each of the vehicles interacts
with the OCU, providing position updates and video
feedback to the operator on demand. Status data for each of the
vehicles is provided, including window-oriented textual,
2-D graphics position description and RSTA module
viewpoint, and updated indicators marking targets and objects
of interest in the 3-D view.

The plan evaluation and revision system utilizes the ‘flying
carpet’ view to display the mission to the operator. During
simulated mission execution, the operator is provided with
the utilities to refine mission parameters, revise constraints,
and update the individual vehicle plans.
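
One way to picture how the products of map analysis could feed waypoint generation is a cost grid searched for a cheapest route. The sketch below is an assumption-laden illustration, not the planner described here: the overlay format, the weighting, the 4-connected grid, and the use of Dijkstra search are all choices made for the example.

```python
import heapq
from typing import Dict, List, Tuple

Cell = Tuple[int, int]


def combine_overlays(threat: List[List[float]], mobility: List[List[float]],
                     threat_weight: float = 1.0) -> List[List[float]]:
    """Per-cell traversal cost: mobility cost plus weighted threat cost."""
    return [[m + threat_weight * t for t, m in zip(t_row, m_row)]
            for t_row, m_row in zip(threat, mobility)]


def plan_route(cost: List[List[float]], start: Cell, goal: Cell) -> List[Cell]:
    """Dijkstra search over a 4-connected grid; returns cells from start to goal."""
    rows, cols = len(cost), len(cost[0])
    best: Dict[Cell, float] = {start: 0.0}
    parent: Dict[Cell, Cell] = {}
    frontier = [(0.0, start)]
    while frontier:
        dist, cell = heapq.heappop(frontier)
        if cell == goal:
            break
        if dist > best[cell]:
            continue  # stale queue entry
        r, c = cell
        for nr, nc in ((r + 1, c), (r - 1, c), (r, c + 1), (r, c - 1)):
            if 0 <= nr < rows and 0 <= nc < cols:
                new_dist = dist + cost[nr][nc]
                if new_dist < best.get((nr, nc), float("inf")):
                    best[(nr, nc)] = new_dist
                    parent[(nr, nc)] = cell
                    heapq.heappush(frontier, (new_dist, (nr, nc)))
    if goal not in best:
        return []
    route = [goal]
    while route[-1] != start:
        route.append(parent[route[-1]])
    return route[::-1]


# A 3 x 3 area with high-threat cells blocking the direct route.
threat = [[0, 9, 0],
          [0, 9, 0],
          [0, 0, 0]]
mobility = [[1, 1, 1],
            [1, 1, 1],
            [1, 1, 1]]
route = plan_route(combine_overlays(threat, mobility), (0, 0), (0, 2))
```

In this toy area the cheapest route detours around the two high-threat cells; waypoints for a vehicle could then be selected from the cells along the route.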


6.1.6  Enroute  Planning  and  Control 

Once the UGV begins to traverse the planned path,
autonomous navigation functions depend on sensor inputs and
algorithms for processing these inputs to ascertain and
characterize the local environment. This includes local
three-dimensional map generation from stereo vision and/or
LADAR images; model-based environment characterization
with object and scene identification and tracking;
landmark recognition; and global and local map merging to
upgrade the terrain map and determine vehicle location and
orientation. Local path planning uses the terrain map, the
sensed environment, vehicle position, and planning
constraints. Driving algorithms include:

1.  Neural network and feature-based road following,
allowing the vehicle to travel on paths ranging from multilane
paved highways to gravel roads to jeep trails, traverse
intersections, and recognize road branches.

2.  Road and cross-country navigation based on “fuzzy
routes or spline driving” instructions with position provided
by GPS receivers, inertial measurement systems, odometry,
and vision-activated landmark position estimates.

3.  Vision-guided obstacle avoidance.

4.  Execution of local path plan timed to the global mission
plan and appropriately adjusted to terrain, visibility, path
surface condition, automotive and dynamic constraints,
and limitations on concurrent processing of the large
amount of data required for the vehicle navigation and
RSTA function.


6.1.7  Processing  Technology 

DoD has an ongoing research and development program to
develop a succession of prototypes of scalable parallel and
distributed heterogeneous high performance computing
systems and associated software and algorithms for military
applications. These systems are developed with
progressively larger scale, more advanced components, more
dense packaging, and more advanced architecture. Demo-II
will use the most current prototype that is suited for
TUGV applications. Under consideration are the Intel
iWarp and an Image Understanding Architecture (IUA)
processor developed jointly by the University of
Massachusetts and Hughes.

As shown in Figure 7, the Image Understanding
Architecture (IUA) consists of three levels of tightly coupled
parallel processors (for low level, intermediate level, and high
level vision operations [Weems 89]). The IUA
proof-of-concept prototype has been completed, and experience with
both the hardware and extensive software simulations is
guiding the development of a second generation of the IUA
that will be used in the UGV program. Furthermore, the
initial research-oriented software development environment is
currently being replaced by a sophisticated set of
application-oriented tools. Thus, the IUA effort is in the process of
making the transition from an isolated research project to
being in a position of accessibility to the wider community.
In these proceedings, Weems et al. [Weems 92] describe the
current status of the effort and some of the plans for the
future. The second generation IUA retains the basic
three-level structure of the prototype, but the new system will be
computationally more powerful than the prototype and will
be much easier to use for applications. The IUA
development is taking place at three sites: the University of
Massachusetts at Amherst, Hughes Research Laboratories in
Malibu, California, and Amerinex Artificial Intelligence
Inc. at Amherst.

The coterie network of the IUA, shown in Fig. 8, is the part
of the IUA that provides network support for region-based
operations. For example, on the IUA's lowest level, a program
can segment an image into regions, and then issue one
instruction to compute the size of a region, which is
executed simultaneously in all regions. The Coterie Network
explicitly connects pixels in each region (and conversely
inserts breaks between regions).


FIGURE 7. Image Understanding Architecture


FIGURE 8. Coterie Network


6.2  The Demo-II Program


FIGURE 9. Mission Context


The Demo-II program will provide the operator a
man-portable operator control unit (OCU), with capabilities for
displaying a simulated view of the area of interest of the
battlefield, overlaid on request with direct video feedback
from one or more RSTA or NAV system sensors carried by
the SSVs. Figure 9 illustrates the mission planning context
that might be presented to the operator, providing normal
military symbology for boundaries, assembly areas,
objectives, phase lines, and so forth. The detailed terrain
model underlying the planning system will be acquired by
UAV overflight of the area of interest, and transmittal of
overhead imagery to the OCU for further processing to
develop a properly annotated terrain model. The dashed
lines in Figure 9 illustrate the individual vehicle paths that
are computed using “Stealth Terrain Navigation”
procedures, such as those described in these proceedings
by Davis [Davis 92].

The operator interacts with the OCU to provide the
specification of the plan, verifies acceptability of the path
plans from the planner, and commands the OCU to
download the individual vehicle plans onto each of the
vehicles participating in the current mission.

6.2.1  Offensive Operation

Demo-II will operationally demonstrate a movement to
contact by a team of four cooperating UGVs. The team will
conduct a screening operation for a manned force using
bounding overwatch, or similar military tactics.
Semi-autonomous navigation will be conducted over semi-arid
terrain, traversing a distance of some 10 KM at an average
speed, while maneuvering, of 40 KMH. Once in overwatch
positions, the team will employ the RSTA mission module
to observe threats and to locate, detect, assess, and designate
them for engagement by indirect fire.


FIGURE 10. Navlab-II


6.2.2  Defensive Operation

Demo-II will also operationally demonstrate a retrograde
operation by a team of four cooperating UGVs. The team
will screen a manned force by sequentially occupying
preplanned defensive positions, in order to maximize
degradation of enemy forces. Once the UGV-supported
commander determines that enemy forces have been
attrited, the team will move to pre-planned locations in the
main battle area to designate and defeat remnants of
advancing enemy forces.

6.2.3  Architectural Overview

Figure 11 represents the SSV Top-Level Architecture. This
diagram, and substantial accompanying material, was
developed by the SSV Architectural Tiger Team* in
response to a DARPA mandate to construct a high-level
view of the system architecture. The goal of this effort was
to capture the essence of the technical elements of the
program, provide a means of comprehensible explanation,
and resolve the seemingly diminishing, yet ever-present
conflicts between layered and behaviorist approaches to
robotics architectural specification.

The resulting architectural specification provides a clear
framework for describing the technological approach of the
program. Based on a three-tiered hierarchy, the essence of
the specification is to include behavioral generation and
command arbitration elements in the central layer. The
separation into layers is not intended to be a subsystem
partitioning, but rather represents a way of classifying
subsystems and managing their interfaces. Elements of the
layers (boxes in the diagram) represent subsystems.
Elements of the subsystems (boxes underlying the boxes in
Figure 11) are objects.


FIGURE 11. SSV Top-Level Architecture


Layer View

The control layer contains all of the system’s “autonomic”
functions. The highest rate elements of the system reside
here, nominally all servo loop closures and compensation.
Processes in this layer tend to be running in synchronous
real time, facilitating control observability. The
implementation will likely be multi-rate, since different control
systems have differing closure loop rate requirements.
Low-level safety and consistency checks are also included in this
layer.

The local action layer of the system contains mainly
behavioral and reflexive elements. Most data in this layer (i.e.
local world model) is ephemeral. The “world model” at this
level is, for the most part, implicit in the intermediate
representation of this layer. For instance, grid-based maps are
very finely resolved (perhaps 20 centimeters) and
continuously rolled over as the vehicle traverses to prevent data
overload. Functions in this layer operate at a lower rate than
the control layer, but nominally still in synchronous real
time (with provision for event- or signal-driven functions).
This layer may also be multimodal, responding to mode
commands or constraints from the Global Action Layer.

The concept of a “behavior” is loosely defined at this level.
By “behavior,” we refer to the generation of some
command or sequence which may ultimately be committed to
action, depending on the priorities and modes of the
system’s current state. These behaviors vary from the very
simple (such as emergency stop for sudden appearance of
an obstacle directly in the vehicle’s path) to fairly complex
(such as formation following with other vehicles). The
different levels of complexity will in all likelihood require
multi-rate operation when implemented in real-time
processes.

Though at first glance this layer appears to imply a strictly
behaviorist approach, provision is made for a purely
projective planning approach through the implementation of a
“Plan Execution” behavior. The “behavior” becomes
dominant, though some low-level reflexive behaviors, such as
emergency stop, may still be active for vehicle safety
reasons. In this mode, the planning function in the Global
Action Layer develops highly integrated motion and activity
plans, which are executed by Plan Execution as long as the
“Situation Assessment” function is satisfied that current
plans can be satisfied. Lower-level planning functions
(such as trajectory planning) may be implemented in the
Local Action Layer due to information-sharing or timing
requirements. Also in this mode, “Command Arbitration”
is simplified, as there is little to arbitrate other than the
reflexive behaviors.
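
The interplay between behaviors and command arbitration can be sketched as a simple priority scheme. This is a minimal illustration under stated assumptions: the `Proposal` record, the numeric priorities, and the behavior names are hypothetical, and the arbitration in the actual architecture may combine commands rather than select a single winner.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Proposal:
    behavior: str
    priority: int            # larger wins (hypothetical convention)
    speed_mps: float
    steering_rad: float
    reflexive: bool = False  # True for safety reflexes such as emergency stop


def arbitrate(proposals: List[Proposal],
              plan_execution_mode: bool) -> Optional[Proposal]:
    """Select the proposal to commit to action, or None if there is none."""
    if plan_execution_mode:
        # Only Plan Execution and the reflexive behaviors compete in this mode.
        proposals = [p for p in proposals
                     if p.reflexive or p.behavior == "plan_execution"]
    if not proposals:
        return None
    return max(proposals, key=lambda p: p.priority)


proposals = [
    Proposal("plan_execution", priority=5, speed_mps=8.0, steering_rad=0.02),
    Proposal("road_following", priority=7, speed_mps=10.0, steering_rad=0.0),
]
```

Filtering first and then taking the highest priority mirrors the remark that, in plan execution mode, little remains to arbitrate other than the reflexive behaviors.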

The global action layer contains the “cognitive” or
“reasoning” elements of the system. It incorporates object
recognition, ATR functions, landmark recognition and pose
refinement, and other “sensor fusion” tasks. Projective
planning elements (route planning, path planning, and
activity planning) are found in this layer, as well as the
persistent data (the explicitly represented world model) and
any “reasoning” about that model, such as required for
multivehicle coordination, situation assessment, mission
and plan monitoring. These functions are the lowest rate
elements of the system, and tend to be event-driven,
aperiodic processes.

Module  View 

The sensor management module is responsible for managing
and directing the data intensive sensory input from the
navigation  and  RSTA  mission  modules.  The  navigation 
sensor  package  is  equipped  with  a  stereo  pair  color  camera, 
a  stereo  pair  image  intensifier,  a  ladar,  accelerometers  and 
position  encoders  to  report  current  position  and  velocity 
measurements. 

The feature extraction and local mapping module
encompasses a very broad range of utilities, including image
preprocessing, ortho-rectification, local map matching and
building, first alert to significant events, terrain
classification, annotated map maintenance, image stabilization,
subimage selection, image warping, stereo range extraction,
motion analysis, and road edge detection, and also serves as
the input layer for combined neural network/rule based
approaches to navigating in structured domains.

The  behavior  generation  module  supports  cross-country 
trajectory  evaluation,  road  tracking,  object  tracing,  stereo 
system  control,  emergency  action  generation,  and  low-level 
planning  for  obstacle  avoidance. 

The  command  arbitration  module  supports  plan  execution, 
and  arbitration  modules  for  speed,  steering,  sensor  pointing 
and  access. 

The  symbolic  recognition  module  supports  landmark  rec¬ 
ognition,  object  recognition,  map  data  abstraction,  target 
acquisition,  target  tracking,  and  target  recognition. 

The  assessment  and  planning  module  encompasses  sensor 
management,  mission  evaluation,  behavior  prioritization, 
route  and  path  planning,  activity  scheduling,  situation  as¬ 
sessment,  plan  monitoring,  and  multivehicle  coordination. 

World model maintenance includes global map updating
functions, global position maintenance, object list
maintenance and tracking, and also supports map database query
handling.

The effector and mobility controls include functional
control of pan/tilt actuation loop closures, discrete controls
such as sensor on/off, driving controls, vehicle plant
controls, lens and camera controls, vergence controls for the
stereo systems supporting driving and RSTA, manipulator
controls, and safety features and authority limits supporting
robust safety considerations for the vehicles.


6.2.4  Phase  I  and  Demo  A  -  Basic 
Navigation,  Positioning  and  Path 
Following 

Phase  I  of  the  Demo-II  program  will  culminate  in  Demo  A, 
to  be  conducted  in  September  1992.  This  effort  will 
integrate  basic  mechanical  and  electrical  components, 
automotive  controls,  basic  color  and  infrared  visual 
sensors,  GPS,  INS,  odometry,  and  intervehicle 
communications  required  to  support  the  foundation  of 
autonomous navigation. Many known research results will
be  ported  onto  the  STVs,  including  autonomous  inertial 
path  following,  and  autonomous  visual  path  following 
based  on  spectral,  model,  and  neural  net  approaches.  After 
Demo  A,  the  vehicles  built  for  Demo-II  will  be  known  as 
Surrogate Semi-Autonomous Vehicles (SSVs)*.

Demo-A  will  provide  for  independent  verification  of  local 
action  and  control  layer  navigation  “behaviors.”  The  initial 
behavior  to  be  verified  incorporates  the  basic  mobility 
functions  of  accurate  positioning  and  path  following.  These 
follow  from  research  conducted  by  Stentz  and  Whittaker  at 
Carnegie  Mellon  University  known  as  “FASTNAV.” 
FASTNAV  is  based  on  Kalman  state  space  estimation  from 
inertial  measurement  units,  GPS  receivers,  and  odometry. 
The  path  following  behavior  computes  error  terms  from 
desired  spatio-temporal  positions  with  respect  to  the  path 
plan  loaded  on  the  vehicle  by  the  mission  planning  system. 
These  error  terms  are  used  in  conjunction  with  combined 
feedback  and  feed-forward  controllers  to  implement  a  path- 
tracker. 
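
The error terms and the combined feedback and feed-forward control can be sketched as follows. This is an illustrative path tracker, not the FASTNAV implementation: the bicycle-model feed-forward term, the gains, the wheelbase, and the sign conventions (y to the left, positive steering to the left) are all assumptions made for the example.

```python
import math
from dataclasses import dataclass


@dataclass
class Pose:
    x: float
    y: float
    heading: float  # radians, counter-clockwise from the x axis


def tracking_errors(actual: Pose, desired: Pose):
    """Along-track, cross-track and heading error in the desired pose's frame."""
    dx, dy = actual.x - desired.x, actual.y - desired.y
    cos_h, sin_h = math.cos(desired.heading), math.sin(desired.heading)
    along = dx * cos_h + dy * sin_h    # negative: behind the planned position
    cross = -dx * sin_h + dy * cos_h   # positive: left of the planned path
    diff = actual.heading - desired.heading
    heading = math.atan2(math.sin(diff), math.cos(diff))  # wrapped to (-pi, pi]
    return along, cross, heading


def steering_command(actual: Pose, desired: Pose, path_curvature: float,
                     wheelbase_m: float = 3.0, k_cross: float = 0.5,
                     k_heading: float = 1.0) -> float:
    """Feed-forward steering for the planned curvature plus error feedback."""
    _, cross, heading = tracking_errors(actual, desired)
    feed_forward = math.atan(wheelbase_m * path_curvature)
    feedback = -k_cross * cross - k_heading * heading
    return feed_forward + feedback


def speed_command(actual: Pose, desired: Pose, planned_speed_mps: float,
                  k_along: float = 0.5) -> float:
    """Speed up when behind the planned spatio-temporal position, else slow."""
    along, _, _ = tracking_errors(actual, desired)
    return max(0.0, planned_speed_mps - k_along * along)
```

Here `desired` stands for the pose that the downloaded path plan prescribes at the current time, so the along-track error captures the temporal part of the spatio-temporal error.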

1.  The Surrogate Teleoperated Vehicle, built for Early User Test
and Evaluation, will be fiber-optically linked to a single operator,
with vision based tele-operation the principal mode of control.
The Surrogate Semi-Autonomous Vehicle, built for Demo-II, will
utilize the same chassis as the STV, but will replace the optional
driver compartment with a navigation sensor suite, and replace
the fiber-optic payout system with a computer rack carrying up to
60 VME cards supporting on-board navigation, planning and
control decisions.

A second navigation behavior to be evaluated is based on
research in Autonomous Land Vehicle in a Neural
Network, or ALVINN [Pommerleau 90, Pommerleau 92];
see Figure 12. The color reflectance values found in the
video frame buffer are filtered in the blue spectral region,
then  a  weighted  average  of  pixels  is  computed,  mapping  the 
input  video  buffer  to  a  30  x  32  video  input  retina  in 
ALVINN.  Each  of  the  960  nodes  in  the  input  retina  is  fully 
connected  to  nine  hidden  units,  each  of  which  is  in  turn 
fully  connected  to  all  of  the  output  units.  The  output  layer  is 
a  linear  representation  of  the  currently  appropriate  steering 
direction.  The  centermost  output  unit  represents  the  “travel 
straight  ahead”  condition,  while  units  to  the  left  and  right  of 
the  center  represent  successively  sharper  left  and  right 
turns.  The  steering  direction  dictated  by  the  network  may 
serve  to  keep  the  vehicle  on  the  road  or  to  prevent  it  from 
colliding  with  nearby  obstacles,  depending  on  the  type  of 
sensor  input  and  the  driving  situation  that  the  network  has 
been  trained  to  handle. 
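
The network layout described above can be sketched as a small forward pass. The 30 x 32 retina and nine hidden units follow the text; the number of output units, the sigmoid activation, the random weights, and reading the steering direction from the single most active output unit are assumptions made for this illustration and may differ from ALVINN itself.

```python
import math
import random

RETINA_ROWS, RETINA_COLS = 30, 32  # 960 input units, as in the text
N_HIDDEN = 9                       # as in the text
N_OUTPUT = 31                      # assumption: odd, so the center unit is "straight ahead"


def sigmoid(x: float) -> float:
    return 1.0 / (1.0 + math.exp(-x))


def layer(inputs, weights, biases):
    """Fully connected layer with one weight row per output unit."""
    return [sigmoid(sum(w * x for w, x in zip(row, inputs)) + b)
            for row, b in zip(weights, biases)]


def forward(retina, w_hidden, b_hidden, w_out, b_out):
    """Retina -> hidden units -> output units."""
    hidden = layer(retina, w_hidden, b_hidden)
    return layer(hidden, w_out, b_out)


def steering_from_output(output) -> float:
    """Map the most active unit to [-1, 1]: -1 sharpest left, 0 straight, +1 sharpest right."""
    peak = max(range(len(output)), key=output.__getitem__)
    center = (len(output) - 1) / 2
    return (peak - center) / center


# Untrained network with small random weights, for shape checking only.
rng = random.Random(0)
retina = [rng.random() for _ in range(RETINA_ROWS * RETINA_COLS)]
w_hidden = [[rng.uniform(-0.1, 0.1) for _ in retina] for _ in range(N_HIDDEN)]
b_hidden = [0.0] * N_HIDDEN
w_out = [[rng.uniform(-0.1, 0.1) for _ in range(N_HIDDEN)] for _ in range(N_OUTPUT)]
b_out = [0.0] * N_OUTPUT
output = forward(retina, w_hidden, b_hidden, w_out, b_out)
```

Training, which the text describes as recording a human driver, would adjust the weights so that the output activation peaks at the unit matching the driver's steering direction.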

FIGURE 12. ALVINN Architecture


ALVINN  is  trained  by  driving  the  Navlab  over  terrain  of 
interest under control of a human driver. During recent
months, ALVINN has produced some remarkable results,
most  notably  high  speed  operation  at  velocities  up  to  62 
miles  per  hour,  and  long  distance  robustness,  demonstrated 
over  continuous  runs  up  to  22.4  miles  in  length. 

6.2.5  Phase  II  and  Demo  B  -  Integrated 
Navigation,  RSTA  and  Mission 
Planning 

Phase II will culminate in Demo B, to be conducted in May
1993. Phase II will integrate basic map management
functions including single vehicle mission planning,
off-road semi-autonomous navigation, on-road
semi-autonomous navigation, robust high-speed capability,
landmark detection and identification, and rudimentary
behaviors for following rules of the road.

6.2.6  Phase III and Demo C - Two
Cooperating SSVs

Phase  III  will  culminate  in  Demo  C,  to  be  conducted  in 
October 1993. Phase III will integrate multiple vehicle
control from a man-portable operator control unit, utilizing
an updated map management and map building subsystem,
which will rely on map information from overhead visual
imagery,  such  as  obtained  from  an  unmanned  aerial 
vehicle.  Additionally,  this  phase  will  integrate  the  results  of 
research  in  multiple  agent  interactions.  Demo  C  will 
demonstrate  two  semi-autonomous  vehicles  executing 
coordinated  navigation,  including  the  use  of  the  multiple 
vehicle  operator  control  unit. 

Demo  C  will  also  be  utilized  as  an  opportunity  to  evaluate 
the  comparative  merits  of  two  emerging  scalable  parallel 
processors,  the  Intel/CMU  iWarp,  and  the  Hughes/UMass 
Image Understanding Architecture (IUA).

6.2.7  Phase  IV  and  Demo  II  -  Four 
Cooperating  SSVs 

Phase  IV  will  culminate  in  OSD  Joint  Unmanned  Ground 
Vehicle  Demonstration  II,  to  be  conducted  in  October 
1994.  Phase  IV  will  integrate  a  reconnaissance, 
surveillance  and  target  acquisition  subsystem,  with 
technology  derived  from  a  synthesis  of  Image 
Understanding  and  Automatic  Target  Recognition  efforts. 
This  will  result  in  a  multiple  vehicle  mission  subsystem 
with  the  capability  of  robustly  navigating  a  team  of  four 
vehicles  as  a  screening  force.  All  SSVs  will  be  equipped 
with the processor deemed most appropriate during Demo C.
The SSV team will be demonstrated in support of both
offensive  and  defensive  scenarios. 

7.0  Conclusion 

The  Tactical  Unmanned  Ground  Vehicle  Program  is  a 
coordinated  evaluation  and  development  program  with  the 
objective  of  first  fielding  of  an  unmanned  system  by  1998. 
The  TUGV  program  is  being  planned  and  managed  with  an 
awareness  that  it  represents  an  initial  step  in  the  evolution 
and  fielding  of  UGVs  for  combat  applications  and  that  its 
success  or  failure  may  have  far-reaching  consequences. 
RADIUS

The  Government  Viewpoint 
Abstract

The Office of Research and Development,
with major involvement and support from the
Defense Advanced Research Projects Agency
(DARPA), has begun a highly applications-oriented
project intended to provide Image Understanding
(IU) technology in a fully and semi-automated
support system of Human-Machine Interface (HMI)
interactive tools to the Photo Interpreter (PI)
and Imagery Analyst (IA). The central concept of
RADIUS (Research and Development for Image
Understanding Systems) is that of Model
Supported Exploitation (MSE). Two- and/or
three-dimensional site models are developed
and/or maintained by analysts using imagery
source data, imagery-derived information, and
appropriate non-imagery sourced information
(often called “collateral”). IU technology and
necessary non-IU technology are used where
feasible to integrate this base of information,
which can be accessed spatially via the
now-developed site model and displayed or
rendered in support of the IA during the imagery
exploitation and reporting process. As new
imagery is obtained, it may be registered to
the site model, or through the site model to
other images, to support the specific
exploitation tasks and applications which will be
developed. The current effort is the Concept
Definition Phase. This phase will determine the
viability of current technology to perform these
tasks and will define future activities.

1  Introduction 

An important goal within the Intelligence Community is
to provide greater assistance to the IA in analysis of
imagery, as well as in reporting (communicating)
information which is derived during this process. The
increasing availability and deployment of sensors capable
of providing imagery source data in digital form, and the
recent and expected near term technology advances, have
created the environment and established the time to begin
serious planning and research for capabilities,
workstations, software, and tools that can better support
the IA as he/she exploits multiple source image data.

The opportunities to assist in the derivation of
information from softcopy imagery are boundless, and much
has already been done. In RADIUS, we are attempting
to move to a new plateau: one which will provide the
IA a greater level of capability to manipulate both the
imagery being analyzed and information already known
about the site it represents. RADIUS will draw on
previous efforts, and on many years of IU research, to
produce this new capability.

The primary goal of RADIUS is to provide the IA a
new and powerful approach with which to derive
current information from imagery. This approach consists
of modeling sites of interest and using these models in
the imagery exploitation process. RADIUS is, therefore,
focusing on model supported imagery exploitation.

2  Background 

IAs use information from a variety of sources in
exploiting imagery: information obtained from both imagery
and non-imagery sources. This information often has
positional or locational elements in common. In
addition, many issues of importance to national security and
defense policy can be related to a single site, a geographic
area/region, or a collection of associated sites.

Informal site models are now employed in imagery
analysis and presentation tasks in several forms and
renderings: maps, annotated reference images, line
drawings, physical small-scale models, and image perspective
transformations. Models such as these, and far more
formal models, can and will be used more widely in the
imagery exploitation process, but only if the time and cost
of building and maintaining them can be significantly
reduced, and the quality of information maintained.

We believe that MSE, based on three dimensional
models defined for sites of interest, has the potential
for becoming a unifying concept for several elements of
the imagery exploitation process. For example, MSE
can support the development and use of reference aids,
semantic structuring of databases, and development of
imagery-derived products. Taken together, these
elements represent a significant portion of the imagery
exploitation process.

Most efforts to define and test concepts for integrating
the tools used in softcopy imagery exploitation - textual
and formatted databases and softcopy imagery exploitation
workstations - have recognized the need (though
not explicitly) for a model-based approach. One project
investigated ways of accessing “collateral” information
via icons located on graphic representations of the
images being exploited; another developed a concept
of tying relevant information, including exploitation
requirements, to the image at hand. It also considered the
possibility of expressing the information derived from
the image as icons placed on the image and, if
appropriate, generating a narrative report directly from this
process. While these efforts demonstrated interactive
methods for associating imagery with collateral information,
approaches to directly assist the IA in exploiting current
imagery are still needed.

3  Approach 

RADIUS builds on these earlier efforts. It explicitly
postulates site models as an essential element of any future
softcopy exploitation operations concept. An MSE
approach to softcopy exploitation has several benefits:

• It provides a common ground for communication
between technologists interested in this aspect of IU
and end users of the technology.

• It focuses these IU technology interests on an
important national problem.

•  It  encourages  system  developers  to  see  end-user  data 
integration  issues  in  terms  recognizable  by  both  the 
users  and  the  technologists. 

The approach is to take maximum advantage of the
available imagery by improving analyst productivity,
providing a more timely product, and potentially
improving the quality of the product as well. The focus of
the approach is to assist the IA in more effective use of
imagery and other information that is available.

A significant IU research challenge is to find ways of
using pre-knowledge about a site to help build a model
of the site automatically. Such assistance might
consist of existing informal models, such as maps or
drawings, more formal representations of pre-knowledge such
as structured databases, or knowledge in the analyst’s
head, which is used to guide the machine process. Use
of this pre-knowledge may be essential in order to adapt
existing IU techniques to the operational needs of
imagery intelligence within the next decade.

The RADIUS challenge goes beyond that of an
interesting IU problem. While IAs are by nature graphically
oriented, and have always used informal models of the
type noted above, we are just beginning to explore the
concept of building, maintaining, and using more
formal site models. The technology of site model building
and an operations concept for the use of MSE are being
developed concurrently. We have support from many
organizations within the intelligence and imagery
communities. Presently, the RADIUS contractor team and
the U.S. Government are working together in
conducting surveys and experiments with IAs to assess their
exploitation tasks, needs, and preferences. We will soon
be investigating with analysts further definition of
various concepts for HMI support to building/maintaining
models and for rendering and displaying the results of
the modeling process.

As  part  of  this  process,  we  will  examine  several  specific 
applications  of  site  models  in  the  imagery  exploitation 
process  including: 

• Registration, which in this context means the
process of mathematical mapping of points on an image
(from any type of sensor, possibly including ground
photos) to points on a model. Currently, the
registration process in softcopy requires selecting pairs of
tie points (the same features on each image are
identified by the user). This can be a time consuming
task, especially for certain sensors, and when
integrating sensor data. Fully automated or highly
automated registration has important assistance value
to IAs in certain applications and for many
exploitation tasks.

• Site Baselines, which are presently predominantly
textual data, may have graphical components or
depictions that are responsive to the exploitation
requirements to which the IA is responding, and
to the IA’s understanding of these requirements
and his/her individual needs, knowledge, and
experience. Baselines narratively describe the
general layout, certain objects and features, and, where
appropriate, the table of organization and
equipment (TO&E) found at this site. Baselines typically
contain normalcy guidelines and overview sections
where unusual activity or events that have been
identified during the reporting period are noted
and maintained. Subsequent remarks or comments
added to the baseline are responsive to the exploitation
requirements describing information needs and will
accrue over a period of time until updated or a new
baseline is produced. This process provides one of
several mechanisms to communicate historical and
highly current information to the users of
imagery-derived data, to support trend analysis, or to relate
activity and change detected at this site which may
be applicable elsewhere.

• Site Folders, which are reference aids to assist IAs in
the exploitation of imagery of specific sites. These
folders usually consist of imagery relating to a
specific site. Maps, charts, drawings, messages, notes,
and clippings from open literature are often
included. Site folders assist IAs in performing
analysis of changes and significant activity at these sites.
These reference aids are particularly useful for IAs
who are not very familiar with the site in question.
This factor is particularly important when IAs must
be able to exploit imagery from many sites quickly.
Both Site Folders and Baselines exist today in hard
copy. RADIUS site models will provide improved
and integrated access to selected Site Baseline and
Site Folder data which can be or will be maintained
in various forms, predominantly digital, in the
future MSE environment that is anticipated.

• Perspective, geometric modeling, and orientation
are three closely related, overlapping concepts that
have evolved from early experiments with prototypes
and other specialized equipment, and from
operational experience with softcopy exploitation
systems. This early experience shows promise in
enhancing analysis techniques and in presenting and
reporting analysis results. In brief, these concepts
use imagery and models for the following
capabilities. Perspective: “Show me what the site looks like
from this new viewing position.” Geometric
modeling: “Show me what the site might look like
under selected sensor acquisition parameters or
environmental conditions.” Orientation: “Help me
compare this image of the site with another image (or
model, or map) by matching positions on one with
the other,” (or simply, “put a north arrow on the
image”). All three concepts share the need for
processing multiple images and for occasionally
producing 3D renderings of site models and/or registered
images.
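
The tie-point registration described in the list above can be illustrated with a minimal sketch. This is not the RADIUS method: it assumes, purely for illustration, that image coordinates and site-model coordinates are related by a simple 2D affine transform, and it fits that transform by least squares from user-selected tie points. An operational system would need a full sensor model and a 3D site model. The function names (fit_affine, apply_affine) and all coordinates are hypothetical.

```python
from typing import List, Sequence, Tuple

Point = Tuple[float, float]
Affine = Tuple[List[float], List[float]]


def _solve3(a: List[List[float]], b: Sequence[float]) -> List[float]:
    """Solve a 3x3 linear system by Gauss-Jordan elimination with pivoting."""
    m = [row[:] + [rhs] for row, rhs in zip(a, b)]
    for col in range(3):
        pivot = max(range(col, 3), key=lambda r: abs(m[r][col]))
        if abs(m[pivot][col]) < 1e-12:
            raise ValueError("tie points are collinear; cannot fit a transform")
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(3):
            if r != col:
                factor = m[r][col] / m[col][col]
                m[r] = [x - factor * y for x, y in zip(m[r], m[col])]
    return [m[i][3] / m[i][i] for i in range(3)]


def fit_affine(image_pts: Sequence[Point], model_pts: Sequence[Point]) -> Affine:
    """Least-squares affine map from image (x, y) to model (X, Y).

    Each tie point pairs one image location with the matching model location.
    """
    if len(image_pts) != len(model_pts) or len(image_pts) < 3:
        raise ValueError("need at least three matched tie points")
    # Build the normal equations (A^T A) p = A^T b with rows [x, y, 1].
    ata = [[0.0] * 3 for _ in range(3)]
    at_x = [0.0] * 3
    at_y = [0.0] * 3
    for (x, y), (mx, my) in zip(image_pts, model_pts):
        row = (x, y, 1.0)
        for i in range(3):
            for j in range(3):
                ata[i][j] += row[i] * row[j]
            at_x[i] += row[i] * mx
            at_y[i] += row[i] * my
    return _solve3(ata, at_x), _solve3(ata, at_y)


def apply_affine(params: Affine, pt: Point) -> Point:
    """Map one image point into model coordinates."""
    (a, b, c), (d, e, f) = params
    x, y = pt
    return (a * x + b * y + c, d * x + e * y + f)


if __name__ == "__main__":
    # Hypothetical tie points picked by an analyst on an image and a model.
    image = [(0.0, 0.0), (1.0, 0.0), (0.0, 1.0), (1.0, 1.0)]
    model = [(10.0, 5.0), (12.0, 4.0), (13.0, 9.0), (15.0, 8.0)]
    transform = fit_affine(image, model)
    print(apply_affine(transform, (0.5, 0.5)))
```

Using more than three tie points lets the least-squares fit average out small errors in the analyst's point selection, which is one reason automated tie-point selection is attractive.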

4  Programmatic  Approach 

RADIUS is intended to provide research results within
five years that can be developed and deployed
operationally within ten years. The current RADIUS project
is Phase I of a potential three-phase effort. This first
phase will last for two years, and will result in: 1) a
decision regarding the level of maturity of IU for an
operational application, and 2) a design for a
demonstration/testbed system to be built during the second phase.
Phase II will develop a system on which an IA can
utilize MSE in a simulated operational context. During the
later part of Phase II, detailed planning for a system
design which would lead to a Phase III RADIUS-like MSE
operational deployment will begin, so that funding
decisions might be initiated for the development, transition,
and integration of the RADIUS proven and cost-effective
IU capabilities. If Phase II is successful, a decision will
be made regarding a Phase III operational deployment.
The Phase I/Phase II roadmap is depicted in Figure 1.

The Phase I RADIUS contract has two distinct
components: the Concept Definition, which will identify user
needs and define the technical, performance and
programmatic requirements for Phase II; and Technology
Development, which will identify required technologies,
determine their efficacy, determine required
improvements, and either perform in-house research or define
needed research to accomplish the improvements. There
are also three concurrent efforts supporting RADIUS:
1) the RADIUS Common Development Environment
(RCDE), 2) development of an experimental data set,
and 3) three DARPA contracts for research supporting
RADIUS technology development. The RCDE will
provide a common environment for transfer and integration
of all supporting technology during RADIUS Phase II.
The RCDE contains the basic manual tools for
building three dimensional models of terrain, buildings, and
other man-made structures. It also will provide the
necessary interfaces and “hooks” to add IU technology that
more fully automates the human steps now necessary for
building and maintaining site models, and for
supporting selected IA exploitation tasks involving the
application and use of such models. The RCDE is described
in greater detail in a related paper in these proceedings
[Mundy 92]. The experimental data set will be available
for use by all participants in RADIUS-related efforts.
This unclassified data set will consist of a substantial
number of images, associated parameters, and
supporting data which should usefully represent or characterize
the imagery used and site complexity typical of many
operational imagery exploitation environments.

A RADIUS Working Group has been formed to
coordinate the RADIUS efforts discussed above. It meets
bimonthly, and consists of key Government and
contractor personnel participating in the RADIUS and RCDE
projects. Researchers from the DARPA IU
RADIUS-related studies will be added upon award of these
contracts. The Working Group also provides an opportunity
to report progress to those not directly participating in
RADIUS, such as the DARPA IU Community, and
representatives of potential Phase II contractors.

5  Conclusions 

The Phase I RADIUS contract was awarded to Hughes
Aircraft Company in July 1991. Subcontractors are
BDM, Inc., Control Data Corporation, Hughes Research
Laboratories, and the University of Southern California.
The RCDE contract has been awarded to General
Electric, with SRI International as a subcontractor. This
team of seven contractors has been charged with
accomplishing the RADIUS Phase I activities; their plans,
approaches, and project status are being presented at this
workshop or reported in other papers in these
proceedings (Edwards 92).
Abstract

The  Research  and  Development  for  Image 
Understanding  Systems  (RADIUS)  project  will  focus 
IU research on the needs of operational imagery
analysts (IAs). Model-supported exploitation (MSE)
was  selected  by  the  government  as  the  underlying 
concept  to  be  validated,  developed  and  demonstrated 
in RADIUS. IU-based tools for the creation and
maintenance  of  site  models  are  the  core  MSE 
technologies.  Additional  model-based  image 
exploitation  tools  will  also  be  examined  under  this 
contract. 

Hughes  Aircraft  Company  is  teamed  with  BDM 
International,  Inc.  (BDM),  McLean,  Virginia;  Control 
Data  Corporation  (CDC),  Minneapolis,  Minnesota; 
Hughes  Research  Labs  (HRL),  and  the  University  of 
Southern  California  (USC).  BDM  is  coordinating  a 
series  of  experiments  at  the  National  Exploitation 
Laboratory  (NEL);  the  other  team  members  are 
involved  in  technology  development. 

1.  Introduction 

IAs are charged with providing accurate and
timely  information  based  on  a  detailed  examination 
of  image  and  collateral  data.  Collateral  may  take 
the  form  of  reports  derived  from  previous  images, 
descriptions  of  the  imaged  site  and  normally 
observed  activities,  line  drawings,  maps  and  other 
textual  or  graphical  materials.  An  accurate, 
geographically  referenced  3D  site  model  can  serve  to 
tie  imagery  and  collateral  together  in  a  consistent, 
intuitive  manner  and  provide  uniform  database 
access.  In  addition,  site  models  are  expected  to 
support  new  analytic  and  presentation  techniques, 
such  as  data  fusion  and  animated  reports.  Site 
models will also support other IU-based exploitation
tools that will act as intelligent assistants to the IA.
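
The idea of a site model tying imagery and collateral together can be sketched as a small data structure. The example below is illustrative only and is not the RADIUS design: it assumes each modeled object has a rectangular footprint in site coordinates and a list of attached collateral items, and it looks up collateral by location. The class names (SiteObject, SiteModel), the method collateral_at, and the sample records are hypothetical.

```python
from dataclasses import dataclass, field
from typing import List, Tuple


@dataclass
class SiteObject:
    """One modeled feature of a site, with collateral attached to it."""
    name: str
    # Axis-aligned footprint in site coordinates: (min_x, min_y, max_x, max_y).
    footprint: Tuple[float, float, float, float]
    collateral: List[str] = field(default_factory=list)

    def contains(self, x: float, y: float) -> bool:
        min_x, min_y, max_x, max_y = self.footprint
        return min_x <= x <= max_x and min_y <= y <= max_y


@dataclass
class SiteModel:
    """A geographically referenced collection of site objects."""
    site_name: str
    objects: List[SiteObject] = field(default_factory=list)

    def collateral_at(self, x: float, y: float) -> List[Tuple[str, str]]:
        """Return (object name, collateral item) pairs for a site location."""
        return [
            (obj.name, item)
            for obj in self.objects
            if obj.contains(x, y)
            for item in obj.collateral
        ]


if __name__ == "__main__":
    site = SiteModel(
        "Example Site",
        [
            SiteObject("Building A", (0.0, 0.0, 10.0, 10.0), ["prior report"]),
            SiteObject("Access Road", (5.0, 5.0, 50.0, 6.0), ["line drawing"]),
        ],
    )
    # An analyst pointing at a location retrieves everything attached there.
    for name, item in site.collateral_at(5.5, 5.5):
        print(name, "->", item)
```

Once an image is registered to the site model, a cursor position on the image can be converted to site coordinates and passed to a lookup of this kind, which is the sense in which the model gives uniform access to the underlying databases.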

In  selecting  MSE  as  the  focus  for  RADIUS,  the 
government is acknowledging the maturity of IU and
its potential to provide valuable assistance to an
operational intelligence task. By the same token, the
government anticipates that considerable
application-oriented development will be required to
integrate IU capabilities into an experimental
testbed. 

The  goal  of  RADIUS  is  to  demonstrate  the 
viability  of  utilizing  image  understanding 
technology  to  meet  the  needs  and  desires  of  the 
imagery analyst community. RADIUS was conceived
as a two-phased project to develop a demonstration
system testbed that enables IAs to perform
exploitation tasks using IU technology. After
considerable analysis, the government selected MSE
as the operational concept on which RADIUS would
be based. The applications that will be included are
Site Folders and Baselines; Registration; Perspective,
Geometric  Modeling  &  Orientation;  and  the 
capability  to  build  and  maintain  site  models  to 
support  these  applications. 

Phase  1  of  RADIUS  is  devoted  to  the  development 
of  requirements  and  operations  concepts  for  the 
testbed  system  and  to  characterizing,  enhancing,  and 
developing IU technology that is required to meet
those requirements. Phase 2 of RADIUS will be
devoted to developing the testbed system and
continuing the research and development in IU
technology  required  to  meet  the  system  requirements. 

The  following  sections  treat  the  various  elements 
that  must  be  considered  in  arriving  at  a  testbed 
design: IA requirements, current IU capabilities, and
the  possibilities  for  technology  development. 

2. The IA's Perspective

RADIUS  proposes  the  introduction  of  automated 
tools  for  creating,  maintaining,  and  using  site  models 
as  a  means  of  increasing  the  productivity  of  the 
analyst community. Analysts must accommodate an
evolving  exploitation  environment  and  mission, 
including  increases  in  image  volume,  a  rise  in  per- 
analyst  workload,  and  new  sensors  providing 
unfamiliar  data. 
The introduction of MSE tools will require changes
in  the  way  analysts  currently  perform  their  duties. 
These  changes  are  understandably  points  of  concern 
for  the  individual  analyst.  While  the  promised 
benefits  of  MSE  tools  are  large,  the  efforts  required  to 
initially  implement  and  periodically  maintain  the 
associated  site  models  can  also  appear  large.  It  is 
important  to  emphasize  that  the  analysts'  greatest 
concern  is  to  complete  their  assigned  exploitation 
tasks  in  a  timely  and  accurate  fashion  —  a  job  which, 
from  their  point  of  view,  has  been  achievable  in  the 
absence  of  site  models  and  associated  tools  for  years. 

Accordingly,  a  primary  goal  of  RADIUS  must  be  to 
demonstrate  to  the  analyst  community  that  the 
increases  in  productivity  resulting  from  site-model 
supported  exploitation  far  outweigh  the  efforts 
invested  to  create  and  maintain  the  site  models. 
Discussions  and  demonstrations  must  be  framed 
within  the  context  of  a  typical  analyst's  daily  tasks 
and  responsibilities.  The  following  sections  will 
summarize  several  overall  exploitation  functions 
currently  performed  by  analysts,  and  will  explore 
specific  tasks  targeted  under  RADIUS  to  provide 
automated  support. 

2.1. The IA's Tasks

While  the  analyst  community  is  gradually  moving 
toward  softcopy  imagery  exploitation,  the  current 
analyst  environment  is  still  dominated  by  hardcopy 
imagery,  text  based  reporting,  and  text  based 
supporting  information. 

Within  this  environment,  analysts  conduct  a  broad 
variety  of  exploitation  functions  characterized  by 
the  amount  of  time  available  to  produce  an 
intelligence  product.  Time  requirements,  in  turn, 
generally  dictate  the  level  of  detail  that  the  product 
can entail. Phase 1 exploitation is most time critical;
analysts typically review images of many different
sites and must therefore rely on hardcopy maps and
reports for rapid orientation to each site. The main
Phase 1 task, Indications & Warnings (I&W), looks
for evidence of activities that are of immediate
concern for national security. Phase 2 analysts have a
more consistent set of sites for which they are
responsible, have less severe time constraints, and
examine imagery in greater detail. Their duties
include Search, in which imagery covering a large
geographical area is examined for particular objects
or activities, and Surveillance, in which a history of
activity for a particular location is kept. Phase 3
tasks focus on larger intelligence problems, such as
trend analysis or the analysis of new equipment at a
site (i.e., Science & Technology tasks). Textual
readouts for examined imagery are stored in a shared
database (NPIC Data System), while additional
overall summaries and visual/graphical products are
prepared and distributed off-line.

2.2.  Model  Supported  Exploitation 

The  RADIUS  RFP  identified  ten  specific  analyst 
activities  amenable  to  automation.  The  list  includes 
the  Site  Modeling  task  itself,  the  Site  Model  Update 
task,  and  eight  exploitation  tasks  collectively 
identified  as  the  MSE  applications:  Site  Folders  & 
Baselines;  Registration;  Perspective,  Geometric 
Modeling  &  Orientation;  Detection  &  Counting; 
Negation;  Change  Detection;  Trend  Analysis;  and 
Recognition  Guides.  A  subset  of  these  applications 
was  identified  by  the  Hughes  team  as  striking  the 
best  balance  between  applications  offering  the  most 
payoff  to  the  analyst  community  and  project 
resources.  The  following  four  applications  provide 
analysts  a  basic  suite  of  tools  for  accessing  static  and 
previously  extracted  data,  exploiting  multiple 
images  in  a  coordinated  fashion,  and  visualizing  sites 
with  new  degrees  of  freedom. 


Site Modeling & Update: construction and
maintenance of a 2D/3D site model from multiple
images of a site.

Site Baselines & Folders: use of site model as a
graphical front end to access databases of site
information, collateral, history, functional
descriptions; "best image" tutorials to acquaint
analysts with an unfamiliar site.

Registration: matching corresponding points between
an image and a site model. Supports image-based
access to databases; indirectly provides 3D
image-to-image registration, tracking cursors, coordinated
exploitation of multiple images.

Perspective, Geometric Modeling & Orientation:
generation of synthetic imagery to depict a site from
arbitrary view points, provide fly-throughs and
calculate time-lapse sequences.

These four applications provide a significant
package of capabilities for model-supported image
exploitation. The other MSE applications are
considered to be directly dependent upon the
capabilities provided by this set. Additional
research or focused development can build upon this
foundation of generic capabilities to provide more
highly automated tools addressing the other
applications.

2.3. The Successful Introduction of MSE

The question of whether or not the MSE Concept is
useful and desirable is directly related to the IA's
assessment of its payoff relative to the amount of
effort required to initiate, maintain, and use the
provided capabilities. The exploitation community
believes that analysts are hired primarily for
exploiting  imagery,  not  building  and  updating  site 
models.  Analysts  may  be  motivated  to  perform  these 
latter  functions  as  well,  if  the  resulting  benefits 
permit  them  to  perform  their  primary  duties  better 
and/or  more  quickly.  A  successful  RADIUS  Phase  1, 
leading  to  a  Phase  2  testbed  system,  is  dependent 
upon  the  developers  identifying  and  validating 
technologies  that  can  demonstrably  show  a  path 
toward  a  rational  and  consistent  level  of  interaction 
between  analysts  and  automated  tools. 

3. The DARPA IU Research Community Role

DARPA's  role  in  the  RADIUS  project  is  important 
both from the standpoint of DARPA's role as co-sponsor
of the program and as the sponsor of IU
projects which have pushed the envelope of IU
technology to its current level of maturity. Much of
the expertise for advancing the areas of IU
technology  needed  by  RADIUS  resides  in  the  DARPA 
research  community. 

IU research spans the spectrum from being highly
theoretical,  where  results  are  analytical  in  nature,  to 
being  application  specific,  where  a  well-defined 
capability  is  shown  to  exist  over  a  well-specified 
range  of  domain  imagery.  Developed  technologies 
may  be  tested  on  anywhere  from  a  few  available 
images  to  a  fairly  complete  set  of  images  showing  a 
full  range  of  image  acquisition  and  domain  conditions. 
These  technologies  range  from  the  algorithm  to  the 
system level, and support semi- to fully automated
functional  capabilities. 

Most IU research was performed independent of
knowledge  of  RADIUS  needs,  and  without  exposure 
to,  or  use  of,  RADIUS  application  imagery  or  its 
unclassified  equivalent.  Therefore,  one  of  the  first 
tasks  to  perform  during  the  initial  phase  of  RADIUS 
is  the  identification  of  those  technologies  which  can 
support  RADIUS  capability  needs  and  the 
characterization  of  the  range  of  conditions  under 
which  these  technologies  function.  As  a  result  of  this 
process, potential gaps and/or shortfalls in IU
technology  capabilities,  in  terms  of  RADIUS 
technology  needs,  may  be  uncovered. 

With  initiatives  such  as  the  DARPA  Image 
Understanding Environment (IUE) project, work
developed  at  different  institutions  will  be  able  to  be 
exercised  and  combined  relatively  easily.  (See  the 
paper  by  Mundy,  et  al,  "The  DARPA  Image 
Understanding Environment Project", in these
proceedings.) Since no such standard environment or
methodology  currently  exists,  different  systems  will 
need  to  be  exercised  on  different  platforms,  and  data 
formats  may  need  to  be  changed  when  coupling  of 
algorithms  is  required. 


DARPA continues to sponsor IU technology
projects, including RADIUS related IU research. As
gaps and shortfalls are recognized, information will
be fed to DARPA. IU research efforts may be used to
help  fill  these  technology  gaps,  overcome  technology 
shortfalls,  or  support  development  of  application 
concepts  not  being  addressed  under  the  current 
RADIUS  contract.  Hughes  will  work  with  research 
institutions  during  RADIUS  Phase  1  to  ensure  a 
smooth  transfer  of  technology. 

4.  The  Contractor’s  Perspective 

The  goal  of  RADIUS  is  to  integrate  the  needs  and 
desires  of  the  image  analyst  community  with  the 
research performed by the IU community.
Coordination  of  these  non-aligned  interests  is 
considered  a  keystone  to  project  success.  From  a 
system  engineering  standpoint,  intimate  knowledge  of 
analyst  activities  and  priorities  is  vital  to  designing 
a  useful  testbed  system.  From  a  technological 
perspective,  access  to,  and  incorporation  of  leading- 
edge  developments  is  critical  to  designing  a  powerful 
and  robust  system. 

4.1.  Outcome  of  RADIUS  Phase  1 

The  major  goal  of  RADIUS  Phase  1  is  to  develop 
requirements  for  the  testbed  system  that  will  be 
developed  during  the  second  phase  of  the  project.  A 
significant  number  of  the  requirements  must  use  IU
technology  as  a  solution. 

With  limited  time  and  budget,  resources  must  be 
applied  to  proper  areas  of  research  to  ensure  success.
Therefore,  during  Phase  1  the  project  will  first  look
at  existing  technologies  that  address  the  RADIUS 
problems. 

4.2.  Methodology 

The  large-scale  evaluation  and  gradual 
optimization  (LEGO)  methodology  being  used  on 
RADIUS  was  designed  to  accommodate  the  time  and 
level-of-effort  constraints  inherent  to  research 
projects.  It  assumes  that  maximal  return  on 
investment  is  realized  when  work  is  focused  on  the 
enhancement  of  applicable  pre-existing  technologies 
and  the  development  of  new  capabilities  where 
required. 

The  first  phase  of  LEGO  requires  a  large-scale 
evaluation  of  both  system  needs  and  the  current  state 
of  research  technology.  The  MSE  tasks  selected  for 
RADIUS  Phase  1  will  be  decomposed  to  their 
functional  components,  making  a  comparison  with 
existing  IU  technology  possible.  IU  systems  will  be
mapped  onto  the  functional  components  to  determine
potential  areas  for  MSE  automation,  as  well  as  those
areas  for  which  no  IU  technology  exists.  In  addition,
the  IU  systems  will  be  characterized  to  identify
their  operational  range  of  conditions.
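
As an illustration only, the mapping step described above can be sketched in a few lines of Python. The function name, the component labels, and the assignment of systems to components are assumptions made for this sketch. "vehicle counting" is a hypothetical component included only to show how an uncovered area would surface. None of this is taken from the RADIUS methodology documents.

```python
# Hypothetical sketch of mapping IU systems onto functional components.
# Component names and system assignments are illustrative assumptions.

def map_systems_to_components(components, systems):
    """Return (coverage, gaps).

    components: list of functional component names from the task decomposition.
    systems: dict mapping an IU system name to the set of components it addresses.
    coverage: dict mapping each component to the systems that address it.
    gaps: components that no system addresses.
    """
    coverage = {component: [] for component in components}
    for name in sorted(systems):
        for component in systems[name]:
            if component in coverage:
                coverage[component].append(name)
    gaps = [component for component in components if not coverage[component]]
    return coverage, gaps


components = [
    "model-to-image registration",
    "road detection",
    "vehicle counting",  # hypothetical component with no candidate system
]
systems = {
    "Road Network Finder": {"road detection"},
    "Corridor Finder": {"road detection"},
    "Extended Vertex Pair Matching": {"model-to-image registration"},
}

coverage, gaps = map_systems_to_components(components, systems)
print(coverage)
print(gaps)
```

Components with more than one candidate system are the natural place to compare characterization results, while the `gaps` list corresponds to areas for which no IU technology exists.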

RADIUS  functional  requirements  will  be 
identified  through  a  series  of  experiments  being 
conducted  with  image  analysts  at  the  NEL.  The 
extent  to  which  these  requirements  are  not  met  by  the 
characterized  systems  will  motivate  technology 
enhancement  and  development.  Technology 
enhancement  and  development  proposals  will  be 
prioritized  based  on  their  importance  to  IAs,  as
determined  by  the  concept  validation  experiments, 
and  their  potential  for  realization  within  the 
RADIUS  time  frame. 

4.3. RADIUS-Related IU Research Community Work

In  this  section  we  include  the  preliminary  list  of 
technologies  we  believe  can  support  needed  RADIUS 
capabilities,  and  how  they  map  into  the  structure 
referred  to  in  the  previous  section.  This  is  by  no 
means  a  final  or  complete  list  and  we  encourage  the 
community  to  help  us  to  identify  other  work  which 
provides  needed  capabilities.  Likewise,  we  would 
appreciate  as  much  feedback  as  possible  with  respect 
to  the  technologies  we  included  in  order  to  help  us  to 
identify  how  best  they  fit  into  the  RADIUS  puzzle. 

4.3.1. Registration Application

Registration Area Reduction:

HAC | SCORPIUS | Cloud Detection

Model-to-Image Registration:

GE | Thompson & Mundy | Extended Vertex Pair Matching
GE | Nguyen, Mundy & Kapur | Constraint System
SRI | Barrow, Tenenbaum, Bolles and Wolf | Chamfer Matching
HAC | SCORPIUS | Image to 2D Model
USC | Medioni & Nevatia | Image to Map - linear features
HRL | Silberberg, Harwood & Davis | Image to 3D Model

4.3.2. Site Modeling Application

Image-to-Image Registration:

CMU | Perlant & McKeown | Coarse Registration via spatial database. Fine Registration using automatic selection of control points (e.g., Shadow Corners)
USC | Medioni & Nevatia | Registration using linear features
HRL | Zikan & Silberberg | Local feature based matching


Object Modeling:

Broad Area Delineation:

CMU | McKeown & McDermott | Rule-based segmentation and interpretation system for Airports
Region-Based Segmentation | Methodologies based upon texture, intensity, etc.

Building Detection and Delineation:

CMU | Irvin & McKeown | BABE, Shade, Grouper, Shave,...
USC | Huertas & Nevatia | Building Detection using Shadows
USC | Mohan & Nevatia | Rectangular roof detection and stereo matching
CMU | Herman & Kanade | 3D Mosaic Scene Understanding System
UofM | Venkateswar & Chellappa | Building Detection using Shadows
SUNY | Liow & Pavlidis | Building Extraction using Shadows
UofM | Hwang, Davis & Matsuyama | 2D detection of houses and roads

Roads and RRs:

CMU | McKeown | Road Network Finder
HAC | Gee | Corridor Finder
SRI | Fischler, Tenenbaum, & Wolf | Roads and RRs
HRL | Silberberg, Kim & Olin | Road Detection
UofM | Canning, Kim, Netanyahu & Rosenfeld | Road Detection
UofM | Hwang, Davis & Matsuyama | Detection of Roads and Houses

Runways and Taxiways:

CMU | McKeown & McDermott | SPAM
USC | Huertas, Cole & Nevatia | Runway detection

Ports:

USC | Huertas | Piers

Terrain Elevations:

CMU | Hsieh, Perlant & McKeown | S1 and S2 Stereo Systems
USC | Cochran & Medioni | Area/Feature Based Stereo System
SRI | Hannah | Stereosys
SRI | Barnard | Cyclops Stereo System
GD | Stereo System
MIT | Grimson | Feature Based Stereo

4.4.  Characterization  of  Existing  Technology 

The  major  objective  of  the  system  characterization 
task  is  to  provide  the  RADIUS  community  with 
performance  measures  of  various  image  understanding 
systems.  These  characterizations  will  be  used  as 
input  to  determine  technology  shortfalls  with  respect 
to  RADIUS  system  requirements.  A  set  of  metrics  will 
be  determined  for  each  class  of  system  whose 
performance  is  to  be  characterized.  For  example, 
road  finders  would  form  one  class  of  systems,  with 
percent  of  roads  found  forming  one  of  the  performance 
metrics. 
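
A minimal sketch of the example metric follows, assuming that each detection has already been matched to a reference road identifier (or to an identifier absent from the reference set when it is a false alarm). The matching step, the function names, and the sample identifiers are assumptions of this sketch; the text above specifies only that percent of roads found is one of the metrics for the road finder class.

```python
# Sketch of a per-class metric: percent of reference roads found.
# Assumes detections were already matched to reference road identifiers.

def percent_found(reference_ids, detected_ids):
    """Percentage of reference roads that appear among the detections."""
    reference = set(reference_ids)
    if not reference:
        raise ValueError("reference set is empty")
    found = reference & set(detected_ids)
    return 100.0 * len(found) / len(reference)


def false_alarm_count(reference_ids, detected_ids):
    """Number of detections that match no reference road."""
    return len(set(detected_ids) - set(reference_ids))


reference = ["r1", "r2", "r3", "r4"]   # hypothetical ground-truth roads
detected = ["r1", "r3", "x9"]          # "x9" matches no reference road

print(percent_found(reference, detected))
print(false_alarm_count(reference, detected))
```

Reporting a complementary figure such as the false alarm count alongside percent found keeps a system from scoring well simply by labeling everything a road.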

To  support  testing  of  IU  systems  in  unclassified
environments,  a  set  of  model  board  and  aerial  images
will  be  provided  by  the  government.  The  intent  of 
the  test  set  is  to  provide  numerous  examples  covering 
a  variety  of  site  types,  plus  a  range  of  natural 
conditions  such  as  clouds,  haze,  textured  areas  and 
water. 

In  most  cases,  the  developer  of  the  program  will 
perform  the  first  set  of  performance  tests.  In 
cooperation  with  the  government,  a  mechanism  will 
be  established  to  distribute  the  unclassified  imagery 
set  to  interested  IU  developers.  Based  on  the  level  of
performance  achieved  on  the  unclassified  test  set,  the 
system  will  then  be  ported  to  a  classified  platform 
for  further  tests  against  classified  imagery.  A  final 
report  of  the  system's  performance  characterization 
will  be  made  for  each  system  tested.  The  system’s 
developers  will  be  kept  involved  in  the  entire  testing 
process.  In  selected  cases,  RADIUS  personnel  may 
visit  the  developer  for  orientation  and  training  on 
their  system.  As  testing  progresses,  the  developer 
will  be  consulted  for  help  with  parameter  setting, 
problems,  and  advice  as  to  what  is  required  to 
optimize  system  performance. 

4.5. Development of Requirements - IA Experiments

In  offering  automated  tools  for  image  exploitation, 
it  is  necessary  that  the  customer  and  developer 
specify  the  capabilities  such  tools  will  provide  and 
the  requirements  they  will  meet.  While  these 
requirements  define  the  scope  of  the  work  to  be  done, 
perhaps  more  importantly,  the  development  of  such 
requirements  also  offers  an  opportunity  to  validate
that  the  proposed  tools  will  indeed  provide 
capabilities  that  are  useful  and  amenable  to  the  end 
user.  As  one  of  its  main  goals,  RADIUS  will  consult
the  analyst  community  as  a  sounding  board  to 
validate  proposed  areas  of  technology  development. 

A  formal  series  of  Imagery  Analyst  experiments 
will  be  coordinated  by  BDM  to  be  performed  at  the 
National  Exploitation  Laboratory  (NEL).  The 
experiments  will  seek  to  identify  and  characterize 
key  IA  issues  and  concerns  regarding  the  use  of
proposed  MSE  tools.  Results  from  these  experiments 
will  be  incorporated  into  the  Phase  2  system 
requirements  document  and  will  help  to  determine 
which  IU  technologies  will  be  developed.

4.5.1.  Sites  and  Site  Objects 

MSE  tools  developed  for  imagery  analysts  must  be 
designed  in  light  of  the  types  of  objects  that  IAs  will
require  in  site  models.  The  required  content  for  a  site
model  will  vary;  it  is  clearly  influenced  by  the
geographical  location  of  a  particular  site,  the 
activities  and  objects  typical  at  that  site,  the 
intelligence  significance  of  that  site,  and  the  nature 
of  the  exploitation  efforts  that  will  be  directed 
against  the  site. 

Characterizing  site  model  content  has  important 
ramifications  for  the  technologies  that  are  being 
pursued  under  RADIUS.  Site  modeling  tools  should 
be  capable  of  operating  on  as  many  types  of  objects  as 
possible  to  reduce  the  amount  of  manual  modeling  and 
correction.  A  preliminary  survey  of  the  6  RADIUS 
sites  identified  a  large  number  of  terrain  types,  object 
classes,  object  densities,  weather  conditions  and 
viewing  geometries.  While  only  some  objects  at  a  site 
will  be  modeled,  all  objects  visible  at  the  site 
contribute  to  the  overall  complexity  of  the  image 
being  exploited;  all  of  the  objects  produce  signatures 
and  features  that  automated  IU  algorithms  must  sift
through  in  identifying  the  significant  objects.  Note
that  most  imagery  used  by  IU  developers  does  not
reflect  the  complexities  evident  in  the  customer’s
typical  operational  imagery.  It  is  important,  then,
for  developers  of  RADIUS-related  IU  to  consider  the
effect  of  differences  between  this  operational 
imagery  and  any  test  imagery  they  are  using  to 
validate  their  work. 

The required content of a site model is likely to be
augmented to include objects other than those
required by analysts. Such objects will include, for
example, those required by the IU algorithms making
up the automated exploitation tools. The
Registration application is a good example of this;
algorithms may require that various insignificant,
large scale objects be added to the model to expedite
an initial coarse registration. Even when refining the
registration with smaller objects, the algorithms may
require that specific objects with strong signatures be
included in the model despite their unimportance.

A  primary  goal  of  the  first  NEL  analyst 
experiments  will  be  to  assess  the  specific  types  of 
object  classes  that  analysts  want  to  see  in  site  models. 
The  results  will  aid  in  prioritizing  the  types  of  IU
systems  that  will  be  pursued  for  validation  under 
Phase  1. 

4.5.2. Accuracy and Level of Detail

Related  to  the  question  of  what  objects  should  be  in 
a  site  model  is  the  question  of  how  accurately  they 
must  be  modeled  and  to  what  level  of  detail.  As 
before,  the  tradeoff  here  is  between  the  quality  of  an 
IU  product  (e.g.,  a  site  model  or  a  registration  result)
and  the  resources  required  to  deliver  that  quality  of
product.  An  analysis  of  the  latter  must  assess  the
likelihood  of  successfully  developing  the  required  IU
technology,  as  well  as  the  cost  and  time  required  to  do 
so.  Other  related  factors,  considered  less  significant 
by  the  customer  at  this  early  stage  of  the  project,  are 
the  predicted  run-times  of  the  automated  tools  and 
the  hardware/software  environment  required  to 
support  them. 

Another  facet  of  the  first  NEL  experiment  will 
seek  to  investigate  these  issues.  Imagery  analysts 
will  be  formally  consulted  to  assess  the  level  of 
accuracy  and  level  of  detail  that  they  need  when 
constructing  site  models  and  using  them  with 
associated  automated  capabilities.  Note,  however, 
that  the  final  requirements  for  accuracy  and  detail 
are  likely  to  continue  to  evolve  as  the  program 
progresses.  It  is  anticipated,  for  example,  that 
integration  and  testing  of  Phase  2  testbed  components 
will  bring  out  and  motivate  additional  issues  that 
will  refine  requirements. 

4.5.3. Analyst Interface

As  previously  discussed,  analysts'  acceptance  of 
Site  Modeling  and  the  MSE  Applications  will,  to  a 
large  extent,  be  determined  by  their  perception  of 
whether  or  not  there  is  substantial  benefit  in  using 
and  maintaining  these  capabilities.  An  important 
goal  of  RADIUS  will  be  to  assess  what  role  an 
analyst  is  willing  to  assume  in  constructing, 
maintaining,  and  using  site  models. 

Some issues coming out of this assessment are
likely to be motivated by human psychology; some
may appear irrational or contradictory from the view
of technology developers. For example, previous
analyst interface research has revealed a dislike for
the manual selection of tie-points for the purpose of
image-to-image registration. Based upon this result,
it is likely that this process will have to be totally
automated in order to be accepted. If, on the other
hand, an automated registration tool can initially
align a site model to an image with a relatively
"good" accuracy, analysts may be willing to make
minor translational corrections to fine tune it, thus
relieving the automation requirements of this task.

The  second  NEL  experiment  will  investigate 
analysts'  preferences  in  interfacing  to  automated 
exploitation  tools.  A  goal  will  be  to  assess  their 
willingness  to  assist  automated  tools  in  the  task  of 
site  modeling;  the  types  and  amounts  of  information
an  analyst  is  willing  to  supply  will  be  explored. 

4.6.  Strawman  System  Requirements  and  Capabilities 

Requirements for the RADIUS Phase 2 testbed
will evolve out of a compromise between IU
capabilities and IA requirements. While the IAs
may desire a completely automatic system to
generate site models, they would probably not be
willing to accept the level of performance for such a
system, given the current state of technology.
Assuming a semi-automated system, a balance must be
struck between the level of performance IU can
provide and the level of interaction required of the
IA.

4.6.1. Quality of IA Interaction

One ingredient of this balance involves the
perceived role of the IA in the site modeling process.
On the one hand, the IA can have extensive
interaction with an IU system during the modeling
process, providing it with sufficient information so
that its results require minimal correction or editing.
On the other hand, highly automated systems may
keep IA interaction to a minimum, leaving the IA to
monitor the final output and correct it as needed. We
anticipate that an IA's preference, as recorded during
the IA experiments, will depend on the error rate of
the system; the IA would prefer a more interactive
approach rather than spend much time correcting
errors.

4.6.2.  Targeted  Level  of  Automation 

With these issues in mind, the strawman level of
automation draws from both styles of interaction.
The strawman defines a pattern of interaction for the
IA that is consistent across the various site modeling
tasks. It also takes a conservative view of the IU
capabilities available to solve the range of site
modeling tasks. This estimate is not intended to
place an upper limit on the level of automation made
available to a site model builder. It is intended to
focus attention on the issue of consistency of user
interaction, an important factor in IA acceptance.

The strawman level of automation requires the
user to specify some attributes of the object to be
modeled, such as its shape and where that object is in
the image. For each site model object, the IA selects a
geometric primitive. Simple, frequently encountered
shapes will be represented within the site modeling
tool. For complex shapes, a specification tool to
compose custom primitives will be made available to
the IA. Upon selecting the shape, the IA delineates
the image area within which the site object exists.
Once the geometric primitive and image area are
specified, the system automatically determines the
orientation and scale of the site object, and returns the
parameters that reflect the optimal fit between site
object and geometric primitive. This result is
graphically displayed to the IA, who may then
correct the fit, embellish it with additional detail,
or attach semantic information.
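
To make the interaction pattern concrete, the following sketch estimates the orientation and scale of a rectangle primitive from feature points lying inside the analyst-delineated image area. It is a moment-based (principal axis) estimate chosen for brevity, not the optimal fitting procedure the strawman refers to, which the text does not specify. The function name, the returned parameter names, and the sample points are all assumptions of this sketch.

```python
import math

# Sketch: estimate orientation and scale of a rectangle primitive from
# 2D feature points (e.g. edge points) inside the delineated image area.
# Uses second-order moments; this is an illustrative stand-in, not the
# fitting method of any RADIUS system.

def fit_rectangle(points):
    """Return centroid, orientation (degrees) and extents of the points."""
    n = len(points)
    if n < 2:
        raise ValueError("need at least two points")
    cx = sum(x for x, _ in points) / n
    cy = sum(y for _, y in points) / n
    sxx = sum((x - cx) ** 2 for x, _ in points) / n
    syy = sum((y - cy) ** 2 for _, y in points) / n
    sxy = sum((x - cx) * (y - cy) for x, y in points) / n
    # Angle of the principal axis of the point distribution.
    theta = 0.5 * math.atan2(2.0 * sxy, sxx - syy)
    c, s = math.cos(theta), math.sin(theta)
    # Project points onto the principal axis and its normal.
    along = [(x - cx) * c + (y - cy) * s for x, y in points]
    across = [-(x - cx) * s + (y - cy) * c for x, y in points]
    return {
        "centroid": (cx, cy),
        "angle_deg": math.degrees(theta),
        "length": max(along) - min(along),
        "width": max(across) - min(across),
    }


# Hypothetical input: the four corners of a 4 x 2 building footprint.
corners = [(0.0, 0.0), (4.0, 0.0), (4.0, 2.0), (0.0, 2.0)]
fit = fit_rectangle(corners)
print(fit)
```

In the strawman, the returned parameters would be displayed graphically so that the IA can correct the fit, for instance by a small translation or rotation, before attaching semantic information.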

We  expect  that  this  strawman  would  apply  to  a 
large  subset  of  the  complete  range  of  image  conditions 
and  site  object  types.  Optimal  image  conditions  and 
site  objects  composed  of  simple  geometric  shapes  are 
probably  amenable  to  a  more  completely  automated 
approach,  while  maintaining  the  required  level  of 
performance.  More  complex  conditions  would, 
conversely,  require  more  IA  interaction.  Therefore,
we  may  expect  several  approaches  to  modeling  the 
same  site  objects  to  be  included  within  any  deployed 
system. 

4.7.  Technology  Development 

The identification of technology areas to enhance
or develop follows from the decomposition of MSE
tasks, the characterization of existing technology
and the determination of IA requirements. It will
then be determined whether any of the characterized
systems meets each of the application requirements.
If any requirement is inadequately met, there is a
shortfall. When it is not met at all, there is a gap.
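
The distinction between a shortfall and a gap can be written down as a small classification rule. The sketch below assumes a single scalar metric per requirement on which higher values are better, and treats a requirement that no characterized system addresses as a gap. The function name, the threshold convention, and the sample figures are assumptions of this sketch.

```python
# Sketch: classify a requirement as met, shortfall, or gap.
# Assumes one scalar metric per requirement, higher is better.

def classify_requirement(required, achieved_by_system):
    """Classify a requirement against characterized system performance.

    required: minimum acceptable value of the metric.
    achieved_by_system: dict mapping system name to its measured value;
        empty when no characterized system addresses the requirement.
    """
    if not achieved_by_system:
        return "gap"            # not met at all
    best = max(achieved_by_system.values())
    if best >= required:
        return "met"
    return "shortfall"          # addressed, but inadequately


# Hypothetical figures, e.g. percent of roads found.
print(classify_requirement(80.0, {}))
print(classify_requirement(80.0, {"system A": 65.0, "system B": 72.5}))
print(classify_requirement(80.0, {"system A": 65.0, "system B": 91.0}))
```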

RADIUS must next make recommendations
concerning the course of action to be followed with
respect to each shortfall and gap. A cost/benefit
analysis will be performed based on criteria such as
importance to IAs, frequency of occurrence during MSE
tasks, implications to other MSE applications,
availability of workarounds, and cost. Determining
the answers to some of these questions will prove to be
difficult, particularly cost and risk. For this reason,
the DARPA researchers, especially those whose
work is applicable to the MSE tasks, will be enlisted
to help answer these questions on every occasion
possible. Upon sponsor approval of the selected
technology development areas, Hughes will then
make recommendations as to who should pursue each
enhancement. This will be based on familiarity with
the technology, availability of people and
equipment, cost effectiveness, and whether they are
currently under contract.
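
One simple way to turn the listed criteria into a ranking is a weighted benefit-to-cost score, sketched below. The scoring formula, the equal weights, the 0-to-1 rating scale, and the two candidate items are assumptions made purely for illustration. The text above names the criteria but does not say how they are to be combined.

```python
# Sketch: rank shortfalls and gaps by a weighted benefit-to-cost score.
# Formula, weights, and sample ratings are illustrative assumptions.

def priority_score(item, weights):
    """Weighted benefit divided by cost; ratings are on a 0-to-1 scale."""
    if item["cost"] <= 0:
        raise ValueError("cost must be positive")
    benefit = (
        weights["importance"] * item["importance"]      # importance to IAs
        + weights["frequency"] * item["frequency"]      # frequency in MSE tasks
        + weights["reuse"] * item["reuse"]              # use in other MSE applications
        + weights["no_workaround"] * (0.0 if item["has_workaround"] else 1.0)
    )
    return benefit / item["cost"]


def rank(items, weights):
    """Return items ordered from highest to lowest priority."""
    return sorted(items, key=lambda it: priority_score(it, weights), reverse=True)


weights = {"importance": 1.0, "frequency": 1.0, "reuse": 1.0, "no_workaround": 1.0}
candidates = [  # hypothetical shortfall and gap entries
    {"name": "B", "importance": 0.4, "frequency": 0.3, "reuse": 0.2,
     "has_workaround": True, "cost": 1.0},
    {"name": "A", "importance": 0.9, "frequency": 0.8, "reuse": 0.5,
     "has_workaround": False, "cost": 2.0},
]

ranked_names = [it["name"] for it in rank(candidates, weights)]
print(ranked_names)
```

Because cost and risk are acknowledged above as the hardest inputs to estimate, a ranking of this kind is best read as a starting point for discussion with the researchers rather than as a decision rule.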

5. Summary

Our approach to reaching Phase 1 goals includes
concept validation experiments to determine critical
IA requirements, IU system characterizations to
assess the maturity of key technologies, and the
identification of critical areas of IU research to be
pursued. We look forward to working cooperatively
with the IU community to accomplish the RADIUS
Phase 1 goals.

Abstract

The Image Understanding Environment (IUE)
project is a five-year program, sponsored by
DARPA, to develop a common software environment
for the development of algorithms and
application systems. This paper reviews the
history of the project and provides an overview
of some of the data structures which are currently
evolving as a specification for the IUE. The ultimate
goal of the project is to provide the basic
data structures and algorithms which are required
to carry out state-of-the-art research in
image understanding.

1  Introduction 

1.1 The Nature of IU Research Software

Image understanding research is typically carried out by
individual contributors who develop a specialized software
environment for implementing and evaluating new
approaches and concepts. In most cases, the software
environment is thrown away as new approaches are considered,
the researcher moves to a new research problem,
or new hardware platforms are introduced. A clear
example of the latter case is the decline of the Symbolics
Lisp Machine as a platform for rapid prototyping of IU
algorithms.

*The committee effort has been funded by numerous
DARPA grants and associated funding partners under
contract to each institution.

This volatility of research software is not particular to
image understanding research, but is perhaps characteristic
of software systems in general. These systems evolve
rapidly with time as new languages, programming techniques
and hardware platforms emerge. However, this
volatility and diversity take a heavy toll on the efficiency
of the image understanding research community.
It is difficult to share and evaluate new research ideas because
they can be demonstrated only in the specialized
environments generated by the originator of the idea. A
steep learning curve is encountered in any attempt to
acquire and operate these specialized and often fragile
environments.

Another major source of inefficiency is that most IU
experiments require an extensive set of rather standard
algorithms and data structures to reach the level of processing
and feature extraction upon which the new ideas
can be tried. Typical examples are image smoothing,
edge detection, curve fitting, feature grouping and camera
calibration. Since each environment is somewhat different
in its design of the basic data structures, it is necessary
for the algorithm developer to reimplement the
basic IU infrastructure in order to evaluate a complex
algorithm. This process is repeated over and over at
dozens of IU research laboratories each year as new research
projects start.

Finally, it is now widely recognized that significant applications
of IU research can only be realized in the context
of large systems. Examples of such applications are
the UGV (Unmanned Ground Vehicle) project, which will
exploit image understanding for navigation and surveillance
tasks, and the RADIUS (Research and Development
for Image Understanding Systems) project, which is focused
on the application of Image Understanding to photointerpretation.
These extensive application projects cannot
be realized without a common software environment to
provide the integration of diverse system components,
developed at different research institutions.

1.2 Initiation of the IUE

In late 1989, Rand Waltzman of DARPA, then manager
for Image Understanding programs, conceived and developed
a new program called I4US. The decoding of this
acronym is Intelligent Integrated Interactive Image Understanding.
The name has since been shortened to IUE,
for Image Understanding Environment.

The IUE program was announced at a meeting for
DARPA Principal Investigators in Scottsdale, Arizona
at the end of February, 1990. The project goal, as announced
by Rand, was a five-year program to design
and implement a common software environment for the
development and demonstration of image understanding
algorithms and techniques.

At the time of the meeting, there were three environments
which had reached a reasonable maturity and had
attracted a sufficient number of users to demonstrate
that the idea of an IUE was feasible. These three systems
and their general characteristics are summarized below.

• The Cartographic Modeling Environment (CME)
has been developed by Lynn Quam of SRI International
over the past decade [1]. The focus of
CME is the efficient handling of large images and the
representation of configurations of 3D object models
on the earth's surface under perspective viewing,
in support of site modeling for cartography and
photoreconnaissance. CME is implemented in Lisp
on the Symbolics Lisp Machine. CME is currently
playing a central role in the RADIUS program as a
proposed development environment for RADIUS experiments.
This development environment is called
RCDE, or RADIUS Common Development Environment.

CME is currently being ported to the SUN-UNIX
platform under X-Windows and Common Lisp.

• KB Vision (Knowledge-Based Vision) is a product
of Amerinex Artificial Intelligence Inc., and is loosely
based on the VISIONS system developed at the University
of Massachusetts at Amherst over the last
fifteen years [3]. KB Vision combines Lisp and C
components. C is used primarily for numerical and
image feature processing tasks while Lisp is for high
level reasoning about image content. KB Vision
is widely distributed among users of image understanding
technology and provides an effective interface
for developing image feature segmentation and
feature grouping algorithms. KB Vision is currently
being extended to provide the programming environment
for the Image Understanding Architecture
or IUA, a highly parallel, multi-granularity design.

• Power Vision was developed by Advanced Decision
Systems (ADS), now a division of Booz, Allen and
Hamilton, starting about 1986 [2]. Power Vision
provides an object-oriented programming environment
in Symbolics Flavors for a wide range of Image
Understanding data structures. An effective user
interface has been developed for displaying graphical
interpretations of relationships between image
features. Power Vision was used by ADS in many
of their application studies, including their work on
the Autonomous Land Vehicle Project (ALV).

These systems provided the conceptual basis for the IUE,
but the scope of the IUE project is much broader than
any of these existing systems. The goal is to provide
an environment which can cover the full spectrum of
IU research and support both C and Lisp application
development.

To initiate the project, Rand Waltzman convened and
chaired three meetings during the 1990-1991 period to
develop a consensus in the IU community about the
requirements for the IUE.¹ A number of teams were
established to suggest specific application scenarios and
propose skeleton architectures for the IUE. In April 1991,
these team reports were reviewed, and the IUE committee
was formed from representatives of each team. Since
April, the IUE committee has met five times and has
produced a requirements and design specification for the
IUE. The design consists of over a hundred classes at the
time of this writing and over 400 pages are required to
document the classes! The extensive nature of the design
illustrates one facet of the complex nature of the Image
Understanding problem.

2 The IUE Program Schedule

The current IUE schedule is summarized in figure 1. The
draft IUE specification was distributed in December 1991
to a review board consisting of experienced
IU system developers and researchers for comment. The
responses will be incorporated in a revised document.
This final version will be broadly distributed in February
1992. At the same time, an RFP will be issued to select
an integrating contractor for the IUE. The actual
software development is to be carried out at a number
of IU research laboratories under the coordination
of the integrating contractor, who will supervise the coding
standards and provide the configuration control and
documentation of the system. It is hoped that many
university course activities will be initiated to evaluate
the IUE design and provide code for the system.²

It is expected that the IUE will be developed over
a two-year time period and the initial release will be
available at the end of 1994. It is planned to broadly
distribute the IUE to IU researchers throughout the US
at that time. The system will continue to be maintained
by DARPA until the end of 1996. At that point, it is
assumed that the IUE will be fully supported by the user
community through maintenance charges and technical
support fees. The IUE Committee will stay in existence
during the same time frame to provide design guidance
and to develop an IUE standards process.

¹ The IUE project is currently monitored by Oscar
Firschein, who replaced Rand Waltzman as the IU program
manager at DARPA in June 1991.

² Keith Price of USC is already experimenting with the
draft specification as a means for supervising algorithm
development by new graduate students in IU.

The following sections provide an overview of the
scope of the IUE and a summary of the draft object
hierarchies which will guide the development of the system.

3 Scope of the IUE

The primary purpose of the Image Understanding Environment
(IUE) is to facilitate exchange of research results
within the IU community. The IUE will provide a platform
for various demonstrations and tools for DARPA
applications. These demonstrations and tools will become
a primary channel for IU technology transfer. The
IUE will also serve as a conceptual standard for IU data
models and algorithms. The availability of standard
implementations for basic IU algorithms will facilitate
performance evaluation of new techniques and make it
possible to track progress in algorithm improvements.
The IUE is designed to support significant evolution of
IU approaches and to provide an effective programming
environment for rapid prototyping.

The IUE is not intended to be a real time system, although
tools will be provided for the simulation of real
time applications such as navigation. The IUE will not
support special hardware accelerators, but a standard image
processing interface will be provided. There is no
intention to generate a design suitable for embedding
in larger systems, although object class components can
certainly be used in the construction of new systems.

An important aspect of the development of the IUE
is the support of various application scenarios. The following
application areas have been selected to guide the
design of the IUE environment.

• Photo-interpretation - Support the analysis of
large aerial images with a variety of sensor modes
including Electro-Optical (EO), Synthetic Aperture
Radar (SAR) and Multi-Spectral (IR) images. The
main tasks include image registration and fusion,
object recognition, change detection and mensuration.

• Smart Weapons - Provide target identification
and tracking algorithms to demonstrate the application
of IU techniques in missile guidance. Application
tasks include image sequences, active vision
and object recognition.

•  Cartography  -  Provide  tools  for  the  construction 
of  cartographic  databases  from  images.  Applica¬ 
tion  tasks  include  stereo,  camera  modeling  and  au¬ 
tomated  image  feature  extraction. 

• Visual Navigation - Support ongoing development
in land reconnaissance vehicles. Application
tasks include road tracking, terrain analysis, obstacle
avoidance, and object recognition.

• Industrial Vision - It is anticipated that DARPA
will form a new thrust in design and manufacturing
technology within the time frame of the IUE development.
Application tasks in this area include
range image sensor development and range data
analysis, image segmentation, automated model
learning and visual feedback.

4  Design  Principles 

Object-Oriented 

The central approach to the design of the IUE is the
use of object-oriented design principles. Briefly, an object
is a data structure with associated operations, or
methods, which are naturally defined for the particular
structure. The design is specified in terms of an object
class hierarchy which represents abstraction relations between
classes. For example, a "T" junction is a special
type of junction. The internal details of an object are
hidden as much as possible so that new implementations
can be "plugged in" without significantly affecting the
rest of the system.
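
The junction example can be sketched in a few lines. This is only an illustration, written in Python for brevity rather than in the CLOS or C++ that the IUE targets; the class names, method names and angle representation are hypothetical and are not taken from the IUE specification.

```python
class Junction:
    """A point where several image curves meet (hypothetical class)."""

    def __init__(self, x, y, branch_angles_deg):
        # Internal representation: callers are expected to use the methods
        # below, so this storage could be replaced without affecting them.
        self._position = (float(x), float(y))
        self._angles = sorted(a % 360.0 for a in branch_angles_deg)

    def position(self):
        return self._position

    def degree(self):
        """Number of branches meeting at the junction."""
        return len(self._angles)

    def branch_angles(self):
        return list(self._angles)


class TJunction(Junction):
    """A junction with three branches, two of which are collinear."""

    def __init__(self, x, y, stem_angle_deg, bar_angle_deg):
        # The bar of the "T" contributes two opposite branches.
        super().__init__(
            x, y, [stem_angle_deg, bar_angle_deg, bar_angle_deg + 180.0]
        )
        self._stem = stem_angle_deg % 360.0

    def stem_angle(self):
        return self._stem


t = TJunction(10, 20, stem_angle_deg=90, bar_angle_deg=0)
print(isinstance(t, Junction), t.degree(), t.branch_angles())
```

Code written against Junction keeps working when handed a TJunction, and neither class exposes how the branch angles are stored.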

Standard  Workstation  Environment 

The IUE will be built using off-the-shelf workstation
components as much as possible to reduce development
cost. The system will use the UNIX operating system
and the user interface will be built on X-Windows. Standard
widget toolkits, programming environments and debugging
tools will be used in the system even if license
fees are involved. In addition, source code may not be
available for all of these support components.

Languages 

The system will support LISP and C by providing
parallel object class hierarchies and a mechanism for
communicating between the two language environments.
Common Lisp Object System (CLOS) and C++ are the
object-oriented language standards used in the IUE.

Interactivity 

The IUE will make extensive use of graphical interaction
to support the examination of features and to
provide convenient tools for model construction, recognition,
etc. These tools will be constructed within a
uniform user interface methodology and will allow convenient
selection and modification of graphic items.

Isolation  Layers 

Major subsystems of the IUE will be isolated by standard
interface protocols to help insulate the system development
from rapidly changing component designs.
For example, the Programmer's Interface Kernel (PIK)
might serve as an interface to image processing libraries.

Algorithm  Evaluation  Tools 

The IUE will provide support for comparison and testing
of image understanding algorithms. This support
will include database management for test suites of images
and other data as well as results of standard algorithms
on the same data. Statistical tools will be provided
to assist in the determination of classification rates
and algorithm reliability.

Figure 1: The development schedule for the IUE.
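
As a minimal sketch of the kind of statistic such tools might report, the following Python function computes overall and per-class classification rates from paired ground-truth and predicted labels. The function name, the label strings and the sample data are invented for illustration; the text does not specify which statistics the IUE will provide.

```python
from collections import Counter


def classification_rates(truth, predicted):
    """Return (overall rate, per-class rate) for paired label lists."""
    if len(truth) != len(predicted):
        raise ValueError("truth and predicted must have the same length")
    if not truth:
        raise ValueError("at least one sample is required")
    totals = Counter(truth)
    correct = Counter(t for t, p in zip(truth, predicted) if t == p)
    overall = sum(correct.values()) / len(truth)
    per_class = {label: correct[label] / n for label, n in totals.items()}
    return overall, per_class


# Made-up labels standing in for one algorithm's output on a test suite.
truth = ["road", "road", "building", "building", "field"]
predicted = ["road", "field", "building", "building", "field"]
overall, per_class = classification_rates(truth, predicted)
print(overall, per_class)
```

Storing the inputs and the resulting rates together with the test suite is what would let a new algorithm be compared against a standard one on the same data.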

5 The IUE Abstraction Hierarchy

Object Abstraction for the IUE lays out the trunk of
the class hierarchy and proposes software development
guidelines. Its aim is to smooth the way for independently
developed IUE components to work together. The
IUE class hierarchy is organized around a single root
class (Object), a metaclass (Class), and a core set of classical
mathematics, physics, and information processing
related classes. Guidelines for extending the class hierarchy
are also included.

The IUE will be a large and extensible system jointly
developed at numerous sites throughout the country.
Bringing order to the IUE build requires coordination
well beyond generic good intentions of "object-oriented
programming" or "OO design". The IUE will ultimately
rest on a foundation of one specific class hierarchy and
one specific set of design principles. That foundation
(not the somewhat vague property of being OO) must
support the built-in IUE capabilities and accommodate
new additions. A good foundation will keep overall
building costs down and improve chances for long-term
satisfaction.

Object Abstraction aims to help developers provide
consistent capabilities and names for classes and methods.
It also aims to ensure that capabilities are comprehensive
and logically arranged with well-defined paths
to obtaining maximal efficiency. The overall effect is to
allow each new development to be added to the class hierarchy
at the logically correct point (rather than, for
example, making a new incompatible class hierarchy or
working with none at all).

Object Abstraction is essentially one Object class,
one Class class, a Collection class hierarchy, and
design/programming guidelines for the IUE build. These
foundation classes define an extensive set of methods
for interacting with environment-level tools, for tailoring
specialized classes via parametric hooks, and for consistently
accessing common mathematics and physics operations.
Figure 2 outlines the class hierarchy. The figure
is no more than an outline; the IUE has far more classes
than shown, name choices are in flux, specific inheritance
paths are changing, and the IUE uses multiple
inheritance (albeit quite selectively).

The notation used in the diagram follows the convention
of an object-oriented design tool called Object Management
Tool or OMT. The definition of the icons used
in constructing the object hierarchies is shown in Figure 3.

5.1  Object  and  Class 

Together, Object and Class define the environment-level
behavior shared by all objects. All IUE classes inherit
from Object and are associated with a unique instance of
the class Class. These classes allow the environment to
examine instances and configure operations (especially
I/O, copy, and display/editing) based on different types
and classes. This two-class approach is closely aligned
with systems such as CLOS, Smalltalk, and the National
Institutes of Health C++ Class Library. Applying the approach
to C++ requires substantial infrastructure and
discipline, but will be valuable in organizing the IUE
and providing powerful interactive capabilities.

Figure 3: The icons used to construct the object hierarchy diagrams used throughout the text.
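
The two-class idea can be illustrated with Python's metaclass mechanism, which plays a role similar to class metaobjects in CLOS. The sketch below is an assumption-laden illustration, not the IUE design: the `slots` attribute, the `registry` dictionary and the `describe` and `copy` methods are all hypothetical names.

```python
class Class(type):
    """Metaclass: every class below is itself an instance of Class."""

    registry = {}

    def __new__(mcs, name, bases, namespace):
        cls = super().__new__(mcs, name, bases, namespace)
        # Collect the slot names declared along the inheritance chain,
        # most general class first.
        slots = []
        for base in reversed(cls.__mro__):
            for slot in base.__dict__.get("slots", ()):
                if slot not in slots:
                    slots.append(slot)
        cls.all_slots = tuple(slots)
        Class.registry[name] = cls
        return cls


class Object(metaclass=Class):
    """Root class: generic operations driven by the class description."""

    slots = ()

    def __init__(self, **values):
        for slot in type(self).all_slots:
            setattr(self, slot, values.get(slot))

    def describe(self):
        """Generic display that works for any subclass without extra code."""
        names = type(self).all_slots
        fields = ", ".join(f"{s}={getattr(self, s)!r}" for s in names)
        return f"{type(self).__name__}({fields})"

    def copy(self):
        names = type(self).all_slots
        return type(self)(**{s: getattr(self, s) for s in names})


class Image(Object):
    slots = ("width", "height")


class ColorImage(Image):
    slots = ("bands",)


img = ColorImage(width=640, height=480, bands=3)
print(type(ColorImage).__name__, img.describe())
```

Because the environment can ask any class for its slots, operations such as copying, display and I/O can be written once on the root class. Achieving the same in C++ requires the extra infrastructure the text mentions, since the language does not provide run-time class objects of this kind.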

5.2  Math,  physics,  and  information  processing 
classes 

The  math,  physics,  and  information  processing  classes 
represent  such  fundamental  concepts  as  images,  ex¬ 
tracted  features,  world  objects,  pure  geometry,  trans¬ 
formations  (including  coordinate  systems),  sensors,  sets, 
sequences,  relations,  and  networks  of  relations.  The  Col¬ 
lection  class  is  basic  to  much  of  the  class  hierarchy. 

Collection  is  parameterized  with  functions  (including 
equivalence,  insert  compatibility,  and  union  compatibil¬ 
ity)  so  that  it  can  cleanly  specialize  in  multiple  direc¬ 
tions: 

•  Finite,  countably  infinite,  and  uncountable  (at  least 
conceptually  so)  numbers  of  elements 

• Object-valued or language-primitive-valued (e.g.,
int) element types

•  Constraints  on  types  or  values  of  elements  to  be 
inserted. 
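
A minimal sketch of this kind of parameterization follows, covering only the finite case. The constructor arguments `equivalent` and `insert_compatible` are hypothetical names chosen to mirror the functions mentioned above; union compatibility is omitted for brevity.

```python
class Collection:
    """Finite collection parameterized by two functions (hypothetical API).

    equivalent(a, b)      -> True if a and b count as the same element
    insert_compatible(x)  -> True if x may be inserted
    """

    def __init__(self, equivalent=None, insert_compatible=None):
        self._equivalent = equivalent or (lambda a, b: a == b)
        self._insert_compatible = insert_compatible or (lambda x: True)
        self._items = []

    def __contains__(self, x):
        return any(self._equivalent(x, item) for item in self._items)

    def insert(self, x):
        if not self._insert_compatible(x):
            raise TypeError(f"{x!r} may not be inserted in this collection")
        if x not in self:
            self._items.append(x)

    def __len__(self):
        return len(self._items)


# A set of ints in which values that are equal modulo 10 are equivalent.
residues = Collection(
    equivalent=lambda a, b: a % 10 == b % 10,
    insert_compatible=lambda x: isinstance(x, int),
)
for value in (3, 13, 27, 7):
    residues.insert(value)
print(len(residues))  # 3 and 13 are equivalent, as are 27 and 7
```

Supplying different functions yields a different specialization without writing a new class, which is one plausible reading of how a single Collection class could specialize "in multiple directions".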

Early versions of the IUE design had nearly every class
branching off the trunk in Figure 2 inheriting directly
from Object. The current designs enforce much greater
uniformity among important mathematics and physics
classes by moving them to more meaningful positions
further down the class hierarchy.


5.3  Development  guidelines 


The  development  guidelines  of  Object  Abstraction  in¬ 
clude: 

• A dictionary for translating between the terminology of
C++, Common Lisp, and standard OOP.

•  Naming  conventions  for  methods  —  these  specify 
such  characteristics  as  return  type,  inlining,  image 
boundary  handling,  immutable  versions,  etc. 

•  Notions  of  abstract  type  and  implementation  hier¬ 
archies  embedded  in  the  class  hierarchy  —  this  sep¬ 
arates  method  definitions  from  slot  definitions  so 
that highly constrained objects far down in an inheritance
hierarchy can have the union of all supertype
method names without carrying around extra
slots  defined  in  many  unrelated  implementations  of 
supertypes. 

•  Use  of  source  preprocessing  tools  to  aid  C++  devel¬ 
opment. 

•  Use  of  persistent  object  techniques. 

• Views of objects: view objects wrap around other
instances to change the apparent set of methods or
interface (e.g., a view is a low-cost and consistent
way of creating a vector-valued image from a sequence
of images and vice versa).
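
The view guideline in the list above can be sketched as a wrapper that forwards pixel access to the wrapped images instead of copying them. Both classes and their `shape` and `get` methods are hypothetical and are used only to show the idea.

```python
class Image:
    """Minimal scalar image backed by a list of rows (hypothetical)."""

    def __init__(self, rows):
        self._rows = [list(r) for r in rows]

    def shape(self):
        return len(self._rows), len(self._rows[0])

    def get(self, row, col):
        return self._rows[row][col]


class VectorImageView:
    """Presents a sequence of same-sized images as one vector-valued image.

    No pixel data is copied; each access is forwarded to the wrapped images.
    """

    def __init__(self, images):
        images = list(images)
        if not images:
            raise ValueError("at least one image is required")
        if len({img.shape() for img in images}) != 1:
            raise ValueError("all images must have the same shape")
        self._images = images

    def shape(self):
        return self._images[0].shape()

    def get(self, row, col):
        return tuple(img.get(row, col) for img in self._images)


frames = [Image([[1, 2], [3, 4]]), Image([[5, 6], [7, 8]])]
view = VectorImageView(frames)
print(view.shape(), view.get(1, 0))
```

The view offers the same `shape` and `get` interface as an image, so code written for images can consume it unchanged, and its cost is one small wrapper object regardless of image size.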

The  guidelines  aim  at  producing  efficient  class  hierar¬ 
chies  free  of  semantic  conflicts  on  methods  or  slots  (i.e., 
potential  conflicts  due  to  multiple  inheritance  are  not 
left  to  chance). 
6 The User Interface

6.1  Scope 

The user interface supports many different functions:
algorithm and system development, visualizing and monitoring
execution, interacting with IUE objects, finding
results and information, generating animations and
video, and many others. In our design we have studied
several different IUEs to determine the critical functionality
that we want to incorporate into the DARPA IUE.
We have also stressed creating an interface which will be
supported by ongoing and future developments in the
software world at large. To achieve this, we have organized
the interface into a relatively small set of objects
which can be built on top of existing interface packages
and interface construction toolkits.

6.2  Conceptual  Description 

The interface is described in terms of three levels (Figure 4).
The Graphics Level is the underlying "machine
independent" package for basic display and graphic
operations and for telling the screen what to do. The
Interface Kit Level consists of existing packages for the
creation and rapid prototyping of user interfaces and related
tools on top of graphics level software. This also includes
tools found in the selected software development
environment, such as editors and debuggers. The Image
Understanding Environment User Interface
(IUEUI) Level consists of the objects in the user interface.
This includes such things as object displays, plotting
displays, several types of browsers, and structures
for describing the interface context.

The IUE interface objects are organized into three basic
classes (see Figure 5). The first class consists of displays
and browsers. These are the basic tools for viewing an
object and inspecting its symbolic attributes and relations.
(There are many commonalities between these objects
that suggest a meaningful and general IUE interface
object.) The major portion of what a user does with the
interface will be based upon these objects. The second
class consists of the objects commonly used in the
supporting user interface mechanisms provided by the
graphics and toolkit levels (menus, widgets, icons, etc.)
but with simplified commands so they can be manipulated
directly by IUE users. The third class consists of support
objects for such things as describing the current interface
context, the mapping from a spatial object onto a display
window, links between IUE interface objects, animation
files, and several other things. Many of these are not
necessarily full-blown objects, but common data structures.

6.2.1  Object  Display  and  Browsers 

Object Displays: These are for viewing objects
which have coordinate systems associated with them and
mapping them onto a 2D display. Such objects include
images, curves, regions, object models, surfaces, vector
fields, etc. Object displays support several types of operations
for controlling the mapping of an object to be viewed
in a window and for interacting with a displayed object.
There are several subclasses of displays that will
appear to the user to occur in the same type of window.
They are primarily distinguished by the types of methods
they understand, and all inherit a large number of
similar methods from the general display class. For example,
the overlay method means something different in
the context of a surface display than in the context of an
image display. The pixel display class is for viewing images
and image registered features. The local graphics
display class displays objects by mapping their values
onto graphic objects such as lines and cubes. Examples
are displaying vector fields and edges. The surface display
class is for displaying objects that get mapped onto
mesh or rendered surfaces. There are several different
types of plot display: 1D, 2D, and 3D graphs, histograms,
scattergrams, perspective views of functions, and tables.

Figure 4: A conceptual portrayal of the various levels of
the IUE user interface.

Browsers: These are used for actions such as queries
over sets of objects, determining and inspecting relationships
between objects, process monitoring, and inspecting
values in an object. There are two different types of
browsers: Field Browsers and Graph Browsers.

Field Browsers consist of a regular array of fields.
Fields can be filled with text, icons, colors, colored text,
or text in particular fonts. Fields can have actions associated
with them when they are selected or a user changes
the values in them. We distinguish between four types
of Field browsers which inherit from the general Field
browser class:

• Set/Database Browser: This is presented as an
array of fields. Each row of fields corresponds to
selected attributes of a particular object and each
column corresponds to common attributes over the
set (or database) of objects. An example would be
browsing the database which describes the current
active objects in the IUE to find the most recently
created image from some operations.

Figure 5: IUEUI Object Hierarchy.

• Single Object Browser: Each row corresponds to
the value of an attribute for an object. This is used
for inspecting a single object.

• Hierarchical Browser: Useful for text based inspection
of graph structures and trees. When an
item is selected, the related items (along some relational
dimension) are displayed in the next column
(something like the directory browser on the NeXT
machine).

• Object-Registered Browser: This contains values
extracted from a spatial object, such as the intensity
values in some square neighborhood of an
image. Depending on the dimensionality of the object
(or relationships between component objects),
this can be presented as a 1D array, a 2D array,
or multiple 2D arrays and describes curves, images,
image sequences, and pyramids. There are restrictions
on whether it is possible to interactively change values
in the fields of an array browser. It should be
possible to apply operations directly to the values
in the array browser to see the effect of an operation
in a restricted neighborhood of an object.

Graph Browsers: These are for the display of graphs
and networks, generally representing an object as a node
and links to describe relations to other objects. Nodes
are similar to fields in field browsers and can be filled
with text, icons, colors, colored text, or text in particular
fonts. Nodes can also have actions associated with them
when they are selected or a user changes the values in
them. Links can also be colored and selected. A typical
use would be for the display of a constraint network.

An important type of graph and graph browser is a
metrically embedded graph wherein the nodes are
restricted to occur at positions with respect to a coordinate
system. This type of graph inherits properties from
both the Graph Browser and a general spatial object
which can be viewed in a display window. An example
would be an image registered network which describes
potential links between extracted features for displaying
grouping operations. An important attribute of metrically
embedded graphs is that they can be viewed as an
object display for operations such as zooming and having
access to the underlying context in an image.

6.2.2  Simplified  access  to  underlying  GUI 
objects 

Gizmos and Widgets: The IUE will provide simplified,
interactive access to the interface objects found
in GUI kits, such as sliders, knobs, buttons,
text input/output fields, and menu creation and personalization.
This will involve commands for creating gizmos
and widgets, for positioning and scaling them, for attaching
them to parameters, and for reading and writing to
them. An example would be creating a slider and then
getting values for an interactive thresholding operation
from it.
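
The slider example can be sketched without committing to any particular GUI kit. In the sketch below the `Slider` class is a toolkit-independent stand-in, and its `attach`, `read` and `write` methods are hypothetical names modeled on the commands listed above; a real implementation would sit on top of the chosen widget toolkit.

```python
class Slider:
    """Toolkit-independent stand-in for a GUI slider (hypothetical)."""

    def __init__(self, low, high, value):
        self.low, self.high = low, high
        self._value = value
        self._callbacks = []

    def attach(self, callback):
        """Call callback(value) whenever the slider value changes."""
        self._callbacks.append(callback)

    def read(self):
        return self._value

    def write(self, value):
        # Clamp to the slider range, as a real widget would.
        self._value = max(self.low, min(self.high, value))
        for callback in self._callbacks:
            callback(self._value)


def threshold(image, level):
    """Binary threshold of a list-of-rows image."""
    return [[1 if pixel >= level else 0 for pixel in row] for row in image]


image = [[10, 200], [90, 130]]
results = []
slider = Slider(low=0, high=255, value=128)
slider.attach(lambda level: results.append(threshold(image, level)))

slider.write(100)  # simulates the user dragging the slider
slider.write(300)  # out-of-range input is clamped to 255
print(results)
```

Each change of the slider value re-runs the thresholding operation, which is the interactive behavior the paragraph describes.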

Menus: The IUE will provide simplified interactive access
to menus in the GUI kits. This involves being able
to extend menus, create pop-up menus, and associate actions
with menu items. A critical design task is deciding what
goes into system level menus and how they are organized.
Icons: The IUE will provide simplified interactive access
to icons in the GUI kits.

6.3  Support  Objects  and  Common  Data 
Structures 

There  are  also  several  objects  that  are  used  and  manip¬ 
ulated  as  part  of  the  interface.  Some  of  these  are: 

• Display-Look-Up-Table: A generalization of a
color look up table that describes how to map object
values onto screen values. It can also include
functions.

•  Object-Display-Mapping:  A  structure  which  de¬ 
scribes  the  mapping  from  an  object  onto  a  display. 
This  includes  both  the  position  and  values  of  how 
the  object  is  displayed  and  a  reference  to  a  particu¬ 
lar  Display  Look  Up  Table. 

•  Object-Browsing-Mapping:  A  structure  which 
describes  the  mapping  from  an  object  or  database 
onto  a  browser 

• Object Display Links: A structure which describes
the concatenation of a display or browsing
operation between IUE interface objects. Thus a
link between display windows w1 and w2 with an
associated zoom and pan would display an object in
w1 with w1's object display mapping and then display
the same object in w2 by concatenating onto
the object display mapping for w1 the specified
zoom and pan operation.

• Interface layout: A structure which describes the
object instances in a particular instantiation of the
interface. Users may prefer different interfaces (arrangement
and instantiation of the basic IUEUI objects)
depending on the task or level of sophistication.

• Display Context: A structure which describes the
current context for a display. This includes such things
as the current window, the current object, the current object
display mapping, the current display command, the
current mouse-selected object position and value,
and others. Display operations can use defaults
based upon these.

•  Browse  Context:  A  similar  structure  for  browsing 
operations.  Such  things  as  the  current  browser,  the 
current  data  base,  the  query  history,  and  others. 

•  Display  Snapshot:  What  is  produced  when  the 
current  display  is  written  to  a  file.  It  is  just  what 
appears  on  the  screen  and  not  the  actual  objects 

•  Animation  File:  A  sequence  of  display  snapshots 
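
The Object-Display-Mapping and Object Display Links entries in the list above can be made concrete with a small sketch. It assumes, purely for illustration, that a position mapping consists of a uniform zoom followed by a pan; the class name `DisplayMapping` and its methods are hypothetical, and the value mapping through a Display Look-Up Table is left out.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class DisplayMapping:
    """screen = zoom * object + pan, applied to both coordinates."""

    zoom: float = 1.0
    pan_x: float = 0.0
    pan_y: float = 0.0

    def to_screen(self, x, y):
        return (self.zoom * x + self.pan_x, self.zoom * y + self.pan_y)

    def to_object(self, sx, sy):
        """Inverse mapping, e.g. for recovering a position from a click."""
        return ((sx - self.pan_x) / self.zoom, (sy - self.pan_y) / self.zoom)

    def then(self, other):
        """Mapping equivalent to applying self first and other second."""
        return DisplayMapping(
            zoom=other.zoom * self.zoom,
            pan_x=other.zoom * self.pan_x + other.pan_x,
            pan_y=other.zoom * self.pan_y + other.pan_y,
        )


w1_mapping = DisplayMapping(zoom=2.0, pan_x=10.0, pan_y=0.0)
link = DisplayMapping(zoom=4.0, pan_x=-100.0, pan_y=-50.0)  # zoom and pan
w2_mapping = w1_mapping.then(link)

print(w1_mapping.to_screen(30, 40), w2_mapping.to_screen(30, 40))
```

Because the link is stored separately from the mapping of the first window, a later change to that mapping only requires recomputing the concatenation to update the second window.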

The interface has an interactive command language
for interacting with objects and sending messages to displays
and browsers. These commands can also occur in
code for creating scripts. The interface supports Interactive
Command Buffers which look like WYSIWYG
text editors. Textual outputs can be written to the Interactive
Command Buffer. It has a vertical scroll bar for
accessing previously written commands and allows operations
like cutting, pasting, etc.

Another interface component is the Tool Box. There are
probably hundreds of nice interactive controls for displays
and visualization that IU researchers are familiar with,
such as interactively manipulating the object-value to
screen-intensity function by interactively shaping the
function; selecting color look-up tables; modifying color
look-up tables; interactively building display commands
using templates or command browsers; a floating tool
palette of interactive drawing tools; etc. The Display Tool
Box is a menu of such tools, organized into functional
groupings of interactive tools for manipulating the current
display. From a functional point of view, it allows
potentially redundant access to the display methods
without using the interactive command buffer. It is
somewhat like the system control menu on the Macintosh
and the system preference menu on the NeXT machine.
Users can also select particular interaction tools and have
them occur as a floating palette (in cases where the user
wants to interact with multiple tools from different sets
of tools at the same time).

There  are  several  methods  for  the  general  display  ob¬ 
ject  (and  related  ones  for  browsers).  They  are  organized 
into  different  groups: 

Methods  for  Manipulating  the  Current  Ob¬ 
ject  display  position  mapping:  This  includes 
operations  such  as  panning,  zooming,  perspective 
views,  and  warping.  These  are  methods  that  control 
how  positions  in  the  specified  object(s)  get  mapped 
onto  a  display  window 

Methods  for  Manipulating  the  Current  Ob¬ 
ject  value  mapping:  This  includes  operations 
such  as  overlays,  mapping  onto  different  color 
bands,  transparency,  and  others.  These  are  meth¬ 
ods  that  control  how  values  in  the  specified  object(s) 
get  mapped  onto  screen  attributes  such  as  color  and 
intensity 

Methods for setting the current display mapping
table: how to configure planes in the screen
buffer for the display of color images; how many
panes to use for overlays; particular functions and
conditions to apply to object values prior to display

Methods  for  Screen  Attributes:  These  involve 
controlling  attributes  of  the  window  the  display  is 
mapped  onto  and  includes  such  things  as  position, 
size,  attributes  of  the  title  bar,  event  handling  for 
the  mouse 

Methods  for  Links:  Linking  display  transforma¬ 
tions  in  different  windows.  Operations  include  cre¬ 
ating  links  and  associating  position  and  value  map¬ 
pings  with  the  links 

Methods  for  Interaction:  These  involve  interac¬ 
tion  and  manipulation  of  displayed  object(s)  in  the 
display.  Operations  include  recovering  object  posi¬ 
tion  and  value  from  a  mouse  click,  applying  func¬ 
tions  to  selected  objects,  applying  functions  using 
selected  information. 


Methods for History: Methods to coordinate displays
over time, such as cycling through an image
sequence, playing an animation of displays

Methods for Graphics: These involve accessing
display registered graphics packages for drawing
lines, text, and other things. These occur in four
different modes: 1) relative to the window of the display;
2) relative to the entire screen; 3) relative to the specified
object or coordinate system; or 4) for instantiating
IUE objects corresponding to the graphic displays

Methods  for  printing,  writing  to  file,  anima¬ 
tion 

6.4  Applications 

One aspect of the interface design has been to compile
examples of typical or interesting interactions that correspond
to tasks a user may want to perform. Examples
are such things as different ways of displaying an image;
establishing processing links between multiple displays
and browsers; interactively applying a function to a local
neighborhood of a displayed IUE object; display of a
vector field and 3D surface orientation; finding a previously
generated result and displaying it; viewing a surface
display of image intensity in the neighborhood of
an occluding edge over a sequence of images; and several
others. We expect this to expand to several hundred
examples by the time the specification is completed.

For example, the following prototype commands:

[p w1 image]

[i w1 :1 [s w2 image :location image.x image.y
:distance 200
:theta1 30
:theta2 0
:clear t]]

would  display  an  image  in  window  1  and  then  allow 
the  user  to  interactively  select  locations  and  values  from 
the  image  by  mouse  clicks  in  the  window.  When  the 
user  hits  the  number  1  terminal  key,  a  surface  display 
of  the  mouse-selected  area  is  generated  in  the  display  in 
window  2. 

For  general  applications,  particular  emphasis  is  being 
placed  on  1)  allowing  the  user  to  configure  the  interface 
from  the  underlying  objects  and  2)  also  having  default 
interface  layouts  and  scripts  so  applications  can  be  used 
by  non-specialists. 

6.5  Extensions 

There  are  several  extensions  we  intend  to  investigate 
when  the  initial  specification  is  completed: 

•  Compatibility  with  different  visualization  and  plot¬ 
ting  packages 

• Incorporating Hypermedia and Cooperative Work
tools for communication between IU researchers and
on-line tutorials

• Compatibility with diverse interaction devices (currently
we are assuming only a multi-button mouse
and a keyboard). Future device interfaces will almost
certainly involve gesture and voice


7  Tasks  and  Tools 

7.1  Scope 

Tasks are envisioned as the object-oriented mechanisms
that are used to represent large-grain IU processes; i.e.,
the Task class is used to implement IU algorithms within
the object hierarchy of the IUE. By large grain processes
we mean those processes that operate over a domain consisting
of a significant number of pixels or other IU objects.
Thus a function that applies a convolution kernel
to an image would be considered to be an IUE Task, but
a function that scales the value of an individual pixel
would not.

The concept of a Task as an object rather than as a
method associated with the traditional IU objects such
as images or lines is not necessarily obvious. However,
there are several significant aspects of these processes
that lead us to this model of a Task as an IU object.

Large grain IU operators typically have complex parameter
structures. A significant portion of the time
spent in developing IU applications is spent in the exploration
of the search space defined by the parameters of
the operators. For example, a common type of question
that needs to be answered by an IU researcher is: "What is
the most effective Laplacian radius when performing a
Zero Crossing segmentation on aerial images of cities?"
A great deal of time and effort is spent in determining
appropriate values for parameters of a particular operator
in a particular domain. Once these parameter values
have been determined, it becomes natural to think of the
parameterized operator as a new entity that is different
from the unspecialized generic operator. This view leads
to the concept of an operator as a Task object that can
use inheritance and specialization to represent these parameter
structures.

lU  research  also  involves  a  very  large  amount  of  pro¬ 
cessing  and  data  generation;  in  this  type  of  environment 
it  becomes  important  to  be  able  to  examine  the  pro¬ 
cessing  history.  The  Task  class  readily  supports  the 
maintenance  of  a  processing  history  through  the  explicit 
representation  of  Task  parameters.  A  Task  instance  can 
exist  in  one  of  three  states  depending  on  the  specification 
of  the  input  and  output  parameters  for  the  Task:  it  may 
either  be  partially  specified,  fully  specified,  or  completed. 
In  this  way,  Task  instances  describe  both  the  complete
input  and  output  specification  and  the  processing  sta¬ 
tus  of  Tasks.  The  Task  objects  thus  provide  a  complete 
description  of  the  large  granularity  image  understanding 
processing  that  has  occurred  in  a  user  environment. 
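The three Task states and the idea of a tuned operator as a specialization can be sketched as follows. This is only an illustration under stated assumptions: the class names, the `required_inputs` and `defaults` attributes, and the radius value are hypothetical and are not taken from the IUE specification.

```python
from enum import Enum


class TaskState(Enum):
    PARTIALLY_SPECIFIED = "partially specified"
    FULLY_SPECIFIED = "fully specified"
    COMPLETED = "completed"


class Task:
    """Hypothetical Task: explicit inputs and outputs plus a derived state."""

    required_inputs = ()   # names that must be set before the Task can run
    defaults = {}          # parameter values fixed by a specialization

    def __init__(self, **inputs):
        self.inputs = {**self.defaults, **inputs}
        self.outputs = None

    @property
    def state(self):
        if self.outputs is not None:
            return TaskState.COMPLETED
        if all(name in self.inputs for name in self.required_inputs):
            return TaskState.FULLY_SPECIFIED
        return TaskState.PARTIALLY_SPECIFIED

    def run(self):
        if self.state is TaskState.PARTIALLY_SPECIFIED:
            raise ValueError("required inputs are missing")
        self.outputs = self.execute(**self.inputs)
        return self.outputs

    def execute(self, **inputs):
        raise NotImplementedError


class ZeroCrossingSegmentation(Task):
    required_inputs = ("image", "laplacian_radius")

    def execute(self, image, laplacian_radius):
        # Placeholder body: a real operator would segment the image here.
        return {"summary": f"segmented with radius {laplacian_radius}"}


class AerialCityZeroCrossing(ZeroCrossingSegmentation):
    # Specialization recording a tuned parameter (the value is made up).
    defaults = {"laplacian_radius": 3.0}
```

Because the inputs and outputs stay on the instance after `run()`, a list of such instances could serve as a simple processing history.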

Another  aspect  of  large  grain  lU  processes  is  that  they 
are  used  by  the  lU  researcher  as  a  set  of  tools  within 
an  experimental  toolbox.  Researchers  require  a  flexible
mechanism  for  control  and  data  chaining  that  allows
them  to  construct  experiments  that  combine  individual 
Tasks  into  more  complex  algorithms.  The  Task  class 
hierarchy  provides  this  mechanism  through  the  Com- 
poundTask  and  DataflowGraph  object  classes.  With 
these  classes,  the  user  may  chain  together  individual 
Task  objects  either  through  programs  or  through  the  use 
of  an  interactive  graphical  interface.

DataflowGraphs  allow  the  user  to  specify  data  pathways
between  Task  objects.  The  Tasks  in  a  DataflowGraph
can  have  some  subset  of  their  input  parameters
specified  dynamically;  the  input  values  of  the  dynamic 
parameters  are  specified  by  the  values  of  output  param¬ 
eters  of  other  Task  objects.  Whenever  values  have  been 
specified  for  all  required  input  parameters,  the  Task  ob¬ 
ject  is  executed.  Other  DataflowGraph  objects,  such  as 
DataflowConditional  nodes  and  DataGenerator  nodes, 
provide  the  control  constructs  that  make  the  DataflowGraph
an  effective  programming  tool.  A  DataflowGraph
may  be  constructed  in  C,  in  Lisp,  or  at  the  interface
level  to  form  these  complex  processes.  In  C  or  Lisp,  this
complex  Task  control  can  also  be  implemented  through
a  message  passing  paradigm  in  which  Task  instances  are
parameterized  and  controlled  through  messages  (generic 
function  calls)  from  a  controlling  program. 
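The data-driven execution rule described above (a Task runs as soon as all of its required inputs have values) can be sketched in a few lines. The `Node` and `DataflowGraph` classes below are hypothetical stand-ins, not the IUE classes, and the two node functions are trivial placeholders for real operators.

```python
class Node:
    """Hypothetical dataflow node wrapping a function of named inputs."""

    def __init__(self, name, func, required):
        self.name = name
        self.func = func                  # returns a dict of named outputs
        self.required = set(required)
        self.inputs = {}
        self.outputs = None

    def ready(self):
        return self.outputs is None and self.required.issubset(self.inputs)


class DataflowGraph:
    def __init__(self):
        self.nodes = {}
        self.links = []   # (src node, src output, dst node, dst input)

    def add(self, node):
        self.nodes[node.name] = node
        return node

    def connect(self, src, out_name, dst, in_name):
        self.links.append((src, out_name, dst, in_name))

    def run(self):
        """Execute every node whose inputs are complete; return the order."""
        order = []
        progressed = True
        while progressed:
            progressed = False
            for node in self.nodes.values():
                if node.ready():
                    node.outputs = node.func(**node.inputs)
                    order.append(node.name)
                    # Forward outputs along links; this may make others ready.
                    for src, out_name, dst, in_name in self.links:
                        if src == node.name:
                            self.nodes[dst].inputs[in_name] = node.outputs[out_name]
                    progressed = True
        return order


graph = DataflowGraph()
graph.add(Node("filter", lambda image, gain: {"image": [v * gain for v in image]},
               ["image", "gain"]))
graph.add(Node("threshold", lambda image, level: {"mask": [v > level for v in image]},
               ["image", "level"]))
graph.connect("filter", "image", "threshold", "image")
graph.nodes["filter"].inputs.update(image=[1, 2, 3], gain=2)   # static parameters
graph.nodes["threshold"].inputs["level"] = 3                   # "image" is dynamic
order = graph.run()
```

Conditional and generator nodes are omitted here; they would be further node types consulted inside the same loop.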

Programming  modularity  is  provided  through  the 
CompoundTask  class.  Whether  the  process  is  speci¬ 
fied  through  a  DataflowGraph  or  through  a  message¬ 
passing  program,  the  entire  process-specification  may  be 
abstracted  into  a  single  CompoundTask  entity.  These 
CompoundTasks  may  then  be  appropriately  parameter¬ 
ized  and  used  as  Task  objects  in  the  interface. 

Another  specialized  aspect  of  lU  algorithms  is  that 
they  typically  have  high  computational  demands 
and  can  benefit  greatly  from  multiprocessor  configura¬ 
tions  or  special  purpose  hardware.  Because  the  Task 
model  separates  the  I/O  specification  from  the  process 
specification,  the  execution  of  a  Task  object  can  become 
independent  of  the  host  language  and  processor.  In  this 
way,  it  becomes  possible  for  the  user  to  work  within  a 
single,  unified  environment,  but  at  the  same  time  have 
access  to  a  wide  variety  of  processing  options. 

7.2  Conceptual  Description 

The  significant  classes  of  the  task  object  hierarchy  are 
Task,  ProcessSpecification,  TaskGroup,  and  Dataflow- 
Graph. 

Task  A  Task  is  the  object  level  representation  of  a
large  grain  IU  algorithm.  It  explicitly  contains  a
complete  description  of  the  I/O  parameters  for  the 
process  corresponding  to  the  algorithm.  It  also  con¬ 
tains  a  reference  to  the  executable  code  for  the  algo¬ 
rithm  in  the  form  of  a  ProcessSpecification  object. 
Because  this  ProcessSpecification  is  separate  from 
the  Task  itself,  the  Task  model  supports  a  wide  va¬ 
riety  of  processing  environments. 

ProcessSpecification  A  ProcessSpecification  object  is 
designed  to  be  a  point  of  access  to  the  executable 
code  which  implements  the  algorithm  for  a  Task. 
This  specification  allows  the  actual  code  to  be  spec¬ 
ified  either  in  the  native  language  (Lisp  or  C)  for  ex¬ 
ecution  within  the  current  process,  in  the  non-native 
language  for  execution  within  the  current  process, 
or  in  either  language  for  execution  within  an  exter¬ 
nal  process  that  may  (potentially)  run  on  a  separate 
processor. 

TaskGroup  A  TaskGroup  is  the  object  level  representation
of  a  collection  of  related  Task  objects.
TaskGroup  objects  provide  a  graphically  oriented
experimental  vehicle  for  the  easy  exploration  of  a
variety  of  related  algorithms  and  also  provide  a  rep¬ 
resentational  hierarchy  for  the  Tasks  of  the  lUE. 
Typically,  the  set  of  Tasks  specified  in  a  TaskGroup 
will  possess  very  similar  I/O  parameter  specifica¬ 
tion,  thereby  making  it  easy  to  switch  between  the 
various  Tasks. 

DataflowGraph  A  DataflowGraph  object  is  the  rep¬ 
resentation  of  the  dataflow  between  Tasks  in  a 
ProcessSpecification.  Specifically,  it  is  a  graph  of 
DataflowNodes.  The  graph  describes  the  data  de¬ 
pendencies  between  Tasks.  The  DataflowNodes 
themselves  describe  the  exact  mapping  between  the 
output  parameter  values  of  one  Task  and  the  input 
parameter  values  of  another. 
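The separation between a Task's I/O description and its ProcessSpecification might look like the sketch below. It assumes Python in place of Lisp or C, and JSON over standard input and output as the exchange format with an external process; both are illustrative choices, and every class name is hypothetical.

```python
import json
import subprocess
import sys


class InProcessSpec:
    """Runs a callable inside the current interpreter."""

    def __init__(self, func):
        self.func = func

    def execute(self, inputs):
        return self.func(**inputs)


class ExternalProcessSpec:
    """Runs a source string in a child process; JSON carries the I/O."""

    def __init__(self, source):
        self.source = source   # must read JSON on stdin and print JSON

    def execute(self, inputs):
        done = subprocess.run(
            [sys.executable, "-c", self.source],
            input=json.dumps(inputs), capture_output=True, text=True, check=True,
        )
        return json.loads(done.stdout)


class Task:
    """Holds the I/O parameters and only a reference to the executable code."""

    def __init__(self, inputs, spec):
        self.inputs = dict(inputs)
        self.spec = spec
        self.outputs = None

    def run(self):
        self.outputs = self.spec.execute(self.inputs)
        return self.outputs


def count_above(values, level):
    return {"count": sum(1 for v in values if v > level)}


EXTERNAL_SOURCE = (
    "import json, sys\n"
    "args = json.load(sys.stdin)\n"
    "n = sum(1 for v in args['values'] if v > args['level'])\n"
    "print(json.dumps({'count': n}))\n"
)

params = {"values": [1, 5, 9], "level": 4}
local_task = Task(params, InProcessSpec(count_above))
remote_task = Task(params, ExternalProcessSpec(EXTERNAL_SOURCE))
```

Both Task instances carry identical parameter descriptions; only the specification object decides where the code runs.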

7.3  Applications 

The  Tasks  that  will  be  supported  by  the  lUE  will  cover 
a  wide  range  of  algorithms  and  tools.  It  is  expected
that  the  set  of  Tasks  that  are  included  with  the  lUE  will 
expand  rapidly  as  the  lUE  begins  to  receive  wide  use  and 
the  research  groups  using  the  system  begin  to  contribute 
their  own  research  tools. 

The  following  categories  are  suggested  as  a  quick  first
pass  at  the  set  of  TaskGroups  that  will  be  supplied
as  standard  components  of  the  IUE.

Browsers  Interactive  graphical  tools  for  textual  and 
symbolic  examination. 

Display  Tools  for  display  of  image  and  symbolic  data. 

Editing  Tools  for  the  creation  and  manipulation  of  image
and  symbolic  data.

ImageProcessing  Tools  that  map  image  data  to  image 
data. 

ImageSegmentation  Tools  that  map  image  data  to 
symbolic  data. 

PerceptualOrganization  Grouping  tools  to  map  sym¬ 
bolic  data  to  symbolic  data. 

GeometricFitting  Tools  that  fit  geometric  entities  to 
symbolic  data. 

ObjectMatching  Tools  mapping  object  descriptions 
to  symbolic  data. 

ModelConstruction  Tools  for  creation  and  manipulation
of  object  descriptions.

7.4  Extensions 

There  is  a  large  variety  of  extensions  that  can  be  made
to  the  basic  Task  hierarchy.  One  such  class  of  extensions 
involves  control  of  complex  processing  models.  The  Task 
hierarchy  should  be  extended  to  provide  explicit  interac¬ 
tive  control  for  multithread  processes  and  for  massively 
parallel  processors.  The  hierarchy  should  also  support 
shared  memory  models  and  the  concept  of  server  tasks 
that  run  continually  and  provide  data  on  demand. 

A  second  class  of  enhancements  involves  software  and 
algorithm  development  tools.  The  Task  hierarchy  should 
support  complex  algorithm  development  tools  that  pro¬ 
vide: 


History  mechanisms  Integrated  development  tools 
that  allow  the  user  to  select  sequences  of  opera¬ 
tions  from  the  processing  history  and  then  rerun 
the  Tasks  with  new  parameterizations. 

Data  management  Database  tools  that  allow  the  user 
to  easily  store  and  retrieve  experimental  data  based 
on  the  processing  history  and  vice  versa. 

Debuggers  Debugging  tools  that  allow  single  stepping, 
error  trapping,  watchpoints,  etc.  over  the  Dataflow- 
Graphs  as  well  as  the  ability  to  step  into  the 
Task  processes  that  are  spawned  from  the  Dataflow- 
Graph. 

8  Image  Classes 

Images  are  the  most  basic  and  fundamental  concept  in 
computer  vision.  They  provide  the  sensory  link  between 
events  in  the  world  and  the  internal  representations  of 
these  events  within  the  system.  In  many  cases,  an  im¬ 
age  can  be  simply  viewed  as  a  two-  or  three-dimensional 
array,  as  in  the  case  of  static  grey-scale  and  RGB  color 
images.  This  section  preserves  this  notion  of  simplicity 
of  representation  as  far  as  possible,  while  also  account¬ 
ing  for  the  wide  variety  of  image  structures  found  in  the 
vision  literature. 

In  the  image  object  hierarchy  shown  in  Figure  6,  im¬ 
ages  fall  into  one  of  two  subclasses:  simple  or  com¬ 
posite.  A  simple-image  is  a  specialized  form  of  an  N- 
dimensional  array  (N  =  2,3,4,...).  A  great  variety 
of  common  forms  fall  into  this  class:  color  and  grey-scale
images,  range  images,  CAT  images,  image  sequences,
stereo  pairs,  etc.  The  composite-image  class
supports  images  that  do  not  fit  naturally  into  a  sin¬ 
gle  N-dimensional  framework,  such  as  image  pyramids, 
or  mosaics  composed  of  partially  overlapping  images. 
Composite-images  are  essentially  sets  of  images,  with 
additional  slots  and  methods  to  support  specialized  se¬ 
mantics.  Since  composite-images  are  themselves  made 
up  of  simple-images  and  composite-images,  ultimately 
all  pixel  data  resides  in  simple-images. 

In  addition  to  the  inherent  structure  of  images,  two 
other  considerations  (efficiency  and  redundancy)  have 
influenced  the  design  of  the  image  object  hierarchy.  The 
object  oriented  approach  makes  it  imperative  that  multi¬ 
ple  image  objects  be  able  to  share  pixel  data.  Otherwise, 
it  would  be  necessary  to  redundantly  represent  parts  of 
images  in  order  to  apply  methods  defined  for  these  parts. 
To  give  an  example,  consider  the  red  plane  of  an  RGB 
image.  It  has  the  same  form  as  a  grey-scale  image,  a  2- 
dimensional  array  of  values.  To  apply  a  grey-scale  image 
method  to  the  red  plane  data,  a  grey-scale  image  object
must  be  created.  To  avoid  redundant  copies  of  the  data, 
this  new  object  should  point  to,  and  hence  share,  the 
same  underlying  data. 

Sharing  is  a  desirable  property  of  arrays  in  general, 
not  just  images.  This,  in  addition  to  efficiency  consid¬ 
erations,  is  why  arrays  are  included  here.  All  the  mech¬ 
anisms  that  support  data  sharing  between  multiple  ob¬ 
jects  are  provided  by  the  subclass  shared-array,  and  are 
thus  inherited  by  simple-images.  Shared-array  objects, 
in  turn,  use  raw-data  objects  to  implement  sharing;  two
or  more  shared-arrays  share  data  by  pointing  to  a  common
raw-data  object.

There  are  three  distinct  interfaces  associated  with 
simple-images.  Simple-images  inherit  the  image  inter¬ 
face  from  the  root  image  class.  They  inherit  an  array 
interface  from  the  array  class.  Finally,  a  very  low  level 
but  very  efficient  interface  is  provided  through  the  raw- 
data  objects  themselves. 

From  the  root  image  class,  simple-images  inherit  slots 
and  methods  associated  with  the  sensor  used  to  produce
the  image.  Also  from  the  root,  simple-images  inherit
pixel  semantics,  including  pixel  access.  Since  most,
though  not  all,  image  sensors  utilize  a  two  dimensional
sensor  grid,  and  since  pixels  are  commonly  associated
with  individual  cells  in  this  grid,  most  images  are  two
dimensional.  To  illustrate,  get-pixel  with  arguments  (x, 
y)  returns  a  three  place  vector  for  an  RGB  image. 

From  the  shared-array  class  simple-images  inherit 
methods  for  multi-dimensional  array  access.  They  also 
inherit  the  map-slice  and  map-copy  methods.  These 
methods  create  new  images  which  map  onto  subspaces  of 
the  first.  The  map-slice  method  does  this  without  copy¬ 
ing  the  data  and  hence  the  new  objects  share  elements 
with  the  original.  To  illustrate,  an  RGB  image  is  a  three 
dimensional  shared-array.  Array  access  with  three  ar¬ 
guments,  (x,  y,  c),  may  be  used  to  retrieve  values  from 
the  image.  The  map-slice  facility  may  be  used  to  create 
a  new  simple-image  containing  only  the  data  associated 
with  a  single  color  plane.  As  a  second  illustration,  map- 
slice  may  be  used  to  generate  a  temporal  slice  through 
a  motion  sequence.  The  map-slice  facility  is  quite  gen¬ 
eral,  and  can  be  used  to  create  a  new  image  containing 
elements  in  any  subrange  of  the  original  image.  It  also 
supports  subsampling. 

The  third  interface  is  directly  through  the  underlying 
raw-data  object.  A  raw-data  object  is  a  first  class  object,
but  it  is  not  an  object  the  casual  user  should  be  concerned
with.  It  is  provided  for  the  benefit  of  sophisticated  users
who  need  direct  access  to  data.  A  raw-data  object  maintains
a  pointer  to  a  block  of  memory,  or  perhaps  a  set
of  blocks.  It  keeps  a  list  of  all  other  objects  which  point
to  this  memory.  It  also  maintains  information  about  the
layout  of  the  data  within  the  block.  Layouts  for  many  of
the  standard  types  of  shared-arrays  will  be  published  in
appendices  to  the  IUE  Programmers  Guide.  Using  this
information,  a  systems  programmer  can  write  code  that 
performs  its  own  pointer  arithmetic  to  directly  access 
the  data.  Shared-arrays  interpret  the  data  in  a  raw-data 
block  using  a  set  of  index  vectors.  Each  shared-array  has 
its  own  set  of  index  vectors,  and  uses  them  to  map  be¬ 
tween  multi-dimensional  queries  and  a  single  sequential 
index  into  the  raw-data  block. 
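One plausible reading of the index-vector scheme is that each dimension has its own table of offsets, and the sequential index is the sum of one entry from each table. The sketch below follows that reading; `RawData`, `SharedArray`, `dense` and the `base` offset are hypothetical names, and the real layouts are left to the published appendices.

```python
class RawData:
    """Flat block of values plus a list of the arrays that point at it."""

    def __init__(self, values):
        self.values = list(values)
        self.users = []


class SharedArray:
    def __init__(self, raw, index_vectors, base=0):
        self.raw = raw
        self.index_vectors = index_vectors   # one offset table per dimension
        self.base = base                     # offset contributed by fixed axes
        raw.users.append(self)

    @classmethod
    def dense(cls, values, shape):
        """Allocate a block and build row-major index vectors for it."""
        vectors, stride = [], 1
        for extent in reversed(shape):
            vectors.insert(0, [i * stride for i in range(extent)])
            stride *= extent
        return cls(RawData(values), vectors)

    def _flat(self, index):
        return self.base + sum(vec[i] for vec, i in zip(self.index_vectors, index))

    def get(self, *index):
        return self.raw.values[self._flat(index)]

    def set(self, *index, value):
        self.raw.values[self._flat(index)] = value

    def map_slice(self, *selectors):
        """An int selector fixes an axis; a slice keeps a (sub)range of it."""
        base, vectors = self.base, []
        for vec, sel in zip(self.index_vectors, selectors):
            if isinstance(sel, int):
                base += vec[sel]
            else:
                vectors.append(vec[sel])     # slicing also handles subsampling
        return SharedArray(self.raw, vectors, base)


# A 2 x 2 RGB image indexed as (x, y, c); the values 0..11 are arbitrary.
rgb = SharedArray.dense(range(12), (2, 2, 3))
red = rgb.map_slice(slice(None), slice(None), 0)   # shares the raw block
```

Writing through `red` changes what `rgb` reads at the same pixel, because only the index vectors differ between the two objects.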

Composite-images,  unlike  simple-images,  do  not  have 
a  well  defined  multi-dimensional  form.  A  composite- 
image  is  essentially  just  an  ordered  set  of  images.  The 
class  is  important  because  it  provides  flexibility.  For  ex¬ 
ample,  there  is  no  efficient  way  to  represent  an  image 
pyramid  within  an  N-dimensional  array.  However,  since 
composite-images  are  not  shared-arrays,  the  slice  meth¬ 
ods  are  not  available. 

The  methods  defined  for  the  image  object  class  include
those  required  for  creating,  deleting,  modifying,  accessing,
querying,  and  displaying  instances  of  the  object
class.  The  image  object  class  interacts  with  the  other
objects  in  the  IUE  system  through  the  methods  which
create  instances  of  these  objects,  e.g.  regions,  edges,
lines,  and  various  other  geometric  and  non-geometric
features.  An  open  question  before  the  lUE  committee  is, 
among  the  perhaps  thousands  of  candidate  algorithms  in 
the  vision  literature,  which  will  become  the  basic  meth¬ 
ods  of  the  core  lUE  system.  One  possible  solution  is 
to  include  in  the  core  those  basic  techniques  which  are 
common  to  a  wide  variety  of  algorithms,  such  as  convolu¬ 
tion,  histogram  creation  and  manipulation,  filtering  and 
enhancement,  etc.  The  more  specialized  methods,  such 
as  the  Ohlander-Price  region  segmentation  algorithm  or 
the  Burns  line  extraction  algorithm,  would  be  included 
in  libraries  of  methods  available  outside  the  core  system. 

9  Image  Intensity  Features 

9.1  Edge  Events 

9.1.1  Scope 

Image  features  play  a  central  role  in  image  understand¬ 
ing  research.  The  reliable  detection  of  image  intensity 
boundaries  has  been  the  goal  of  many  image  understand¬ 
ing  research  projects.  The  approach  taken  in  the  lUE  is 
to  provide  a  basic  framework  for  representing  intensity 
events  and  to  supply  the  most  widely  used  image  feature 
segmentation  algorithms. 

A  great  deal  of  initial  discussion  during  the  formulation
of  the  IUE  centered  on  approaches  to  the  representation
of  an  edgel,  since  there  is  a  diverse  spectrum  of
opinion  within  the  lU  community  as  to  the  proper  de¬ 
scription  of  an  intensity  event. 

The  description  of  step  edge  elements,  and  other 
boundary  events,  in  the  IUE  is  in  terms  of  the  geometry
of  the  intensity  surface  in  the  neighborhood  of
the  event.  The  need  of  specific  algorithms  for  additional
attributes  is  met  by  providing  a  flexible  attribute  mech¬ 
anism  so  that  special  properties  can  be  attached  to  the 
basic  geometry  description.  A  major  contribution  of  the 
lUE  committee  is  to  define  a  standard  set  of  attribute 
names  so  that  other  algorithms  associated  with  image 
events  can  rely  on  a  consistent  definition  of  attributes. 

9.1.2  The  Edgel 

The  class  hierarchy,  shown  in  figure  7,  illustrates  a 
definition  for  a  step  edge.  The  description  includes  the 
position  and  orientation  of  the  edge  as  well  as  attributes 
of  the  intensity  cross  section  at  the  edge  point.  Examples
of  such  parameters  are  the  average  intensity,  the  intensity
change  across  the  edge  and  the  slope  of  intensity  along
the  boundary.  The  definition  of  the  edge  filtering  kernel  is
provided  by  the  lUE  filter  object  hierarchy.  The  edgel 
object  lists  the  kernel  type  as  an  optional  attribute. 

It  is  often  necessary  to  provide  statistics  of  these  prop¬ 
erties  in  the  neighborhood  around  the  edge  position.  A 
convenient  approach  is  to  define  a  covariance  matrix 
which  expresses  the  variance  and  correlation  between 
image  features.  These  statistics  can  be  used  later  in  a 
decision  procedure  for  classifying  the  image  events. 
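As a rough illustration, an edgel record and the covariance of two of its properties over a neighborhood could be written as below. The field names are assumptions made for this sketch; the standard attribute names are exactly what the IUE committee is meant to define.

```python
from dataclasses import dataclass, field


@dataclass
class Edgel:
    x: float
    y: float
    orientation: float        # radians
    mean_intensity: float     # average intensity at the edge point
    contrast: float           # intensity change across the edge
    attributes: dict = field(default_factory=dict)   # e.g. optional kernel type


def covariance(edgels, names=("mean_intensity", "contrast")):
    """Sample covariance matrix of the named properties over a neighborhood."""
    n = len(edgels)
    if n < 2:
        raise ValueError("need at least two edgels")
    rows = [[getattr(e, name) for name in names] for e in edgels]
    means = [sum(col) / n for col in zip(*rows)]
    size = len(names)
    return [
        [
            sum((r[i] - means[i]) * (r[j] - means[j]) for r in rows) / (n - 1)
            for j in range(size)
        ]
        for i in range(size)
    ]
```

The diagonal holds the variance of each property and the off-diagonal entries hold their covariance, which a later classifier could normalize into correlations.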


9.1.3  Spatial  Index 

It  is  usually  necessary  to  group  edgels  into  a  connected 
chain  according  to  a  geometric  relation  such  as  collinear- 
ity.  This  grouping  process  is  facilitated  by  a  spatial  in¬ 
dexing  mechanism.  In  the  case  of  collinearity,  an  appro¬ 
priate  spatial  index  is  the  Hough  array.  If  the  edgels  are 
sorted  into  the  array,  then  grouping  on  collinearity  be¬ 
comes  quite  efficient.  The  design  allows  for  other  types 
of  grouping,  such  as  convexity,  with  the  definition  of  an 
associated  spatial  index. 
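A minimal sketch of a Hough array used as a spatial index follows. It assumes each edgel is reduced to a tuple (x, y, theta) with theta the direction of the edge normal, and it uses a dictionary of bins in place of a dense array; the bin sizes are arbitrary defaults.

```python
import math
from collections import defaultdict


def hough_index(edgels, rho_step=2.0, theta_bins=36):
    """Bin (x, y, theta) edgels by the line x*cos(theta) + y*sin(theta) = rho."""
    cells = defaultdict(list)
    for x, y, theta in edgels:
        theta %= math.pi                  # a normal and its opposite give one line
        rho = x * math.cos(theta) + y * math.sin(theta)
        t_bin = int(theta / math.pi * theta_bins) % theta_bins
        r_bin = math.floor(rho / rho_step)
        cells[(t_bin, r_bin)].append((x, y, theta))
    return cells
```

Edgels that fall into the same cell are candidates for one collinear chain. A practical index would also inspect neighboring cells, since quantization can split a line across a bin boundary.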

9.1.4  Edge  Sequences 

The  grouping  of  edgels  has  as  a  goal  the  formation  of 
a  one  dimensional  chain  which  terminates  at  junctions 
or  end  points.  Junctions  are  formed  by  the  intersec¬ 
tion  of  two  or  more  edge  chains  and  endpoints  are  where 
a  boundary  simply  terminates.  These  sequences  may 
have  missing  edge  segments  where  there  was  no  strong
evidence  for  a  boundary,  but  the  grouping  mechanism 
introduced  a  link  for  the  purposes  of  continuity. 

9.1.5  Extensions 

There  are  many  variations  on  the  basic  representation 
described  here.  In  the  case  of  range  data,  it  is  desir¬ 
able  to  provide  a  second  order  surface  description  of  the 
neighborhood  around  the  edge  event.  The  average  sur¬ 
face  normal,  and  curvatures  on  each  side  of  the  boundary 
can  be  provided.  Similarly,  the  details  of  a  peak  type  of 
edge  event  will  require  descriptions  of  curvature  in  the 
edgel  neighborhood.  Also  the  position  and  orientation 
parameters  should  include  provision  for  subpixel  interpolation.

9.2  Regions

9.2.1  Scope

A  Region  is  an  area  of  an  image  (or  surface)  produced 
by  operations  such  as  histogram  partitioning,  region-growing,
model  surface  projection,  and  many  others.  Regions
are  two  dimensional  bounded  surfaces  and  have
many  common  methods  with  the  general  surface  object. 
For  example,  the  composite  regions  produced  by  image 
segmentation  are  a  multiply  connected  face  with  a  set 
of  1-cycles  enclosing  pixel  areas  in  the  image  plane.  Re¬ 
gions  have  been  specialized  since  they  are  a  basic  type  of 
feature  extracted  from  images  and  projected  from  sur¬ 
faces  and  volumes.  For  example,  composite  regions  have 
attributes  and  methods  to  access  properties  of  adjacent 
regions  sharing  a  common  boundary  for  use  in  grouping 
operations.  Non-geometric  attributes  can  be  added  to 
regions  using  the  mechanisms  for  associating  attributes 
and  values  with  the  general  lUE  object.  The  mapping 
composition  method  is  used  to  register  a  region  with 
an  image  and  to  access  the  corresponding  image  values. 
This  is  used  for  computing  attributes  such  as  the  aver¬ 
age  or  variance  of  image  features  in  the  area  defined  by 
a  region. 

Besides  being  a  basic  feature  type,  the  IUE  region  class
is  interesting  for  several  reasons:

•  Regions  are  very  similar  to  several  other  IUE  classes
such  as  curves,  surfaces,  and  geometric  features.
This  has  suggested  underlying  abstractions  common
to  all  these  objects.

Figure  7:  A  partial  hierarchy  for  edge  events.

•  Regions  are  useful  for  spatial  queries,  especially  for 
grouping  operations 

•  Many  image  operations  can  be  specified  as  though 
they  occur  with  respect  to  a  region  which  is  then 
iteratively  positioned  and  registered  with  an  image. 
A  typical  example  is  a  mask  or  extracting  some  portion
of  one  image  as  a  region  and  then  matching  it
with  another  image. 

9.2.2  Conceptual  Description 

The  general  region  object  is  described  by  a  closed 
outer  boundary,  a  set  of  interior  boundaries,  a  coordinate
system,  and  a  set  of  points  contained  in  the  region.
Not  all  of  these  are  necessarily  specified  for  a
particular  type  of  region  (in  fact,  a  region  can  just  be
described  as  a  point-set  of  image  positions  it  occupies).
Some  regions  do  not  require  an  explicit  boundary  (BitMask
Regions).  Some  do  not  require  an  explicit  point-set
for  points  contained  in  the  region  (Analytic  Regions).
Some  do  not  require  an  explicit  reference  to  a  coordi¬ 
nate  system  because  the  coordinate  system  will  default 
to  that  of  the  image  from  which  the  regions  were  ex¬ 
tracted.  The  general  region  object  has  attributes  for 
Area,  Number-of-Holes,  Minimum-Bounding-Rectangle, 
Centroid,  Scatter-Matrix-of-Pixel-Positions,  Compact¬ 
ness.  These  also  may  or  may  not  be  specified  in  a  region 
instance.  This  list  is  by  no  means  exhaustive.  The  lUE 
will  define  a  large  number  of  other  region  attributes  such 
as  intensity  co-occurrence  matrices,  texture  primitives 
and  color  classification  parameters. 
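For a region given only as a point-set, several of the listed attributes follow directly from the pixel positions. The function below is a sketch with hypothetical names; Number-of-Holes and Compactness are left out because they need boundary or connectivity information.

```python
def region_attributes(points):
    """Area, bounding rectangle, centroid and scatter matrix of a pixel set."""
    pts = sorted(set(points))
    n = len(pts)
    if n == 0:
        raise ValueError("empty region")
    xs = [x for x, _ in pts]
    ys = [y for _, y in pts]
    cx, cy = sum(xs) / n, sum(ys) / n
    sxx = sum((x - cx) ** 2 for x in xs) / n
    syy = sum((y - cy) ** 2 for y in ys) / n
    sxy = sum((x - cx) * (y - cy) for x, y in pts) / n
    return {
        "area": n,                                          # pixel count
        "bounding_rectangle": (min(xs), min(ys), max(xs), max(ys)),
        "centroid": (cx, cy),
        "scatter": [[sxx, sxy], [sxy, syy]],
    }
```

Because each attribute may or may not be specified on a region instance, an implementation could compute these lazily and cache them on first access.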

Region  boundaries  are  instances  of  curve  objects.  Dif¬ 
ferent  subclasses  of  regions  are  distinguished  by  the  types 
of  closed  curves  which  describe  the  boundary:  a  pixel 
chain;  a  parametric  curve;  a  connected  sequence  of  line- 
segments;  a  piece-wise  polynomial.  Many  of  the  instan¬ 
tiation  methods  for  regions  involve  filling  a  closed  curve 
or  fitting  a  region  to  a  set  of  points  or  a  set  of  curves. 
Boundary  type  also  affects  how  one  determines  whether  a  point
is  inside  a  region.

Analytic,  composite  boundary,  and  discrete  regions 
correspond  to  the  similar  classes  found  in  the  curve  class.
BitMap  Regions  are  binary  arrays  with  values  corre¬ 
sponding  to  which  pixels  are  occupied  in  the  minimum 
bounding  rectangle  of  the  region.  They  also  have  a  spec¬ 
ified  position  with  respect  to  an  image.  BitMap  Regions 
can  be  used  for  rapid  determination  of  intersections,  and 
searching  operations. 
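The reason a BitMap Region supports rapid intersection tests is that only the overlap of the two bounding rectangles has to be scanned. A small sketch, with a hypothetical class layout (rows of 0/1 values plus the image position of the mask origin):

```python
class BitMapRegion:
    """Binary mask over the minimum bounding rectangle, anchored in an image."""

    def __init__(self, x0, y0, rows):
        self.x0, self.y0 = x0, y0              # image position of mask cell (0, 0)
        self.rows = [list(r) for r in rows]    # rows[y][x] is 0 or 1

    def bounds(self):
        """Half-open rectangle (x_min, y_min, x_max, y_max) in image coordinates."""
        height = len(self.rows)
        width = len(self.rows[0]) if height else 0
        return self.x0, self.y0, self.x0 + width, self.y0 + height

    def contains(self, x, y):
        i, j = y - self.y0, x - self.x0
        return (0 <= i < len(self.rows)
                and 0 <= j < len(self.rows[i])
                and bool(self.rows[i][j]))

    def intersection(self, other):
        """Pixels occupied by both regions; scans only the rectangle overlap."""
        ax0, ay0, ax1, ay1 = self.bounds()
        bx0, by0, bx1, by1 = other.bounds()
        return {
            (x, y)
            for y in range(max(ay0, by0), min(ay1, by1))
            for x in range(max(ax0, bx0), min(ax1, bx1))
            if self.contains(x, y) and other.contains(x, y)
        }
```

When the bounding rectangles do not overlap, both ranges are empty and the test ends without touching any mask cell.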

The  methods  associated  with  regions  fall
into  several  basic  types:

•  Creation  methods:  These  are  for  creating  instances 
of  region  objects.  Many  of  these  involve  creating 
regions  of  one  subclass  from  another  subclass.  The 
different  analytic  classes  can  be  instantiated  by  fitting
them  to  a  discrete  region.  A  discrete  region
can  be  instantiated  by  sampling  an  analytic  region
at  some  resolution.  A  polygon  can  be  instantiated
from  a  sequence  of  points  or  from  a  convex  hull  fitting
routine  applied  to  a  point  set.  Composite  regions
can  be  instantiated  from  a  connected  components
image.

•  Combination  methods:  These  are  for  creating  re¬ 
gions  from  a  set  of  regions  using  operations  such  as 
intersection,  union,  difference,  and  merging  (combining
adjacent  regions  into  a  composite  region).

•  Mapping  methods:  These  are  general  methods  based
upon  the  semantics  of  mathematical  mappings  for
operations  such  as  composition,  embedding,  and  iteration.
For  example,  a  region  can  be  composed
with  an  image  such  that  positions  in  the  region  can
be  used  to  access  values  in  the  image.  Or  a  region
can  be  iteratively  moved  with  respect  to  an  image
to  perform  an  operation  with  respect  to  a  swept  area.
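The composition and iteration mappings can be illustrated with a region stored as a list of (x, y) positions and an image stored as rows of values. The function names and the `offset` convention are assumptions made for this sketch only.

```python
def compose(region_points, image, offset=(0, 0)):
    """Image values at the region's positions after shifting by (dx, dy)."""
    dx, dy = offset
    height, width = len(image), len(image[0])
    return [
        image[y + dy][x + dx]
        for x, y in region_points
        if 0 <= y + dy < height and 0 <= x + dx < width
    ]


def mean_and_variance(values):
    """Population mean and variance; assumes at least one value."""
    n = len(values)
    mean = sum(values) / n
    return mean, sum((v - mean) ** 2 for v in values) / n


def sweep(region_points, image, offsets):
    """Re-register the region at each offset and summarize the covered pixels."""
    return {
        off: mean_and_variance(compose(region_points, image, off))
        for off in offsets
    }
```

For example, with `image = [[1, 2, 3], [4, 5, 6], [7, 8, 9]]` and `region = [(0, 0), (1, 0)]`, `compose(region, image)` reads the values 1 and 2, and sweeping to offset `(1, 1)` reads 5 and 6 instead.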

A  partial  hierarchy  for  region  geometries  is  shown  in 
figure  8. 

9.2.3  Extensions

Additional  study  is  required  to  arrive  at  a  suitable  ob¬ 
ject  hierarchy  to  reflect  the  variations  in  region  grouping 
mechanisms  and  to  account  for  different  region  elements. 
It  is  expected  that  regions  will  branch  similarly  to  image 
object  types,  e.g.  intensity  regions,  RGB  regions.  At 
the  time  of  this  writing  little  attention  has  been  given 
to  texture  properties  and  texture  grouping  mechanisms. 
It  is  expected  that  the  final  hierarchy  will  express  the 
basic  texture  grouping  mechanisms  and  texture  element 
categories. 

10  Geometric  Features 

The  types  of  IUE  objects  that  are  described  under  the
heading  of  Geometric  Features  are  a  subclass  of  general
Spatial  Objects  and  include  simple  features  (e.g.  lines)
extracted  from  an  image  and  large  or  small  collections
or  combinations  of  these  kinds  of  features.  A  sample  hierarchy
is  shown  in  figure  9.  Since  these  are  the  types  of
features  that  play  a  role  in  the  low-level  to  mid-level  processing
of  image  understanding,  these  features  interact
with  other  object  classes  that  provide  low-level  and  mid¬ 
level  representations.  A  collection  of  geometric  features 
is  usually  derived  from  an  image  object  and  shares  prop¬ 
erties  of  an  image,  but  not  usually  the  iconic  represen¬ 
tation  associated  with  an  image  object.  Image  features, 
such  as  in  the  edgel  class,  are  extracted  from  the  image 
and  processed  to  form  the  line  or  curve  segments  dis¬ 
cussed  here.  Lines  and  curves  are  general  object  classes 
also  discussed  in  another  section  and  form  the  basis  for 
the  descriptions  used  here.  Other  image  derived  features 
such  as  regions  are  also  described  elsewhere. 

10.1  Basic  Classes  of  Geometric  Features 

We  have  divided  geometric  features  into  three  primary 
classes  of  objects:  the  curve  segment  (an  interval  on  a
curve  bounded  by  two  vertices),  the  curve  segment  group
(a  grouping  of  individual  curve  segments  by  some  criterion),
and  the  geometric  feature  image  (the  set  of  such  features
extracted  from  an  image).  The  first  is  the  basic  element 
used  in  the  later  descriptions,  the  second  is  a  small  col¬ 
lection  of  such  elements,  and  the  third  is  a  large  scale 
collection  of  both  basic  and  grouped  elements.

Figure  8:  A  partial  hierarchy  for  image  regions.


Figure  9:  An  object  hierarchy  for  geometric  features.

The  curve  segment  class,  and  its  more  common  specialization
the  line  segment  class,  stores  the  representation
of  curves  (or  lines)  extracted  from  an  image.  The
simplest  form  of  this  representation  needs  only  the  be¬ 
ginning  and  ending  points  of  the  interval,  and  a  rep¬ 
resentation  of  the  curve  for  general  curves.  In  normal 
usage,  other  information  must  be  stored  because  of  the 
difficulty  of  recomputing  without  resorting  to  other  in¬ 
formation  (e.g.  integrated  edge  strength,  list  of  corre¬ 
sponding  edge  elements)  or  because  of  the  expense  of 
recomputing  frequently  used  values  (e.g.  length,  orien¬ 
tation).  We  have  thus  defined  a  more  complete  repre¬ 
sentation  of  curve  and  line  segments  than  will  normally 
be  used  in  order  to  accommodate  these  other  values  and 
to  give  names  for  the  value  access  or  value  computing 
functions.  A  second  class  associated  with  the  segment 
class  is  the  linked  segment  class  which  connects  segments 
into  ordered  lists  such  as  would  be  derived  when  chains 
of  edge  elements  are  converted  into  straight  line  approx¬ 
imations. 

The  curve  segment  group  class  includes  small  collec¬ 
tions  of  segments  that  have  been  grouped  according  to 
some  grouping  function.  In  general  this  object  class  al¬ 
lows  for  an  arbitrary  grouping  function  with  an  arbitrary 
number  of  grouped  elements,  but  the  more  common  seg¬ 
ment  groups  are  those  with  two  or  three  segments  such 
as  anti-parallels  (two  nearby  segments  with  opposite  di¬ 
rection),  junctions  of  lines  (L-junction,  Y-junction,  etc.), 
etc.  where  the  grouping  function  is  implied  by  the  object 
class.  Other  more  complex  perceptual  groups  contain 
similar  information,  but  also  must  contain  a  description 
of  the  perceptual  grouping  function,  which  in  an  extreme 
case  may  be  a  pointer  (e.g.  a  text  string)  to  a  large  user 
program  with  certain  arguments  specified. 

An  image  is  used  to  store  an  iconic  representation  of 
some  scene  produced  by  a  sensor.  A  geometric  feature 
image is a collection of geometric features that correspond
to some image, and thus has many of the
properties associated with an image in the IUE. It
is treated differently from a perceptual group whose
grouping function is based on derivation from some image
because it is the common form for referring to the
large collection of geometric features generated while
processing an image.

10.2  Applications  of  Geometric  Feature 
Classes 

A  typical  use  of  these  object  classes  is  in  the  extraction 
of  ribbon-like  structures  in  an  image.  One  possible  way 
for  this  to  proceed  is  to  apply  an  edge  operator  to  an 
image followed by a procedure to extract straight line
segments (a specialized form of the curve segment class)
from the set of linked (adjacent) edge elements. This
line  segment  image  is  then  processed  to  extract  a  large 
number  of  pairs  of  segments  that  are  anti-parallel  and  a 
specified  distance  apart  (according  to  the  ribbon  width), 
which  are  objects  of  the  apar  feature  class  and  are  stored 
in another geometric feature image. These basic (two element)
groups are then combined to form a small number
of much larger perceptual groups (members of a perceptual
group class that stores the members and the grouping
function) using a complex process that depends on
collinearity, properties of gaps, etc., and these few groups
are then used by the program for further processing.
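
The anti-parallel pairing step described above can be sketched as follows. This is a minimal illustration in Python, not IUE code: the names `LineSegment` and `find_apars`, the tolerance values, and the use of a midpoint-to-line distance as the ribbon width measure are all assumptions made for the example. The sketch assumes segments of non-zero length and does not check that the two segments overlap along their length, which a practical grouping function would probably require.

```python
import math
from dataclasses import dataclass


@dataclass(frozen=True)
class LineSegment:
    """Directed straight line segment (hypothetical stand-in for the IUE class)."""
    x0: float
    y0: float
    x1: float
    y1: float

    def direction(self):
        # Orientation of the directed segment in radians.
        return math.atan2(self.y1 - self.y0, self.x1 - self.x0)

    def midpoint(self):
        return ((self.x0 + self.x1) / 2.0, (self.y0 + self.y1) / 2.0)


def angle_diff(a, b):
    """Smallest absolute difference between two angles, in [0, pi]."""
    d = abs(a - b) % (2.0 * math.pi)
    return min(d, 2.0 * math.pi - d)


def separation(s, t):
    """Perpendicular distance from the midpoint of t to the line through s."""
    mx, my = t.midpoint()
    dx, dy = s.x1 - s.x0, s.y1 - s.y0
    return abs(dx * (my - s.y0) - dy * (mx - s.x0)) / math.hypot(dx, dy)


def find_apars(segments, width, width_tol=1.0, angle_tol=math.radians(10)):
    """Return pairs of segments with opposite direction, about `width` apart."""
    pairs = []
    for i, s in enumerate(segments):
        for t in segments[i + 1:]:
            opposite = angle_diff(s.direction(), t.direction() + math.pi) <= angle_tol
            if opposite and abs(separation(s, t) - width) <= width_tol:
                pairs.append((s, t))
    return pairs


segments = [
    LineSegment(0, 0, 10, 0),   # left to right
    LineSegment(10, 3, 0, 3),   # right to left, 3 units away: anti-parallel
    LineSegment(0, 8, 10, 8),   # same direction as the first segment
]
apars = find_apars(segments, width=3.0)
```

With a requested ribbon width of 3, only the first two segments are paired. The second and third segments also have opposite directions, but they are 5 units apart and fall outside the width tolerance.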

10.3  Expected  and  Planned  Extensions 

Since one of the goals of the IUE is extensible descriptions,
extensions must always be allowed. These classes are
meant to serve first as a basis for implementing programs
and second as a model for adding new object classes
when needed. Obvious extensions include the addition
of other line junction types if needed for a particular application,
other (non-intersecting) small segment groups,
and specialized groups such as symmetries. These could
all be added by modifying the similar classes that are
already defined. The more complex extension to higher
dimensions would require adding new classes that inherit
from the existing classes and from the higher dimension
spatial feature class. Few, if any, slots would be added
since the endpoints are already treated as a vertex, which
can already be of any dimension.

11  Curves 

11.1  Scope 

In this section we describe the objects needed to represent
curves as they are used in the computer vision literature. We provide a brief
description of the objects and the motivation behind our
choices. The objects described include arc representations
as well as data structures, such as points and
pointsets, from which such arc representations may be
computed. A portion of the object hierarchy for curves
is shown in Figure 10.

11.2  Points  and  Pointsets 

A  point  is  a  spatial-object.  We  classify  points  as  at¬ 
tributed  or  non-attributed  points.  A  non-attributed 
point  contains  only  the  coordinates.  The  attributes  are 
required  since  many  algorithms  use  additional  informa¬ 
tion  such  as  edge  direction,  edge  contrast,  gray  tone  sur¬ 
face  curvature,  etc.  An  attributed  point  is  a  simple  point 
with  added  attributes.  Specifically,  an  attributed  point 
will  have  an  attribute  list  that  specifies  what  attributes 
are stored and a pointer to an area of memory containing
attribute values. The various attributes allowed may
include (but are not limited to) edge direction, edge contrast,
curvature, accuracy/tolerance zone for a point, covariance
matrix, transformation matrix, etc. Algorithms
such  as  the  Hough  transform  method  use  edge  direction 
as  well  as  the  edge  strength.  Other  recent  robust  tech¬ 
niques  that  propagate  the  uncertainty  from  the  low-level 
stage  need  the  covariance  matrix  or  the  tolerance  zone  in¬ 
formation.  The  transformation  matrix  information  may 
be  needed  to  just  indicate  that  the  algorithms  operating 
on  the  data  have  to  use  the  transformed  values  while 
leaving  the  original  coordinate  values  stored  intact. 
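
As an illustration of the distinction drawn above, the following Python sketch models a non-attributed point and an attributed point. The class names, the attribute keys, and the use of a dictionary in place of an attribute list plus a pointer to attribute memory are assumptions made for the example, not part of the IUE specification. The `transformed_coords` method shows one possible reading of the transformation matrix attribute: algorithms see the transformed values while the stored coordinates remain intact.

```python
from dataclasses import dataclass, field
from typing import Any, Dict, Tuple


@dataclass
class Point:
    """Non-attributed point: coordinates only."""
    coords: Tuple[float, ...]


@dataclass
class AttributedPoint(Point):
    """Point plus a named attribute store (hypothetical layout)."""
    attributes: Dict[str, Any] = field(default_factory=dict)

    def attribute_list(self):
        # Names of the attributes actually stored for this point.
        return sorted(self.attributes)

    def transformed_coords(self):
        # Apply the stored transformation matrix, if any; coords stay untouched.
        m = self.attributes.get("transformation_matrix")
        if m is None:
            return self.coords
        n = len(self.coords)
        return tuple(sum(row[j] * self.coords[j] for j in range(n)) for row in m)


p = AttributedPoint((2.0, 3.0), {
    "edge_direction": 1.57,
    "edge_contrast": 40.0,
    "transformation_matrix": [[0.0, -1.0], [1.0, 0.0]],  # 90 degree rotation
})
rotated = p.transformed_coords()
```

After the call, `p.coords` still holds the original values, and `rotated` holds the coordinates that an algorithm honoring the transformation would use.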

A  pointset  is  a  collection  of  points  that  may  be  ordered 
or  unordered.  Points  in  an  ordered-pointset  are  ordered 
as  they  are  encountered  along  an  arc-sequence.  Ordered 
pointsets,  in  general,  are  the  inputs  to  algorithms  that 
perform arc-segmentation. An ordered pointset is a digital
arc. A pointset may be attributed. Each point in
the attributed pointset may be attributed, or there may
be a collection of attribute values that are common to
all points in the set.

Figure 10: A partial hierarchy for curves.

11.3  Arcs 

An  arc  is  a  spatial-object.  An  arc  may  be  a  digital-arc 
(an ordered sequence of points) or an analytical-arc (analytic
representation of an arc). The class “arc” inherits
all attributes of “IUE-object” and in addition stores the coordinates
of the starting point and the ending point of
the arc. An ordered-pointset is a digital-arc. A digital-arc-chain
is a segmentation of the digital-arc into its continuous
subpieces. A typical arc-segmentation algorithm
would take a digital-arc as input and produce a digital-arc-chain
as output. Some algorithms take a digital-arc-chain
as input and produce a new digital-arc-chain as
output. An example of such an algorithm would be the
breakpoint  optimization  algorithm  which  operates  on  a 
segmented  arc  and  moves  the  breakpoints  to  produce  an 
output  segmented  arc. 

An analytical-arc, an analytic representation of an arc,
can be non-parametric or parametric. We use the term
“parametric”  in  the  sense  that  the  coordinate  values  of 
the  points  in  the  curve  are  functions  of  a  parameter  t  that 
takes  on  values  in  the  interval  (0, 1).  We  use  the  term 
“non-parametric”  to  include  all  other  representations  of 
the  curve.  For  example,  an  analytical-arc  may  be  spec¬ 
ified:  indirectly  as  the  intersection  of  two  surfaces,  or 
directly by giving the equation of the curve. Analytical-arcs
can be lines, conics, or splines. An analytical-arc
may be specified implicitly or explicitly. For example,
a line in its standard form is specified by its starting
and ending points and a line in its implicit form is specified
as the intersection of two planar surfaces. Similarly,
the  general  representation  for  a  conic  is  as  a  polyno¬ 
mial.  Often  algorithms  for  curve  fitting  use  alternative 
representations.  For  example,  a  circle  is  specified  by  its 
center  and  radius,  and  an  ellipse  may  be  specified  by  the 
lengths  of  its  major  and  minor  axes,  the  orientation  of 
the major axis and its center point. A parametric-spline-segment
is the parametric polynomial representation of
an arc segment. A spline is represented as a sequence
of parametric-spline-segments. The segments are delimited
by the breakpoints along the arc. From the general
object “spline” one may derive specialized instances of
splines depending on the spline type, spline dimension,
terminating conditions, etc.
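
Two of the representations mentioned above can be made concrete with a short Python sketch. The function names are hypothetical. `line_point` evaluates a parametric line segment at a parameter t, and `circle_to_conic` converts the center and radius form of a circle into the coefficients of the general conic polynomial A x^2 + B xy + C y^2 + D x + E y + F = 0, so that the same circle is available in both the fitting-oriented and the general form.

```python
def line_point(p0, p1, t):
    """Point on the segment from p0 to p1 at parameter t (0 gives p0, 1 gives p1)."""
    return tuple((1.0 - t) * a + t * b for a, b in zip(p0, p1))


def circle_to_conic(cx, cy, r):
    """Coefficients (A, B, C, D, E, F) of the conic polynomial for a circle."""
    return (1.0, 0.0, 1.0, -2.0 * cx, -2.0 * cy, cx * cx + cy * cy - r * r)


def conic_value(coeffs, x, y):
    """Evaluate the conic polynomial; zero means (x, y) lies on the curve."""
    A, B, C, D, E, F = coeffs
    return A * x * x + B * x * y + C * y * y + D * x + E * y + F


mid = line_point((0.0, 0.0), (4.0, 2.0), 0.5)    # midpoint of the segment
conic = circle_to_conic(1.0, 2.0, 5.0)           # circle of radius 5 about (1, 2)
on_circle = conic_value(conic, 4.0, 6.0)         # (4, 6) is 5 units from (1, 2)
```

With this sign convention the polynomial is negative inside the circle and positive outside, which gives a simple inside test in addition to the on-curve test.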


A  composite-curve  is  an  ordered  list  of  the  basic  arc 
types,  namely:  point,  line,  circle,  conic,  or  spline.  If 
the composite-curve has N arcs then the added constraint
is that the i-th arc’s endpoint is the same as
the (i+1)-th arc’s startpoint, for i = 1, ..., N-1. A digital-arc-chain is a
composite-curve. A spline is also a composite-curve.
The main difference between the Composite-Curve and
the Linked-Curve-Segment described in the Geometric-features
section is that a Composite-Curve is connected,
whereas  the  Linked-Curve-Segment  need  not  necessarily 
be  connected.  In  other  words,  a  Linked-Curve-Segment 
may  just  be  a  linked  list  of  curve  segments. 
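
The connectivity constraint that separates a composite-curve from a linked list of segments can be checked in a few lines. The sketch below is illustrative only: the `Arc` class reduces every arc type to its start and end points, and the name `is_composite_curve` and the tolerance parameter are assumptions made for the example.

```python
from dataclasses import dataclass
from typing import List, Tuple

Point2 = Tuple[float, float]


@dataclass
class Arc:
    """Any basic arc reduced to its start and end points (hypothetical)."""
    kind: str        # e.g. "line", "circle", "conic", "spline"
    start: Point2
    end: Point2


def is_composite_curve(arcs: List[Arc], tol: float = 1e-9) -> bool:
    """True if every arc ends where the next arc in the list starts."""
    return all(
        abs(a.end[0] - b.start[0]) <= tol and abs(a.end[1] - b.start[1]) <= tol
        for a, b in zip(arcs, arcs[1:])
    )


connected = [
    Arc("line", (0, 0), (1, 0)),
    Arc("circle", (1, 0), (2, 1)),
    Arc("line", (2, 1), (2, 3)),
]
linked_only = [
    Arc("line", (0, 0), (1, 0)),
    Arc("line", (5, 5), (6, 5)),   # does not start where the previous arc ends
]
```

The first list satisfies the constraint and could be stored as a composite-curve. The second list is still a valid linked list of segments but is not connected.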
12  Surfaces

12.1  Scope 

A  Surface  is  a  general  class  for  describing  piecewise  two 
dimensional  manifolds  and  different  discrete  samplings  of 
them.  Surfaces  are  used  in  computer  vision  for  describ¬ 
ing  models,  for  the  characterization  of  image  intensity  ar¬ 
eas,  for  fitting  parameterized  functions  to  data,  and  for 
reasoning  about  visibility  and  geometric  relationships. 
Surfaces can be bounded or unbounded. The curve
class  is  used  to  describe  the  boundaries  of  bounded  sur¬ 
faces. 

12.2  Conceptual  Description 

A  partial  hierarchy  for  surface  geometries  is  shown  in 
figure  11. 

Surfaces  are  broken  into  a  few  general  classes.  A  sim¬ 
ple  (or  elementary)  surface  can  be  described  by  an  ana¬ 
lytic  or  parameterized  expression.  Examples  are  planes, 
spheres, and quadrics. Bounded instances of the simple
surfaces are specified by range constraints on their
parameters and topological representations such as the
face (see the section on solids). Bounded versions of parametric
surface patches can be specified by regions in the
UV-parameter plane. The UV-parameter plane can be
treated as an image for associating surface registered
properties. Ribbons are treated as a kind of composite
curve based upon a relation between an axis curve
and a sweeping curve.

Composite  surfaces  are  formed  by  connecting  sim¬ 
ple  surfaces  together.  The  general  composite  surface 
uses  a  relational  network  which  describes  the  underly¬ 
ing  topological  relations  between  vertices,  edges,  chains, 
and  faces.  Composite  surfaces  can  be  made  out  of  su¬ 
perquadrics,  polygonal  patches,  and  the  different  types 
of parametric patches. Currently, composite surfaces are
only made from patches of the same type. The different
types  of  composite  surfaces  specialize  the  general  rela¬ 
tional  network  to  describe  different  blending  relations 
between  patches. 

Discrete  surfaces  are  described  by  binary  3D-arrays 
with  values  indicating  which  positions  are  occupied  in 
the  minimum  bounding  rectangular  prism  containing  the 
surface.  This  array  also  has  a  specified  position  with  re¬ 
spect  to  a  coordinate  system.  Discrete  surfaces  can  be 
used  for  rapid  determination  of  intersections,  and  search¬ 
ing  operations. 
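
A minimal sketch of such a discrete surface and of an intersection test is given below. It assumes integer grid coordinates, uses nested Python lists in place of a packed binary 3D array, and the names `DiscreteSurface`, `occupied`, and `intersects` are hypothetical.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass
class DiscreteSurface:
    """Binary occupancy array plus the grid position of its [0][0][0] cell."""
    origin: Tuple[int, int, int]
    cells: List[List[List[int]]]     # cells[i][j][k] is 1 if occupied

    def shape(self):
        return (len(self.cells), len(self.cells[0]), len(self.cells[0][0]))

    def occupied(self, x, y, z):
        """True if the global grid position (x, y, z) is occupied."""
        i, j, k = x - self.origin[0], y - self.origin[1], z - self.origin[2]
        ni, nj, nk = self.shape()
        inside = 0 <= i < ni and 0 <= j < nj and 0 <= k < nk
        return inside and self.cells[i][j][k] == 1


def intersects(a, b):
    """True if the two discrete surfaces share at least one occupied cell."""
    ni, nj, nk = a.shape()
    ox, oy, oz = a.origin
    for i in range(ni):
        for j in range(nj):
            for k in range(nk):
                if a.cells[i][j][k] and b.occupied(i + ox, j + oy, k + oz):
                    return True
    return False


# A 2x2x1 horizontal patch at height 1 and a 1x1x3 vertical column through it.
patch = DiscreteSurface((0, 0, 1), [[[1], [1]], [[1], [1]]])
column = DiscreteSurface((1, 1, 0), [[[1, 1, 1]]])
far_column = DiscreteSurface((5, 5, 0), [[[1, 1, 1]]])
```

Because each array covers only the bounding prism of its surface, the loop visits a small number of cells, and cells of one surface that fall outside the other surface's prism are rejected by the bounds check alone.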

The  methods  associated  with  surfaces  are  similar  to 
those  found  with  other  objects  such  as  curves  and  vol¬ 
umes:  methods  for  geometric  operations,  for  combining 
surfaces,  and  for  mappings  between  surfaces  and  other 
IUE spatial objects. There are creation methods for
instantiating  surface  objects.  Among  these  are  fitting 
methods  for  instantiating  a  surface  to  a  discrete  point 
set,  such  as  fitting  a  plane,  a  sphere,  or  a  polygonal 
mesh  to  a  discrete  3D  point  set.  There  are  also  sampling 
methods  for  creating  a  discrete  point  set  of  positions  on 
a  surface  such  as  regular  positions  on  a  plane  or  a  sphere. 
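
As an example of a fitting method paired with a sampling method, the sketch below samples regular positions on a plane and then recovers the plane from the samples. It is an assumption-laden illustration: the plane is written as z = a*x + b*y + c (so vertical planes cannot be represented), the fit is an ordinary least-squares fit that minimizes error in z only rather than orthogonal distance, and the function names are hypothetical.

```python
def det3(m):
    """Determinant of a 3x3 matrix given as nested lists."""
    return (m[0][0] * (m[1][1] * m[2][2] - m[1][2] * m[2][1])
            - m[0][1] * (m[1][0] * m[2][2] - m[1][2] * m[2][0])
            + m[0][2] * (m[1][0] * m[2][1] - m[1][1] * m[2][0]))


def sample_plane(a, b, c, xs, ys):
    """Regularly sampled positions on the plane z = a*x + b*y + c."""
    return [(float(x), float(y), a * x + b * y + c) for x in xs for y in ys]


def fit_plane(points):
    """Least-squares fit of z = a*x + b*y + c; returns (a, b, c)."""
    n = float(len(points))
    sx = sum(p[0] for p in points)
    sy = sum(p[1] for p in points)
    sz = sum(p[2] for p in points)
    sxx = sum(p[0] * p[0] for p in points)
    syy = sum(p[1] * p[1] for p in points)
    sxy = sum(p[0] * p[1] for p in points)
    sxz = sum(p[0] * p[2] for p in points)
    syz = sum(p[1] * p[2] for p in points)
    # Normal equations of the least-squares problem, solved by Cramer's rule.
    normal = [[sxx, sxy, sx], [sxy, syy, sy], [sx, sy, n]]
    rhs = [sxz, syz, sz]
    d = det3(normal)
    if abs(d) < 1e-12:
        raise ValueError("points are degenerate for a z = a*x + b*y + c fit")
    coeffs = []
    for col in range(3):
        m = [row[:] for row in normal]
        for r in range(3):
            m[r][col] = rhs[r]
        coeffs.append(det3(m) / d)
    return tuple(coeffs)


samples = sample_plane(2.0, -1.0, 0.5, range(3), range(3))
a, b, c = fit_plane(samples)
```

Fitting the noise-free samples returns the parameters used to generate them, up to floating point error. Collinear input points make the normal equations singular and are reported as an error.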

In some instances, the same surface can be described
in multiple ways: a cylinder can be defined as a ruled
parametric patch or by a simple parametric expression.
If desired, a user can create a new class which inherits
from both descriptions. There are significant issues if
the attributes of one type of description are changed.

12.3  Extensions 

There is a strong similarity between the required IUE
functionality for curves, surfaces, and volumes and what
is supported in different graphics and CAD packages
(RenderMan, PHIGS, the ESPRIT specification of a
Neutral File for CAD geometry, and others). There are
also strong differences; in computer vision the emphasis
is not on realistic scene generation but on such things as
generating a predicted segmentation and access to objects
and their relative depth ordering to reason about
visibility and lighting effects. We are interested in a deep
linkage with existing and developing standard graphics
packages so they can be used for geometric reasoning and
to generate IUE objects in addition to realistically rendering
an image of a scene. For example, when a surface
is displayed, we’d like to also get image registered curve
and  junction  objects  generated  as  a  predicted  segmenta¬ 
tion.  This  capability  would  be  very  useful  for  producing 
data  for  testing  and  image-driven  interactive  creation  of 
models. 

13  Solids 

13.1  Scope 

Many Image Understanding techniques depend on some
description of volumes of material in space. At the
most abstract level, a 3D model can simply be space occupancy.
At that level, the appropriate type of operation
is to ask whether a point in space is filled or empty
and, if filled, what kind of material is at that point. One
can even define display operations at this level without
any commitment to a particular specific data structure.
It is also possible at this level to describe simple geometric
and topological properties:

•  Connected  Components 

•  Symmetries 

•  Moments  or  other  distributions  of  matter 

•  Intersection of Volumes (CSG Ops)

At  a  similar  level  of  abstraction  we  can  define  the 
points  in  space  which  lie  on  the  interface  between  differ¬ 
ent  materials  and  temporal  behaviors.  Again  we  don’t 
commit  at  this  level  to  a  specific  boundary  representa¬ 
tion,  but  recognize  that  the  concept  of  object  surface  is 
central  to  vision  representations.  At  this  abstract  level 
it is possible to define many generic properties and operations.
For example:

•  Surface Geometry (normal, tangent plane, curvature,
...)

•  Surface  Properties  (reflectance,  roughness,  ...) 

•  Surface  Intersections 

At this level of abstraction we can introduce the common
notions of topology without consideration of a specific
geometric representation. For example, the 2-cycle,
which is a closed network of surface regions, does not make
any commitment to a specific representation but merely
describes the closure of space.

Figure 11: A partial hierarchy for surfaces.

13.2  Conceptual  Description 

These general principles are followed in the
class hierarchy below so that many different types of 3D model
representations  can  be  accommodated  and  still  permit 
a  relatively  small  set  of  generic  methods  to  encompass 
most  of  the  operations  required  for  vision  research. 

The class diagram in Figure 12 shows the most
abstract classes for a number of 3D objects. In three
dimensional space it is possible to have objects which
are described by zero, one, two, or three parameters, i.e.
points, curves, surfaces and volumes. Parametric volume
representations often arise in medical image analysis associated
with tomographic image modalities such as x-ray
tomography and magnetic resonance. For example,
one could define a tissue volume with a distribution of x-ray
density described by spherical harmonics. However,
we will not detail these types of three parameter volume
descriptions in the current hierarchy since most current
vision research focuses on surface representations or on
volumes which enclose uniform solid material. Also, coordinate
frames and other auxiliary structures for establishing
spatial relations in 3D space are covered in other
sections of this document and will not be discussed further
here. Figure 12 defines two basic object forms,
the volume and boundary representations. In the figure,
the boundary representation is shown in more detail.
The standard topological representation is defined
by a series of classes as follows:


•  Vertex  -  A  zero  dimensional  topological  element 
which defines a point in space where two or more
entities  are  incident. 

•  Edge  -  A  one  dimensional  topological  element  which 
defines  a  bounded  curve.  The  curve  is  bounded  by 
a  vertex  at  each  end.  A  closed  curve  is  assumed  to 
have  an  embedded  vertex  to  define  the  boundary. 
Note  that  any  type  of  curve  can  be  used  to  represent 
the  geometry  of  an  edge. 

•  1-Cycle  -  A  closed  sequence  of  edges  with  a  common 
vertex  bounding  adjacent  edges.  A  related  notion 
is  often  introduced  called  the  1-Chain  which  is  a 
sequence  of  edges,  not  necessarily  closed. 

•  Face  -  A  two  dimensional  surface  which  is  bounded 
by  one  or  more  1-Cycles.  A  multiply  connected  face 
requires  an  outer  bounding  1-cycle  and  a  number  of 
interior 1-cycles. The surface enclosed by the face
can also be any surface representation. If the surface
is intrinsically closed, such as a sphere, it is assumed
that  a  vertex  boundary  is  embedded  in  the  surface 
in  analogy  to  the  closed  edge  loop. 

•  2-Cycle - A closed sequence of faces. Here faces are
joined  by  a  common  edge.  Each  edge  is  adjacent  to 
exactly  two  faces. 

•  Block  -  The  block  is  an  enclosed  volume  of  space.  A 
block  can  be  multiply  connected  and  thus  the  block 
boundary  is  a  set  of  2-Cycles  in  a  similar  way  to  the 
multiply  connected  face. 

•  Object  -  Finally,  an  object  is  a  collection  of  blocks 
which  are  joined  at  faces. 
Figure 12: The top level hierarchy for 3D solids


The  representation  is  topologically  complete  and  can 
serve  as  a  medium  of  exchange  among  various  boundary 
representations.  A  major  problem  in  exchanging  bound¬ 
ary  descriptions  is  the  unambiguous  recovery  of  the  cor¬ 
rect  topology  for  the  surface.  There  are  many  choices 
for  a  complete  description,  such  as  the  winged-edge  for¬ 
mat,  but  the  proposed  representation  is  convenient  for 
carrying  out  boolean  operations  and  for  constructing  a 
boundary  description  from  a  wire  frame  representation. 
The chain data structures correspond directly to classically
defined topology, and chain algebra is a well-developed
approach to topological operations. This topological
description is presented in more detail in the paper
by Wesley and Markowsky.¹
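
The topological classes listed above can be sketched as plain data structures together with the two closure conditions stated in the list: adjacent edges of a 1-cycle share a vertex, and each edge of a 2-cycle is adjacent to exactly two faces. The Python class names, the `edge` helper, and the `is_closed` methods are hypothetical. The sketch carries no geometry and ignores closed edges with an embedded vertex.

```python
from collections import Counter
from dataclasses import dataclass, field
from itertools import combinations
from typing import List


@dataclass(frozen=True)
class Vertex:
    name: str


@dataclass(frozen=True)
class Edge:
    ends: frozenset          # the two bounding vertices


def edge(a, b):
    """Undirected edge between two vertices (hypothetical helper)."""
    return Edge(frozenset((a, b)))


@dataclass
class OneCycle:
    edges: List[Edge]

    def is_closed(self):
        # Consecutive edges, including last and first, must share a vertex.
        n = len(self.edges)
        return n >= 2 and all(
            self.edges[i].ends & self.edges[(i + 1) % n].ends for i in range(n))


@dataclass
class Face:
    outer: OneCycle
    holes: List[OneCycle] = field(default_factory=list)

    def edges(self):
        return [e for cyc in [self.outer, *self.holes] for e in cyc.edges]


@dataclass
class TwoCycle:
    faces: List[Face]

    def is_closed(self):
        # Each edge must be adjacent to exactly two faces.
        counts = Counter(e for f in self.faces for e in f.edges())
        return bool(counts) and all(c == 2 for c in counts.values())


# A tetrahedron: four triangular faces over four vertices.
vs = [Vertex(n) for n in "abcd"]
faces = [Face(OneCycle([edge(p, q), edge(q, r), edge(r, p)]))
         for p, q, r in combinations(vs, 3)]
shell = TwoCycle(faces)
open_shell = TwoCycle(faces[:3])     # one face removed
```

The full tetrahedron passes the 2-cycle test, while the shell with a face removed fails because three of its edges are adjacent to only one face.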

Constructive  Solid  Geometry  The  other  major 
solid  representation  is  in  terms  of  solid  primitives  which 
can  be  considered  blocks  from  the  boundary  point  of 
view.  These  primitives  are  combined  by  attachment  and 
boolean  intersection  operations  to  form  composite  ob¬ 
jects.  The  resulting  composite  object  is  described  by  a 
tree  where  the  nodes  of  the  tree  represent  various  prim¬ 
itives  and  partial  constructions  and  the  arcs  of  the  tree 
represent  boolean  or  attachment  operations.  This  de¬ 
scription  is  referred  to  as  the  Constructive  Solid  Geom¬ 
etry,  or  CSG,  representation. 

¹ Markowsky, G., Wesley, M.A. “Fleshing Out Wire Frames,”
IBM Journal of Research and Development 24 (5), 1980.

Two examples of volume primitives used in current
vision research are the superquadric and the generalized
cylinder. The superquadric is a generalization of the
quadric surface obtained by providing variable exponent values
on the quadric terms. The generalized cylinder is a major
representational approach where the primitives are
defined by an axis, which can be a general space curve, and
a  sweeping  rule  which  defines  the  variation  of  the  object 
cross  section  along  the  axis.  For  example,  a  cone  is  a 
generalized  cylinder  with  a  circular  cross  section  and  a 
linear  sweeping  rule  along  a  straight  line  axis. 

A  specific  object  is  generated  by  combining  these 
primitives in a CSG tree. This representation requires
that boolean intersection operations be defined for each
primitive, and it allows quite complex objects to be defined
by a short description. For example, a cylinder with
a  hole  can  be  defined  as  the  subtraction  of  one  cylinder 
from  another.  Usually  each  primitive  is  associated  with 
a  bounding  box  (rectangular  prism)  which  facilitates  ef¬ 
ficient  checking  for  the  possibility  of  intersection. 
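
The cylinder-with-a-hole example and the role of the bounding box can be illustrated with a small point-membership sketch. The classes `Cylinder` and `CSGNode` are hypothetical, the cylinders are restricted to the z axis to keep the example short, and the operation is stored in the interior node rather than on the tree arcs.

```python
from dataclasses import dataclass


@dataclass
class Cylinder:
    """Solid cylinder along the z axis, centred on the origin in x and y."""
    radius: float
    z_min: float
    z_max: float

    def bounding_box(self):
        r = self.radius
        return ((-r, -r, self.z_min), (r, r, self.z_max))

    def contains(self, p):
        x, y, z = p
        return x * x + y * y <= self.radius ** 2 and self.z_min <= z <= self.z_max


@dataclass
class CSGNode:
    """Interior node of a CSG tree: 'union', 'intersection' or 'difference'."""
    op: str
    left: object
    right: object

    def bounding_box(self):
        (al, ah), (bl, bh) = self.left.bounding_box(), self.right.bounding_box()
        if self.op == "difference":
            return (al, ah)      # a difference never extends beyond its left operand
        if self.op == "intersection":
            return (tuple(map(max, al, bl)), tuple(map(min, ah, bh)))
        return (tuple(map(min, al, bl)), tuple(map(max, ah, bh)))

    def contains(self, p):
        lo, hi = self.bounding_box()
        if not all(l <= c <= h for l, c, h in zip(lo, p, hi)):
            return False         # cheap rejection using the bounding box
        a, b = self.left.contains(p), self.right.contains(p)
        if self.op == "union":
            return a or b
        if self.op == "intersection":
            return a and b
        if self.op == "difference":
            return a and not b
        raise ValueError("unknown CSG operation: " + self.op)


# A cylinder with a hole: the subtraction of one cylinder from another.
pipe = CSGNode("difference", Cylinder(2.0, 0.0, 5.0), Cylinder(1.0, -1.0, 6.0))
in_wall = pipe.contains((1.5, 0.0, 2.5))
in_hole = pipe.contains((0.5, 0.0, 2.5))
```

Points outside the bounding box are rejected before any primitive is evaluated, which is the efficiency gain the bounding box is meant to provide.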

13.3  Extensions 

A standard exists called the Initial Graphics Exchange
Specification, or IGES, for exchanging geometric descriptions.
The standard is supported by the U.S. Department
of Commerce, and the initial purpose of IGES was to exchange
drafting information between various CAD vendor
systems. IGES does provide standards for the representation
of some solid primitives and a partial treatment
of a solid boundary representation. The CSG representation
is more complete than the boundary representation.
Post-fix and infix notations are provided for
the CSG tree operations: union, intersection, and difference.
The solid primitives supported by IGES are as
follows:

•  Rectangular Parallelepiped

•  Right  Angular  Wedge 

•  Right  Circular  Cylinder 

•  Right  Circular  Cone  [May  be  truncated] 

•  Sphere 

•  Torus 

•  Solid  of  Revolution 

•  Extruded  Solid 

•  Ellipsoid 

The IUE committee is currently debating the relationship
between IGES and the 3D solid standards needed
by the IUE.

14  Coordinate  Frames 

14.1  Scope 

The geometric relationship between sensors and scenes,
among physical objects, and between pixels and the
world has been a core component of the science of image
understanding since its inception. Nearly every IU
system makes use of coordinate systems and transforms
either implicitly or explicitly. The multitude of representations
that have been devised, some of which incorporate
arbitrary conventions, has been a key obstacle
precluding the transfer and sharing of code and results.
In this section we describe the representations of coordinate
systems and transforms that are to be provided by
the IUE. It is expected that these constructs will be employed
in many areas of the IUE, such as image features,
3D models, and sensor objects.

14.2  Conceptual  Description 

We begin with definitions of our terminology:

Coordinate System: A coordinate space, in the mathematical
sense. It is represented in the IU environment
by an instance of a coordinate system class.

Coordinate: The coordinate(s) of a point are represented
by a series of numbers, and are implicitly
associated with a coordinate system.

Coordinate Transform: A specification of a mapping
between two coordinate spaces. It is represented in
the IU environment by an instance of a coordinate
transform class.

Conceptually,  the  relation  between  coordinate  systems 
and  coordinate  transforms  can  be  expressed  by  a  directed 
graph  in  which  a  coordinate  system  is  represented  by  a 
node  and  a  coordinate  transform  is  represented  by  a  di¬ 
rected  arc  between  two  nodes.  Both  coordinate  systems 
and  coordinate  transforms  are  represented  by  instances 
of object classes. The classes of transforms that can relate
two coordinate systems are governed by the classes of
those coordinate systems.

Figure 13 shows the class hierarchy for coordinate systems.
The leaves of this tree indicate the classes that can
be instantiated by an IUE user to specify the coordinate
space that the user chooses to work in. The IUE programmer can
extend the hierarchy by adding new class definitions to
allow use of other coordinate systems. As can be seen in
the figure, both Cartesian and non-Cartesian coordinate
systems are supported, including a number of coordinate
systems commonly used for geographic purposes (such as
UTM and Geodetic).

These coordinate systems will contain slots appropriate
to their class, the most important ones being:

•  Dimension  —  The  dimensionality  of  the  coordinate 
space. 

•  Related-coordinate-systems  —  A  data  structure 
that specifies the coordinate transforms that have
been  defined  for  mapping  coordinates  in  this  system 
to  other  coordinate  systems.  Collectively,  these  con¬ 
stitute  the  coordinate  transform  graph  that  relates 
all  coordinate  systems. 

A coordinate system object supports a number of commonly
useful methods, including:

•  Find-transform-to  —  This  method  returns  a  trans¬ 
form  that  can  be  used  to  map  coordinates  from  this 
coordinate  system  to  a  second  coordinate  system, 
which  is  passed  as  an  argument. 

•  Find-transform-from — This method returns a
transform that can be used to map coordinates from
the coordinate system passed as an argument to this
coordinate system.

•  Translate, rotate, scale, ... — A collection of methods
that allow the relation between coordinate systems
to be modified, using operations appropriate
to  the  class  of  the  coordinate  system. 

Figure 14 depicts the class hierarchy for the coordinate
transforms that can be used to relate coordinate
systems within the IUE. This hierarchy includes common
matrix transforms, a collection of transforms for
mapping  to  geographic  coordinate  systems,  and  other 
specialized  transforms  such  as  those  represented  inter¬ 
nally  by  quaternions  and  by  analytical  functions. 

Slots that are defined for all classes of coordinate
transforms include:

•  From-coordinate-system — The domain for the
mapping  specified  by  this  transform. 

•  To-coordinate-system  —  The  range  for  the  mapping 
specified  by  this  transform. 

•  Inverse  transform  —  A  transform  expressing  the  in¬ 
verse  mapping  (optional). 

All coordinate transform classes support at least the
following methods:

•  Transform-point — The method that actually computes
the coordinates in the to-coordinate-system,
given a set of coordinates in the from-coordinate-system.

•  Compose-transforms  —  A  method  which  collapses 
a  sequence  of  several  transforms  into  a  single  one. 
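
The slots and methods listed above can be sketched for one member of the hierarchy, a homogeneous matrix transform. This is an illustration under stated assumptions, not the IUE class definition: the class name `MatrixTransform`, the use of strings to name coordinate systems, and the free function `compose_transforms` are hypothetical, and the example is limited to 2D points with 3x3 matrices.

```python
from dataclasses import dataclass
from typing import List, Optional

Matrix = List[List[float]]


def matmul(a: Matrix, b: Matrix) -> Matrix:
    """Plain matrix product of two nested-list matrices."""
    return [[sum(a[i][k] * b[k][j] for k in range(len(b)))
             for j in range(len(b[0]))] for i in range(len(a))]


@dataclass
class MatrixTransform:
    """Homogeneous matrix transform between two named coordinate systems."""
    from_cs: str
    to_cs: str
    matrix: Matrix
    inverse: Optional["MatrixTransform"] = None    # optional inverse mapping

    def transform_point(self, coords):
        column = [[c] for c in coords] + [[1.0]]   # homogeneous column vector
        out = matmul(self.matrix, column)
        w = out[-1][0]
        return tuple(row[0] / w for row in out[:-1])


def compose_transforms(transforms):
    """Collapse a sequence applied first to last into a single transform."""
    first, *rest = transforms
    m, prev = first.matrix, first
    for t in rest:
        if t.from_cs != prev.to_cs:
            raise ValueError("transforms do not chain")
        m = matmul(t.matrix, m)
        prev = t
    return MatrixTransform(first.from_cs, prev.to_cs, m)


scale = MatrixTransform("image", "film", [[0.5, 0, 0], [0, 0.5, 0], [0, 0, 1]])
shift = MatrixTransform("film", "local", [[1, 0, 10], [0, 1, -2], [0, 0, 1]])
image_to_local = compose_transforms([scale, shift])
direct = image_to_local.transform_point((4.0, 6.0))
stepwise = shift.transform_point(scale.transform_point((4.0, 6.0)))
```

The composed transform gives the same result as applying the two transforms in turn, and it records the domain of the first transform and the range of the last as its own.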
Figure 13: Class Hierarchy for Coordinate Systems

14.3  Applications

It is anticipated that coordinate systems and transforms
will be used to relate all geometric objects within the
IUE. An example of how these might be used is depicted
in  Figure  15.  Coordinates  in  any  two  coordinate  systems 
can  be  related  by  composing  the  transforms  found  by 
traversing  the  graph. 

For  example,  the  UTM  coordinates  (CS-5)  of  a  point 
can  be  converted  to  latitude-longitude  (CS-4)  using 
transform  CT-4.  These  coordinates  can,  in  turn,  be  re¬ 
lated  to  a  local  Cartesian  coordinate  system  (CS-3)  and 
then  to  either  film  (CS-2)  or  image  (CS-1)  coordinates. 
Other  coordinate  systems  can  be  related  to  this  network 
using  additional  transforms  as  exemplified  in  Figure  15. 
The IUE will provide the mechanisms to move freely between
coordinate systems by traversing this network.
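
The graph traversal described above can be sketched with a breadth-first search that collects the transforms along a path and composes them. Everything in the sketch is an assumption made for illustration: the class and method names are hypothetical Python renderings of the slots and methods named earlier, transforms are plain callables, and the arithmetic in the example transforms is placeholder arithmetic, not real UTM or geodetic conversion.

```python
from collections import deque


class CoordinateSystem:
    """Node of the coordinate transform graph (hypothetical sketch)."""

    def __init__(self, name, dimension):
        self.name = name
        self.dimension = dimension
        self.related = {}     # target CoordinateSystem -> callable on coordinates

    def add_transform(self, target, func):
        self.related[target] = func

    def find_transform_to(self, target):
        """Breadth-first search; returns a composed callable or None."""
        queue = deque([(self, [])])
        seen = {self}
        while queue:
            cs, path = queue.popleft()
            if cs is target:
                def composed(coords, path=tuple(path)):
                    for f in path:
                        coords = f(coords)
                    return coords
                return composed
            for nxt, f in cs.related.items():
                if nxt not in seen:
                    seen.add(nxt)
                    queue.append((nxt, path + [f]))
        return None


utm = CoordinateSystem("CS-5 (UTM)", 2)
latlon = CoordinateSystem("CS-4 (latitude-longitude)", 2)
local = CoordinateSystem("CS-3 (local Cartesian)", 3)
image = CoordinateSystem("CS-1 (image)", 2)

# Placeholder arithmetic only; these are NOT real map projections.
utm.add_transform(latlon, lambda c: (c[0] / 1000.0, c[1] / 1000.0))
latlon.add_transform(local, lambda c: (c[0] - 1.0, c[1] - 2.0, 0.0))
local.add_transform(image, lambda c: (c[0] * 10.0, c[1] * 10.0))

to_image = utm.find_transform_to(image)
pixel = to_image((3000.0, 5000.0))
```

Because no inverse transforms were registered in this example, a search from the image system back to the UTM system finds no path and returns `None`.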

14.4  Extensions 

Speed  of  execution  is  essential  if  coordinate  transforma¬ 
tions  are  to  be  useful  within  a  computer  vision  system. 
While  our  main  emphasis  has  been  to  provide  for  a  gen¬ 
eral  and  extensible  collection  of  coordinate  systems  and 
transforms, it is necessary to ensure that the requisite
computations can be carried out swiftly. Several mechanisms
to facilitate this will be incorporated as the IUE
design is fleshed out:

•  Composition  and  caching  of  long  chains  of  coordi¬ 
nate  transforms  into  a  single  transform.  Additional 
machinery  must  be  provided  to  keep  the  coordinate 
transform  graph  consistent  when  multiple  paths  ex¬ 
ist. 

•  Access  to  hardware  accelerators.  Additional  coor¬ 
dinate  transform  classes  can  be  defined  that  allow 
access  to  computations  performed  in  hardware.  The 
idiosyncrasies  of  their  use  should  be  largely  hidden 
by  the  specification  of  appropriate  coordinate  trans¬ 
form  classes. 

15  Constraints 

15.1  Scope 

The  use  of  constraints  in  Image  Understanding  research 
is  quite  diverse  and  in  many  ways  the  proper  treatment 
of general constraints impinges on the general AI problem.
The following summarizes some of the ways in which
constraints can be defined and used.

Label  Assignment  The  label  assignment  problem 
can  be  viewed  as  a  constraint  satisfaction  problem  and  be 
solved by heuristic search methods. The idea is to consider
a partial label assignment as a node of a search
graph, where the cost assigned to a node is related to the
degree to which it satisfies the constraints between
currently assigned labels. Recent work in generic object
recognition (Levine et al.) using GEONS applies this type
of constraint processing.
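
The view of a partial label assignment as a search node can be illustrated with a small depth-first backtracking search. This is a simplification of the heuristic search described above: it has no cost function and simply rejects any extension that violates a constraint with an already assigned label. The function names, the region names, and the adjacency constraint are hypothetical.

```python
def assign_labels(objects, labels, compatible, partial=None):
    """Depth-first search over partial label assignments; returns a dict or None."""
    partial = {} if partial is None else partial
    unassigned = [o for o in objects if o not in partial]
    if not unassigned:
        return dict(partial)
    obj = unassigned[0]
    for label in labels:
        # Extend this node only if the new label is consistent with
        # every label assigned so far.
        if all(compatible(obj, label, other, partial[other]) for other in partial):
            partial[obj] = label
            result = assign_labels(objects, labels, compatible, partial)
            if result is not None:
                return result
            del partial[obj]        # backtrack
    return None


# Three mutually adjacent image regions; adjacent regions need different labels.
regions = ["r1", "r2", "r3"]
adjacent = {frozenset(("r1", "r2")), frozenset(("r2", "r3")), frozenset(("r1", "r3"))}


def differ_if_adjacent(a, label_a, b, label_b):
    return label_a != label_b if frozenset((a, b)) in adjacent else True


solution = assign_labels(regions, ["sky", "tree", "road"], differ_if_adjacent)
no_solution = assign_labels(regions, ["sky", "tree"], differ_if_adjacent)
```

With three labels the search succeeds on its first descent. With only two labels every branch is eventually rejected and the search reports that no consistent assignment exists.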

Constraint  Propagation  Networks  The  represen¬ 
tation  of  constraints  as  a  network  has  appeared  most  of¬ 
ten  in  neural  network  research.  A  typical  example  is  the 
Hopfield net where nodes have a particular input/output
constraint. In the case of the Hopfield net, the output of
a  node  is  activated  if  the  balance  of  the  weighted  inputs 
is  positive.  Such  networks  are  often  solved  by  relaxation 
methods  where  node  states  are  adjusted  until  a  consis¬ 
tent  overall  network  state  is  achieved.  Another  approach 
to  such  networks  is  the  use  of  integer  linear  programming 
methods. 
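
The relaxation procedure can be sketched for a small Hopfield-style network. The sketch assumes node states of +1 (activated) and -1, symmetric weights with a zero diagonal, and asynchronous updates in index order; the function name `relax` and the Hebbian construction of the example weights are choices made for the illustration.

```python
def relax(weights, state, max_sweeps=100):
    """Asynchronous relaxation until no node changes state."""
    state = list(state)
    n = len(state)
    for _ in range(max_sweeps):
        changed = False
        for i in range(n):
            # Balance of the weighted inputs to node i.
            total = sum(weights[i][j] * state[j] for j in range(n) if j != i)
            new = 1 if total > 0 else -1      # activated only if the balance is positive
            if new != state[i]:
                state[i] = new
                changed = True
        if not changed:
            return state                      # consistent overall network state
    return state


# Weights that store a single pattern (outer product with zero diagonal).
pattern = [1, -1, 1, -1]
weights = [[0 if i == j else pattern[i] * pattern[j] for j in range(4)]
           for i in range(4)]

recovered = relax(weights, [1, -1, 1, 1])     # last node starts in the wrong state
```

Starting from a state with one corrupted node, a single sweep corrects that node and a second sweep confirms that no further changes are needed.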

Geometric  Constraints  Perhaps  the  most  central 
use of constraints in IU research to date is the application
of geometric constraints. A significant example is
the  ACRONYM  system  which  represents  object  models 
in  terms  of  generalized  cylinder  components.  The  con¬ 
figuration  and  shape  of  the  components  is  restricted  by 
a  set  of  geometric  constraints  which  are  defined  for  a 
particular  generic  object  class  such  as  an  aircraft.  Con¬ 
straints  can  also  be  applied  to  camera  viewpoint.  In 
ACRONYM  these  constraints  are  defined  as  symbolic 
inequalities  and  solved  using  a  symbolic  rewriting  algo¬ 
rithm  called  SUP-INF. 

The advantage of such use of constraints is that known
relations which must exist between entities can be
realized, even in the presence of empirical measurement error
or numerical imprecision in forming the model instance.
For example, a measured angle between lines might be
89.6° but it is risky to conclude that the lines are
intended to be perpendicular. The symbolic constraint
relations are always directly specified in the model
representation and do not have to be inferred from the
imprecise geometry of a model instance. A final advantage
is that a wide range of geometric shapes can be expressed
by a relatively compact constraint description.

15.2  Conceptual  Design 

The following object designs are primarily focussed on
geometric constraints but the approach appears to be
generalizable to handle the other examples just
discussed. There are three basic object classes:
symbolic-entities, constraint-relations and the constraint-system.
These objects are related as illustrated in figure 16.

The symbolic-entity is a symbolic form of an IU
class such as a plane, transform, or curve. The key
parameters or attributes of the class are represented as
symbols rather than fixed numbers. For example, the
orientation of the surface normal of a plane can be
represented as three variable symbols. There also may be
equations which are part of the definition of a
particular IU class. For example, an ellipse is defined by
$Ax^2 + Bxy + Cy^2 + Dx + Ey + F = 0$. The symbolic
entity also contains pointers to the constraint relations in
which it takes part. In general the symbolic-entity
hierarchy will mirror the non-symbolic class hierarchy. This
duplication seems desirable so that more conventional
applications do not have to carry around the baggage of
constraint representations.

The constraint-relation is also a symbolic structure
which relates entities in a network structure. It is not
unreasonable to consider a symbolic-entity as a unary
constraint-relation, but I have chosen to keep them
distinct for clarity. The constraint relation provides
pointers to the related entities as well as the equations
involved in defining the constraint. For example, if two
vectors $V_1$ and $V_2$ are constrained to be perpendicular
then the constraint equation $V_1 \cdot V_2 = 0$ is constructed
by the corresponding constraint-relation class instance.

Figure 15: Example of possible relationships among coordinate
system instances and coordinate transform instances
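
To make the relationship between the first two classes concrete, here is a minimal Python sketch. The class names `SymbolicVector` and `Perpendicular`, the symbol naming scheme, and the `residual` method are assumptions introduced for illustration and are not part of the IUE specification.

```python
class SymbolicVector:
    """Symbolic-entity: a vector whose components are named symbols."""

    def __init__(self, name, dim=3):
        self.name = name
        self.symbols = [f"{name}_{axis}" for axis in "xyz"[:dim]]
        self.relations = []      # constraint-relations this entity takes part in


class Perpendicular:
    """Constraint-relation asserting that V1 . V2 = 0."""

    def __init__(self, v1, v2):
        self.entities = (v1, v2)         # pointers to the related entities
        v1.relations.append(self)        # and back-pointers from each entity
        v2.relations.append(self)

    def equation(self):
        """Return the constraint equation as a string of symbols."""
        v1, v2 = self.entities
        terms = [f"{a}*{b}" for a, b in zip(v1.symbols, v2.symbols)]
        return " + ".join(terms) + " = 0"

    def residual(self, values):
        """Evaluate the left-hand side for a dict symbol -> number."""
        v1, v2 = self.entities
        return sum(values[a] * values[b]
                   for a, b in zip(v1.symbols, v2.symbols))
```

A constraint-system could then walk the `relations` lists to gather every equation and symbol before handing them to a solver.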

The constraint-system consists of a constraint
relational network and the associated entities. The main
function of the system is to collect together all of the
constraint equations and symbolic variables and compile
them into a form for efficient solution. In the case of
geometric constraints, all of the equations are
polynomials, and therefore the solution of the constraint system
corresponds to finding the roots of a multivariate
polynomial equation set. It is also often the case that the
constraints must be satisfied in the context of data
measurements such as image boundaries or region moments,
as in the case of “snakes”. In these applications the
solution of the constraint system corresponds to constrained
minimization, or non-linear programming. The standard
approach to this in the IU literature is the
Levenberg-Marquardt algorithm*. In any case, the constraint
system is responsible for collecting the appropriate data and
forming a cost function for the minimization.

Another function of the constraint-system class is to
collect all of the information about the constraints and
symbolic-entities needed to form an efficient
representation for solving the constraints. This process of collection
is reasonably called parsing the constraint network. In
the case of geometric constraints, the equations and
variables are collected to form a Jacobian matrix which is the
primary representation used in the Levenberg-Marquardt
scheme. The equations can be differentiated numerically
or symbolically, as appropriate.

* “Numerical Recipes in C,” W. Press et al., Eds., Cambridge

16  Sensors 

16.1  Scope 

These objects provide a number of important IU
capabilities including generation of synthetic data via
rendering, reasoning about sensory events, and image/data
filtering. Related to sensors are the classes Energy-Objects
and Filters described in the following sections.

The following summarizes our view of the “sensing”
process: Physical sensors take energy from the world,
apply some “focusing” and filtering mechanism to their
energy input, convert one form of energy into another,
possibly filter this new energy form, and finally
convert it into a final discrete form. The energy input to
the sensor is a (generally) complex interaction of various
energy-sources (some possibly controlled by the sensor)
and the objects in the world/scene space. A sensor
input might also be the output of another sensor (e.g. a
flat-bed scanning of a photograph of a scene).

Sensors  are  thus  a  mapping  from  one  space  (energy 
inputs)  to  another  space  (discrete  energy  outputs).  Sen¬ 
sors  can  thus  be  cascaded  to  form  other  sensors.  In  par¬ 
ticular  we  believe  we  should  support  the  view  of  logical 
sensors  wherein  a  sensor’s  input  can  be  the  output  of 
one  or  more  other  sensors.  Thus  a  stereo  sensor  could 
be  defined  which  takes  as  input  the  output  of  two  edge¬ 
detecting  sensors  which  each  took  as  input  the  output  of 
a  camera  sensor.  The  output  of  sensors  is  not  restricted 
to image-like objects, but rather may be any IU object.

Figure 16: The top level constraint object hierarchy.

Sensors will interact with just about every IU object.
The sensing of synthetic scenes will involve just about
all possible object models which can exist in the world
and produce images or IU objects. The sensors
naturally involve coordinate transforms, and calibration
procedures will likely require the use of the constraint
models.
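
The logical-sensor view described earlier in this section, in which a sensor's input can be the output of one or more other sensors, can be sketched as a small Python class. The class name, the toy one-row "images", and the edge and disparity functions are all illustrative assumptions.

```python
class LogicalSensor:
    """A sensor whose input is either the scene or other sensors' outputs."""

    def __init__(self, name, process, inputs=()):
        self.name = name
        self.process = process          # callable that produces the output
        self.inputs = tuple(inputs)     # upstream sensors, possibly empty

    def sense(self, scene):
        if not self.inputs:             # physical sensor: reads the scene
            return self.process(scene)
        # Logical sensor: feed the outputs of the upstream sensors.
        return self.process(*(s.sense(scene) for s in self.inputs))


def edges(row):
    """Positions where a one-row image changes value."""
    return [i for i in range(1, len(row)) if row[i] != row[i - 1]]


def disparity(left_edges, right_edges):
    """Offset between corresponding edge positions."""
    return [a - b for a, b in zip(left_edges, right_edges)]


left_cam = LogicalSensor("left-camera", lambda scene: scene["left"])
right_cam = LogicalSensor("right-camera", lambda scene: scene["right"])
left_edge = LogicalSensor("left-edges", edges, [left_cam])
right_edge = LogicalSensor("right-edges", edges, [right_cam])
stereo = LogicalSensor("stereo", disparity, [left_edge, right_edge])
```

Here the stereo sensor takes the output of two edge-detecting sensors, each of which takes the output of a camera sensor, and its own output is a list of disparities rather than an image.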

The issues we must address with sensor and energy
objects are twofold:

1. what operations need to be supported on such
objects to facilitate IU research, and

2. at what level of detail should the process(es) be
modeled.

Note that the answer to the second issue affects the first
in that the same operation may need to be provided with
respect to various “models”, which will generally affect its
implementation (and computational cost). These two
questions will be addressed in the following discussion.

16.1.1  Operations  of  sensors 

We  begin  with  the  general  operations,  and  then  discuss 
what  will  be  included  in  the  initial  system.  Sensors  are 
composed  of  objects  which  provide  the  functionality  of 
the  basic  operations  of  the  sensor.  These  components 
are  shown  in  figure  17. 

As  pointed  out  before,  one  of  the  operations  that  the 
sensor  section  is  to  provide  is  rendering.  It  is  empha¬ 
sized  that  rendering  for  image  understanding  purposes 
has  quite  a  different  emphasis  than  graphics  rendering. 
In  the  case  of  graphics,  the  purpose  is  to  generate  a  real¬ 
istic  looking  scene.  In  image  understanding  research  the 


goal is to accurately model the image formation process
so that various aspects of image analysis algorithms can
be evaluated.

To be more specific, we view rendering as a special
case of the forward use of a sensor, i.e. mapping from
a world/scene into a data object. Because this forward
mapping view is so prevalent we have defined a logical
subclass called Transducers, which are exactly those
objects supporting only the forward mapping operations of
sensors.

16.1.2  Transducers 

The fundamental operation of transducers is
(logically) to ask how changes in scene/world “map” (project)
into the image/object space. For notational convenience
this single fundamental operation is decomposed into two
operations. The first, Transduce, is supposed to compute a
“likely” image/object given the world/scene. A slightly
more aggressive method, Data-certainty, will allow one to
ask about the certainty of (subparts of) the image/object
given the current (hypothesis of the) scene/world space.

Transducer: scene/world → image/object.

The remaining operations will be those needed to
support these two basic operations. These include setting
parameters which affect rendering, setting parameters
which affect certainty calculations, and a general
calibration which may simultaneously affect both the
world/scene and rendering/certainty parameters.

The  objects  will  come  with  defaults  which  build  up  the 
rendering  and  certainty  calculations  by  a  concatenation 
of  methods  applied  to  the  components  of  the  sensor.  For 
example  the  default  rendering  might  be  constructed  as  a 
ray-tracing  of  the  scene  which  provides  a  super-sampled 
input  to  the  energy-transfer-system  (which  actually  uses 
the reconstruction-prefilter to convert to whatever
sampling rate it desires). The energy-transfer-system would
output a function which is then filtered by the
spectral-response, and then sampled by the output-sampler.

Since each of these might involve “arbitrarily”
complex filtering, the full functionality of a sensor should be
achievable in any single step (e.g. the output-sampler
could do everything). However, this decomposition of
the model was constructed as a sequence of filters to
reduce the implementational burden on the user and to allow
conceptual (and implementational) modeling of sensors
by a collection of simpler standard components. In cases
where each filtering step is a linear filter (and the noise
models are simple enough), the system will compose this
sequence of filtering operations into a single linear
filter, significantly decreasing the computational
complexity. This will allow reasonably efficient implementation
in most cases.

Automatic  definition  of  the  data-certainty  method  is 
considerably  more  difficult  and  will  assume  that  each 
filtering  component  is  from  the  subclass  of  noisy  filters 
which  provide  an  underlying  certainty  function  for  their 
mapping, and that the noise is pointwise independent.
Thus  the  certainties  can  be  directly  composed.  For  the 
general  case  where  inter-dependencies  are  assumed,  the 
user will have to supply the certainty-function directly.

Figure 17: The top level sensor hierarchy.


16.1.3  Image  Sensors 

Sensors are a generalization of transducers where
we consider the true bi-directional mapping nature of
physical sensors. They provide the basic operation
Sense, which returns a sensor-data-object based on a
“likely” image/object (as opposed to Transduce, which
directly returns an image/object). The
Sensor-Data-Object provides a mechanism for mapping from the
image/object back into the world/scene. For this inverse
mapping there are two basic operations. The first,
inverse-point-projection, returns a volume in the scene which
would map into a given image/object point. This inverse
mapping is based entirely on the geometric aspects of the
sensor’s subparts. A more costly operation will be
Scene-Certainty, which will allow one to ask about the certainty
of a scene/world object given the sensory data. Note
that the problem of computing the Scene-Certainty method
is generally ill posed and will involve a number of
assumptions. Related to the Scene-Certainty method is
the method Invert-sensing, which will attempt to return
a likely image/object which, when passed through the
sensor, would account for the data. Obviously this last
method is not even a well defined mapping (let alone
mathematically well-posed) since we have no restrictions
on the space of input world/scenes and cannot hope
to have a unique definition. In general this method will
need to be provided by the user, though a default will be
provided which assumes the world is a frontal projection
of a plane with an “image” painted on it, in which case
this method is simply image deblurring and resampling.

We want to emphasize that all of the certainty
calculations associated with sensors need not be physically
accurate. Rather, they are mathematical models of the
certainty and hence their relation to reality depends on
the faithfulness of the model. The goal is to provide
a way for IU researchers to test their algorithms using
synthetic data generated according to a “pure” model
(e.g. Lambertian surface with point light source viewed
orthographically with a blur-free camera and white noise
of a given level) or increase the realism and also test
under different, more realistic conditions (e.g.
Torrance-Sparrow type partially specular objects with multiple
extended light sources, viewed through a thin lens with
depth of field, focus and chromatic aberration effects,
with position-dependent exponentially distributed noise
and a simple model of CCD charge bleeding). In
addition we hope to provide a framework (via the
bidirectional certainty mappings) which will be useful to IU
researchers engaged in sensor modeling and using
sensor models in IU applications. By providing a consistent
interface with respect to certainties, higher level models
that want to incorporate this information into sensor
fusion tasks can work with a very wide variety of sensors
without changing their interface to the sensors.

16.2  Level  of  modeling 

We believe that the mechanisms proposed are sufficiently
extensible to allow many levels of detail to be effectively
pursued. The question as to what levels will be provided
in the basic IUE is a function of manpower devoted to
the sensor section for initial development, the willingness
of researchers to allow incorporation of existing software
for certain modeling problems, and the effort necessary
to convert or interface to existing software. We remind
the reader that the system is not intended to be real-time
and some of the proposed operations, especially if
modeled at a very low level, are expected to take considerable
compute power/time.

It seems clear that we should provide tools to allow
middle level researchers to build upon lower level
models, when they wish to, but also provide an efficient
implementation of an abstract model closer to their level of
use. E.g. we have proposed a class of Grayscale
Camera which returns images, but does not get into such
detail as photo-site layout, charge bleeding, photo-site
spectral characteristics, etc. Thus a Grayscale
camera equipped with a simple pinhole lens allows
Transduction in a fraction of a second, while a CCD camera with
a zoom-lens may take many hours. From the user’s point
of view, all methods applicable to a Grayscale camera
are also supported by a CCD camera, only slower.

16.3  Energy  Objects 

These objects provide the functionality for the IUE to
represent  and  reason  about  energy-waves  (light).  Most 
of  the  impact  of  this  class  has  to  do  with  rendering  (syn¬ 
thetic  sensing),  though  the  objects  could  also  be  used 
directly  by  other  methods. 

There are two basic subclasses of energy-objects:
wave-bundles and energy-sources. The first category
provides a low level way of representing a
generalization of light-rays and comes in many flavors, including
monochrome rays, RGB rays, multi-spectral rays,
polarized multi-spectral rays, light pencils, and light volumes.
These will allow IUE users to do their own ray-tracing
of scenes (if needed for advanced applications, though
this will be rather slow compared to traditional
ray-tracers). More commonly they will be used for reasoning
about small portions of a scene in a more sophisticated
manner (e.g. phase computations for coherent sources in
regions in the neighborhood of depth edges). In the initial
IUE implementation, few of the potential
flavors will actually be used, though the framework and
methods will be defined for most.

The second class, energy-sources, is a
generalization for all other characteristics of the way objects
absorb/emit/reflect energy. All objects which are
intended to be rendered will subclass off energy-sources.
A regular (reflective) object will simply have an
energy-distribution that is functionally related to its energy
input and has total power output < the total power
input. The basic operations allow one to affect
parameters, including energy (spectral) distribution, total
power, and BDRF. In addition there are methods to allow
incremental evaluation and update of a scene
containing energy-sources, and for computing the energy flux at
a point. The proposed sources include point sources,
planar extended sources, sun-light, laser-sources,
patterned sources and environment-mapped sources. We
further propose an energy-volume source which may be
useful for modeling fog or smoke effects.

While it may seem counterintuitive to have all objects
be energy-sources, it is actually more realistic. When
interreflections are taken into account, each surface acts,
in one sense, like an energy source. By making a single
object from which to derive all reflective properties, we
can ensure that for the same scene and sensors, those
researchers wishing to include interreflections in their
rendering can be accommodated. To make the
rendering process efficient (for simple rendering models), the
scene objects will actually have a list of “active” sources.
These are sources which are active in the sense of
outputting more energy than they receive.

16.4  Filters 

Both the sensor objects and the energy objects make
heavy use of filter objects. This class provides a means
for “filtering” functions (discrete or continuous) in a
consistent manner. The concept of filtering is too extensive
to come up with an efficient method for implementing any
“filtering” mechanism in our OOPL. This class takes
filtering sub-topics which are important for the IUE and
attempts to handle them.

A very important (from an implementational efficiency
point of view) category of filters is linear filters. The
base-filter class provides a slot to note if a
filter is linear. The “linear” property of a filter is never
checked, and every filter may operate as though it were
linear; it is up to the IUE user to verify linearity and
to set the isLinear slot. The base-filter class also
provides for such operations as building a filter-sequence (a
sequence of filters which are called in order), and
manipulating the linear filter characteristics of the filter such as
specifying the impulse response, computing the
impulse-response directly, setting the sampling rate for the FFT
representation, computing the MTF from the
impulse-response, specifying the MTF directly, and computing
the impulse response from the MTF. The base-class is
virtual; no instances are allowed. One can specify, for
linear filters, if the filter should be applied directly, or if
it should be used in the frequency domain (in which case
the MTF is really the definition of the filter). For
higher dimensional filters, there is the ability to
specify a separable form for linear filters (but there is no
way to compute them automatically). Finally, there is
a compose-linear-filter operation which will allow one to
build a new linear filter directly from a sequence of
others (by up-sampling their MTFs to a consistent rate and
doing pointwise multiplication).
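
The composition of linear filters by pointwise multiplication in the frequency domain can be illustrated with a short Python sketch. It makes several simplifying assumptions: filters are one-dimensional and given by finite impulse responses, zero-padding to a common length stands in for up-sampling to a consistent rate, and the full complex frequency response is multiplied (strictly, the MTF is only its magnitude). The function names are hypothetical.

```python
import cmath


def dft(x):
    """Discrete Fourier transform of a short sequence (direct O(n^2) form)."""
    n = len(x)
    return [sum(x[k] * cmath.exp(-2j * cmath.pi * f * k / n) for k in range(n))
            for f in range(n)]


def idft(X):
    """Inverse transform, returning the real part of each sample."""
    n = len(X)
    return [sum(X[f] * cmath.exp(2j * cmath.pi * f * k / n)
                for f in range(n)).real / n
            for k in range(n)]


def compose_linear_filters(impulse_responses):
    """Impulse response of the cascade of the given linear filters."""
    # Pad to the length of the full linear convolution so that the
    # circular convolution implied by the DFT does not wrap around.
    n = sum(len(h) for h in impulse_responses) - (len(impulse_responses) - 1)
    product = [1 + 0j] * n
    for h in impulse_responses:
        H = dft(list(h) + [0.0] * (n - len(h)))        # frequency response
        product = [p * q for p, q in zip(product, H)]  # pointwise multiply
    return idft(product)
```

For example, composing the two-tap filters `[1, 1]` and `[1, -1]` yields, up to rounding, the impulse response `[1, 0, -1]`, which is the same as convolving them directly. Applying the single composed filter then replaces two passes over the data.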

The four basic subclasses here are the DtoA filter, the
AtoD filter, the DtoD filter and the AtoA filter. These
provide a consistent interface to allow transformation
between digital and analog forms for data and to allow
building of complex transformations from simpler ones.
To allow flexibility, the Filters are applied to either fixed
dimension functions with a consistent calling mechanism
or to arrays. Their output can be either form except,
obviously, that analog outputs cannot be arrays. Note that
since images are subclassed from arrays, filtering images
is straightforward. We believe the AtoA filter
class will be rarely used: filtering in functional form is
generally hard without knowledge of the internal
representation (which is not specified), and a cascade of AtoD
and DtoA is better specified in that form.

17 Other Extensions

In spite of the rather extensive object specifications in
the preceding sections, there are a number of areas which
require further investigation. These areas will be further
defined in the final IUE specification.

17.1 The Spatial Object

In  developing  the  various  class  structures  we  have  no¬ 
ticed  that  common  spatial  attributes  and  operations 
arise  over  and  over  again.  Examples  are, 

•  spatial  dimension 

•  coordinate  frames 

• coordinate transformations

•  intersection,  e.g.  point-on-line 

•  containment,  e.g.  point-in-polygon 

We  intend  to  capitalize  on  this  abstraction  to  streamline 
the  classes  associated  with  spatial  description  such  as 
curves, surfaces and solids.

17.2  Spatial  Indices 

Many IU algorithms gain efficiency through the use of
spatial indexing methods. We have already mentioned
the Hough array as a classic example, but there are many
others which should be available in the IUE. The
following is a partial list to illustrate the concept.

• k-d Tree

•  Minimal  Spanning  Tree 

•  Quadtree 

•  Octree 

• Delaunay Triangulation

•  Various  Hashing  Schemes 

17.3  Statistical  Operations 

We recognize that statistics is a major influence in many
IU approaches and algorithms. We intend to provide a
basic set of objects and tools for using statistical
methods. It is likely that a Kalman filter infrastructure will
be specified since this approach has experienced
considerable popularity in recent years, particularly in active
vision work.

17.4  Object  Oriented  Databases 

Not reflected in this paper are some preliminary studies
on  the  emerging  technology  of  object  oriented  databases. 
There  are  already  commercial  packages  available  for  C++ 
and  we  are  studying  various  approaches  to  make  such  fa¬ 
cilities  available  for  Lisp.  The  availability  of  a  persistent 
database  for  objects  will  considerably  improve  the  effi¬ 
ciency  of  rapid  prototyping,  since  currently  a  great  deal 
of  effort  is  expended  on  file  I/O  and  specialized  databases 
for efficient retrieval of IU structures.

17.5  Data  Standards 

We have already discussed the problem of import and
exchange of IU data and results between other systems
and the IUE. Perhaps the most important role of the
IUE committee is to define and maintain a data exchange
standard for IU data structures.

A preliminary design for the exchange mechanism has
been defined based on a list structure. The most
important issue is the definition of keywords for common IU
attributes and data structures. It is planned to define
file structures which carry the data definition along with
the data so that exchange interfaces can be easily
implemented. One, perhaps over-optimistic, idea is that the
IUE specification LaTeX macros can be used to generate
the data exchange formats directly.

ABSTRACT

RCDE  is  a  software  environment  for  the  devel¬ 
opment  of  image  understanding  algorithms.  The 
application focus of RCDE is on image exploitation
where the exploitation tasks are supported
by  2D  and  3D  models  of  the  geographic  site  be¬ 
ing  analyzed.  An  initial  prototype  for  RCDE 
is  SRI’s  Cartographic  Modeling  Environment. 
This  paper  reviews  the  CME  design  and  illus¬ 
trates  the  application  of  CME  to  site  modeling 
scenarios. 

1  Introduction 

1.1  RADIUS  and  RCDE 

Research  and  Development  for  Image  Under¬ 
standing  Systems,  or  RADIUS,  is  a  DARPA- 
funded  project  to  develop  and  demonstrate  new 
image  exploitation  capabilities,  based  on  image 
understanding  techniques.  The  central  theme 
of  RADIUS  is  the  use  of  three  dimensional  site 
models  which  provide  the  contextual  informa¬ 
tion  needed  by  image  understanding  algorithms. 
The  premise  of  RADIUS  is  that  by  registering 
a  site  model  to  an  image,  the  identification  of 
objects  and  the  detection  of  significant  change 
is  much  more  reliable  than  by  totally  bottom-up 


procedures,  such  as  image  subtraction. 

In  order  to  demonstrate  these  model-supported 
image  understanding  techniques,  a  flexible  soft¬ 
ware  environment  is  being  developed  called  the 
RADIUS  Common  Development  Environment, 
or  RCDE.  RCDE  provides  a  set  of  basic  data 
structures and algorithms to enable image
understanding algorithms to be demonstrated on
application  images  so  that  individual  researchers 
do  not  have  to  repeat  the  development  of  these 
basic  tools.  RCDE  also  provides  the  infras¬ 
tructure  to  handle  large  images  and  a  full  set 
of  geospecific  coordinate  representations.  Tools 
are  also  provided  to  construct  geometric  models 
from  image  data  and  to  exploit  the  constraints 
provided  by  terrain  and  sun  ray  geometry. 

The goal is to place RCDE in many IU labs so
that groups interested in IU research focused on
image exploitation can conduct experiments in
the context of a realistic application with
minimal software development overhead. RCDE will
provide basic data structures and processing
operations that often arise in such applications as
object recognition, change detection and image
perspective transformation.

Consider a typical example where an IU
researcher may have an interesting new idea for
image feature extraction. RCDE will provide
a substrate for developing an evaluation
experiment for the new feature extraction method.
The new feature extraction results can be used
by other experimental systems built on RCDE
and the performance of the new features can be
evaluated in terms of improved performance of
these other systems.


2  History 

The SRI Artificial Intelligence Center
Perception Group has been conducting computational
vision research in cartography and photo
interpretation for over 15 years. Toward this end,
SRI has developed two major software tools:
ImagCalc and the Cartographic Modeling
Environment (CME).

ImagCalc  is  a  general-purpose  system  for  image 
manipulation  that  supports  a  convenient  user  in¬ 
terface  to  a  rich  collection  of  image  operators 
and  interactive  tools.  Of  particular  importance 
is  an  underlying  (internal)  image  representation 
that  efficiently  handles  the  manipulation  and 
display  of  very  large  images  (larger  than  4k  x  4k 
pixels). 

CME  is  derived  from  ImagCalc  and  has  been  de¬ 
signed  for  interactively  deriving  3-D  models  of 
the  world  using  real  world  imagery  and  geomet¬ 
ric  constraints.  The  model  derivation  process 
can  be  thought  of  as  inverse  computer  graphics, 
i.e.,  generating  the  3-D  models  using  images  of 
the  world,  rather  than  generating  images  using 
models  of  the  world.  CME  combines  the  use 
of  monocular  and  stereo  imagery,  multiple  im¬ 
ages  of  greatly  differing  geometries,  digital  ter¬ 
rain  elevation  data  (DTED),  and  illumination 
(sun)  models  in  a  framework  that  allows  the  user 
to  interact  with  wire  frame  models  of  world  ob¬ 
jects  simultaneously  overlaid  on  all  of  the  im¬ 
age  sources.  The  illumination  models  allow  the 
derivation  of  object  heights  from  their  shadow 
lengths  as  well  as  enabling  objects  emplaced  in 
a  scene  by  the  user  to  be  shaded  realistically  in 
accordance  with  the  sun  angle  in  the  basic  im¬ 
age  being  displayed.  The  resulting  world  object 
models  and  terrain  models  can  be  used  to  render 
individual  new  images  at  arbitrary  camera  po¬ 
sitions,  and  sequences  of  images  along  arbitrary 
flight paths. Both the terrain and object models
can be rendered using texture maps derived
from actual imagery.
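The shadow-based height derivation mentioned above reduces to simple trigonometry in the easiest case. The following is a minimal sketch, not CME code: it assumes flat ground, a vertical object, a shadow length already measured on the ground plane, and a known sun elevation angle. The function name and units are illustrative.

```python
import math


def height_from_shadow(shadow_length_m, sun_elevation_deg):
    """Height of a vertical object on flat ground, given its ground shadow length.

    Assumes the shadow length is measured on the ground (not in image pixels)
    and that the sun elevation is measured from the horizon.
    """
    return shadow_length_m * math.tan(math.radians(sun_elevation_deg))


# A 10 m shadow under a sun 45 degrees above the horizon implies a 10 m object.
print(round(height_from_shadow(10.0, 45.0), 6))
```

In practice the shadow length would first have to be recovered from image measurements through the camera and terrain models, which this sketch leaves out.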

Both  ImagCalc  and  CME  are  currently  imple¬ 
mented  only  on  the  Symbolics  36xx  series  LISP 
Machines.  ImagCalc  and  CME  each  consist  of 
approximately  60000  lines  (2  million  bytes)  of 
source  code. 

ImagCalc  has  been  licensed  to  approximately  34 
sites,  including  12  university  sites. 

2.1  CMEE  Project 

Recently, GE's Military and Data Systems
Operation (M&DSO) and GE's Corporate Research
and Development (CRD) have teamed
with SRI to further develop the capabilities of
CME and port the system to a Unix platform.
The project, called CMEE for CME Enhancement,
is focused on the documentation of CME
and on a formal specification of the requirements
and design of RCDE. The CMEE project is also
exploring issues that involve the use of both Lisp
and C programming for the development of
image understanding systems using CME.

3  Architecture 

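The architecture of CME is shown in Figure 1,
which illustrates the scope of the system. There
are four main entities which can be constructed
and manipulated by the system: images, 3D
objects, terrain, and cameras. These entities and
the facilities that support them are described below.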

Images  CME  can  represent  a  large  variety  of 
image  data  types  and  a  novel  image  address 
mapping  scheme  enables  efficient  access  to 
large  images. 

Image  Operators  A  wide  range  of  image  pro¬ 
cessing  operations  are  available  and  acces¬ 
sible  by  menu  or  by  a  high  level  Lisp  pro¬ 
gramming  structure. 

Image  Display  CME  supports  fast  roaming 
and  zooming  that  is  comparable  in  func¬ 
tionality  but  somewhat  slower  in  perfor¬ 
mance  than  analogous  film-based  light  table 
manipulations. 

Windows  The  window  system  provides  a  large 
spectrum  of  geometric  transformations  and 
graphics  functions  for  image  and  wireframe 
display. 

Object Models  A number of object families
are available, such as superquadrics and
house shapes. 3D curves and ribbons are
also supported.

Camera Models  CME provides tools for
modeling perspective image projection as
well as more specialized camera models
such as the satellite pushbroom sensor. These
models are used to relate the 3D geometry
of objects with the 2D geometry of image
features.

Terrain Models  Terrain is represented in
CME as a triangulated mesh or as a simple
2D array of elevations. Terrain data is
necessary to portray a realistic 3D site
configuration.

Site Model  The position and orientation of
the CME entities can be represented in a
common coordinate frame. A local coordinate
system is called the site. Geocentric
and geodetic world coordinate frames
are available, and geolocations can be
consistently determined in a number of
common coordinate systems.

Rendering  A  site  model  can  be  displayed  as 
a  realistic  image  view  by  the  application  of 
phototexture  onto  the  3D  model  faces  and 
the  terrain  facets.  A  sequence  of  views  is 
used  to  generate  a  realistic  “fly-through”  of 
the  site. 

User  Interface  All  of  the  CME  functions  can 
be  accessed  through  a  menu-driven  user  in¬ 
terface.  An  interactive  programming  envi¬ 
ronment  is  also  available  by  executing  in¬ 
terpreted  Lisp  functions.  A  history  mech¬ 
anism  keeps  track  of  cached  functions  for 
efficiency  and  interactive  programming. 
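The history mechanism mentioned under User Interface can be pictured as a cache keyed on the operation, its input, and its parameters. The sketch below is only an illustration of that idea in Python; the class name `History`, the `apply` method, and the tuple-based image representation are assumptions made for the example and do not describe CME's actual Lisp implementation.

```python
class History:
    """Toy history network: results are cached by (operation, input, parameters)."""

    def __init__(self):
        self._cache = {}
        self.log = []  # operations actually computed, in order

    def apply(self, name, func, image, **params):
        key = (name, image, tuple(sorted(params.items())))
        if key not in self._cache:
            # Compute once, remember the result and how it was produced.
            self._cache[key] = func(image, **params)
            self.log.append((name, dict(params)))
        return self._cache[key]


def threshold(image, level):
    """Return a 1-bit image: 1 where the pixel is at or above level."""
    return tuple(tuple(1 if p >= level else 0 for p in row) for row in image)


history = History()
image = ((10, 200), (130, 90))  # tuples, so the image can serve as a dictionary key
first = history.apply("threshold", threshold, image, level=128)
second = history.apply("threshold", threshold, image, level=128)
print(first is second, history.log)
```

The second call returns the cached object without recomputation, and the log holds one entry that could be replayed later as a script.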

4  Scenarios 

The  functions  of  CME  are  probably  best  illus¬ 
trated  by  a  number  of  sample  scenarios.  In  this 
section  we  consider  three  examples  which  exer¬ 
cise  a  reasonable  fraction  of  the  system. 

4.1  Image  Exploitation 

ImagCalc  is  a  general-purpose  image  manipula¬ 
tion  system  that  forms  the  basis  of  CME.  This 
scenario  demonstrates  how  ImagCalc  aids  the 
user in extracting useful information from
grey-level aerial images. Figures 2 through 6
illustrate the results.

First,  images  must  be  loaded  into  ImagCalc 
from  the  file  system.  Selecting  the  Dired  menu 
(DIRectory  EDit)  allows  the  user  to  choose  a 
file  system  directory,  retrieve  image  files,  and 
place  them  in  an  ImagCalc  pane.  The  upper 
two  panes  of  Figure  2  contain  images  that  were 
retrieved  from  a  file;  the  upper  left  pane  shows 
an  overhead  image  of  an  airport,  while  the  upper 
right  pane  shows  an  oblique  view  of  the  terminal 
area. 

The  lower  two  panes  of  Figure  2  illustrate  the 
results  of  image  roaming  and  zooming  opera¬ 
tions.  The  lower  left  image  has  been  zoomed 
out,  and  the  lower  right  image  has  been  zoomed 
in.  Two  types  of  zooming  are  available:  the 
Fast  Zoom  commands  perform  a  pixel-replicated 
zoom  for  general  viewing  purposes,  while  the 
Zoom  commands  use  interpolation  for  greater 
clarity. A number of other commands are available
for scrolling or repositioning an image on
the  pane.  For  example,  the  Recenter  command 
moves  the  selected  point  to  the  middle  of  the 
pane,  while  %  Reposition  is  used  to  move  the 
image  about  the  pane  by  selecting  the  relative 
position  within  the  entire  image. 
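The difference between the two zoom styles can be illustrated with a small sketch. This is not ImagCalc code: images are plain lists of rows, the function names are hypothetical, and bilinear interpolation is assumed as the interpolating method.

```python
def fast_zoom(image, factor):
    """Pixel-replicated zoom: each pixel becomes a factor x factor block."""
    out = []
    for row in image:
        wide = [p for p in row for _ in range(factor)]
        out.extend([list(wide) for _ in range(factor)])
    return out


def interpolated_zoom(image, factor):
    """Bilinear zoom: sample the source image at fractional coordinates."""
    h, w = len(image), len(image[0])
    out = []
    for y in range(h * factor):
        sy = min(y / factor, h - 1)
        y0 = int(sy)
        y1 = min(y0 + 1, h - 1)
        fy = sy - y0
        row = []
        for x in range(w * factor):
            sx = min(x / factor, w - 1)
            x0 = int(sx)
            x1 = min(x0 + 1, w - 1)
            fx = sx - x0
            top = image[y0][x0] * (1 - fx) + image[y0][x1] * fx
            bottom = image[y1][x0] * (1 - fx) + image[y1][x1] * fx
            row.append(top * (1 - fy) + bottom * fy)
        out.append(row)
    return out


tiny = [[0, 100]]
print(fast_zoom(tiny, 2))
print(interpolated_zoom(tiny, 2))
```

Replication keeps hard block edges and needs no arithmetic per pixel, which is why it suits fast general viewing; interpolation introduces intermediate grey levels that make detail easier to read.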

Images can be manipulated with respect to
their grey-level distribution (histogram), as
illustrated in Figure 3. The Contrast Stretch
operation under the Enhance menu computes a
histogram of an image with adjustable upper
and lower thresholds. By adjusting the thresholds
with mouse clicks, the user can stretch the
histogram interactively. The lower left pane of
the figure shows the operation in progress. A
green line representing the lower threshold can
be seen near grey level 190 of the histogram, and
the upper threshold is set at 256. Grey levels
above and below the thresholds are mapped to
white and black, respectively, while the remaining
histogram is stretched to the original range.
To facilitate interactive processing, the Contrast
Stretch command modifies the color map
directly, altering all image views on the screen;
this effect disappears when the operation is
complete.
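The contrast stretch described above amounts to a lookup table applied to every grey level, which is also why it can be done by changing the color map rather than the pixels. The following sketch assumes 8-bit grey levels and a linear stretch between the thresholds; the function names are illustrative and are not ImagCalc commands.

```python
def stretch_lut(low, high, levels=256):
    """Lookup table that maps [low, high] linearly onto [0, levels - 1].

    Grey levels at or below low map to black, at or above high to white.
    Assumes low < high.
    """
    lut = []
    for g in range(levels):
        if g <= low:
            lut.append(0)
        elif g >= high:
            lut.append(levels - 1)
        else:
            lut.append(round((g - low) * (levels - 1) / (high - low)))
    return lut


def apply_lut(image, lut):
    """Replace each pixel by its table entry; the source image is left unchanged."""
    return [[lut[p] for p in row] for row in image]


lut = stretch_lut(190, 255)
print(apply_lut([[100, 190, 255]], lut))
```

Because only the 256-entry table changes as the thresholds move, the display can be updated interactively without touching the image data.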

Figure 4 illustrates the Window command. The
user can select a region of interest in one image
and paste that region into another pane. Note
the green region-of-interest box at the bottom
of the lower right pane. The resulting pane has
also been zoomed to examine its detail.

Figure 1: The architecture of CME. CME provides a closely integrated structure for the
representation of images, 3D objects, terrain and cameras. These elements are the basic tools
needed to support image exploitation.

This  new  area  of  interest  (the  small  airplane)  is 
then  used  in  Figure  5  to  generate  a  surface  plot. 
The Plot 3D command plots the image grey levels
as a depth map and allows the user to
interactively vary the angle of view. The result
is  shown  in  the  lower  right  pane  of  the  figure; 
the  surface  is  viewed  from  the  lower  right  cor¬ 
ner  of  the  airplane  image.  With  some  difficulty, 
you  can  see  the  green  line  on  the  image  which 
indicates  the  view  direction. 

Figure 6 illustrates finding the outline of the
small airplane using edge operators. The image
is first filtered using the ∇²G (Laplacian of
Gaussian) operator, which is implemented by the
Difference of Gaussians command under the
Enhance menu. Then the Zero Cross Image
command identifies all zero-value pixels, which
are formed at edge crossings when using ∇²G.
The user can also modify the Zero Cross Image
threshold to reduce the effects of quantization.
The resulting airplane outline appears in the
lower right pane of the figure.
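The edge-finding steps above can be sketched in one dimension, which keeps the example short while showing the same idea: filter with a difference of Gaussians, then report sign changes of the response. Everything here is an assumption made for illustration, including the kernel radius, the sigma ratio of 1.6, and the way the threshold is applied; ImagCalc operates on 2-D images and its parameters are not given in the text.

```python
import math


def gaussian_kernel(sigma, radius):
    """Normalized, truncated Gaussian weights for offsets -radius..radius."""
    k = [math.exp(-(i * i) / (2 * sigma * sigma)) for i in range(-radius, radius + 1)]
    total = sum(k)
    return [v / total for v in k]


def smooth(signal, kernel):
    """Same-length filtering with edge samples replicated."""
    r = len(kernel) // 2
    n = len(signal)
    out = []
    for i in range(n):
        acc = 0.0
        for j, w in enumerate(kernel):
            idx = min(max(i + j - r, 0), n - 1)
            acc += w * signal[idx]
        out.append(acc)
    return out


def difference_of_gaussians(signal, sigma=1.0, ratio=1.6, radius=6):
    """Narrow-Gaussian response minus wide-Gaussian response."""
    narrow = smooth(signal, gaussian_kernel(sigma, radius))
    wide = smooth(signal, gaussian_kernel(sigma * ratio, radius))
    return [a - b for a, b in zip(narrow, wide)]


def zero_crossings(values, threshold=1e-6):
    """Indices i where the response changes sign between i and i + 1.

    Both neighbours must exceed the threshold in magnitude, which suppresses
    sign flips caused by rounding noise in flat regions.
    """
    hits = []
    for i in range(len(values) - 1):
        a, b = values[i], values[i + 1]
        if a * b < 0 and min(abs(a), abs(b)) > threshold:
            hits.append(i)
    return hits


step_edge = [0.0] * 10 + [100.0] * 10
print(zero_crossings(difference_of_gaussians(step_edge)))
```

For the step edge between samples 9 and 10, the only reported crossing is at index 9; the threshold plays the role the text assigns to it, discarding crossings that come from numerical noise rather than image structure.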

In summary, this scenario shows how the ImagCalc
portion of CME can be used to manipulate
digital images. The demonstrated operations
include roaming and zooming around an
image,  contrast  stretching,  cutting  and  pasting 
sections  of  images,  generating  depth  maps,  and 
detecting  edges.  ImagCalc  forms  the  basis  for 
CME’s  user  interface  and  is  a  powerful  image 
processing  environment. 

5  Scene  Modeling 

The  CME  Object  system  allows  the  user  to  cre¬ 
ate  geometric  objects,  define  their  characteris¬ 
tics,  and  view  the  results.  This  scenario  demon¬ 
strates  how  to  create  a  scene  of  objects,  and  then 
view  the  scene  three  ways  with  three  separate 
cameras.  Figure  7  illustrates  the  scene  model¬ 
ing  scenario. 

The  first  step  in  building  an  Object  model  is  to 
create  a  camera  model  for  the  scene.  Starting 
from  a  blank  CME  pane  and  the  default  Bucky 
menu,  select  the  View  Menu  command.  From 
the subsequent menus, choose New View Transform
and then New Blank View Transform to
initialize the camera model. A Cartesian
coordinate origin appears in the center of the pane
and  a  small  box  that  indicates  the  camera’s  pres¬ 
ence  appears  in  the  lower  left  corner  of  the  pane. 
What  the  user  sees,  therefore,  is  a  perspective 
camera’s  view  of  the  world;  the  CME  pane  is 
the  camera  image  plane. 
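The statement that the pane is the camera image plane can be made concrete with the standard pinhole projection. This is a generic sketch, not CME's camera model: it assumes points already expressed in the camera frame with z pointing away from the camera, and a hypothetical `focal_length` parameter.

```python
def project(point, focal_length=1.0):
    """Pinhole projection of a camera-frame point (x, y, z) onto the image plane."""
    x, y, z = point
    if z <= 0:
        raise ValueError("point must be in front of the camera")
    return (focal_length * x / z, focal_length * y / z)


# Two ends of a wireframe edge; the farther end projects closer to the image center.
near_end = project((2.0, 4.0, 2.0))
far_end = project((2.0, 4.0, 8.0))
print(near_end, far_end)
```

Drawing a wireframe then consists of projecting each vertex this way and connecting the projected points.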

Objects  are  then  created  to  populate  the  scene. 
In  this  example,  a  house,  a  cylinder  (silo),  and 
a  half-cylinder  (quonset  hut)  model  a  farm, 
while  a  set  of  linked  ribbon  segments  repre¬ 
sents  a  road.  The  objects  are  instantiated  us¬ 
ing  the  Create  Object  command,  then  altered 
or  moved  using  the  menu  commands  unique  to 
each  class  of  objects.  Modifying  the  object  pa¬ 
rameters  is  intuitive  and  efficient  when  using  the 
Bucky menus. Alternatively, each object, when
selected by the mouse pointer, can be modified
through the who-line Menu command, which
causes a parameter window to pop up for editing
the object parameters directly. For example, the
roof pitch of a house defaults to 0.5; the pitch
can be changed interactively using the mouse, or
directly by typing a new value in the parameter
window.

To show three views of a single scene, it is
necessary to copy the view transform to the lower
two panes. Choosing Exact Copy instead of New
Blank View Transform allows the user to copy the
existing scene (transform model and all the
objects) to the lower two panes. All three panes
then represent separate views of the same model.

Although  the  effect  is  not  visible  in  the  figure, 
objects  are  highlighted  in  a  different  color  when 
selected  via  the  mouse.  Therefore,  when  an  ob¬ 
ject  that  appears  in  multiple  panes  is  selected, 
all  views  of  that  object  are  highlighted.  This 
feature  illustrates  how  CME  is  an  excellent  ex¬ 
ample  of  object-oriented  software  design. 

The  three  panes  now  contain  identical  repre¬ 
sentations  of  the  modeled  scene  as  viewed  with 
three  distinct  cameras.  The  default  camera  pa¬ 
rameters  produce  the  perspective  view  shown  in 
the  upper  right  pane  of  Figure  7.  The  views 
shown  in  the  lower  panes  are  obtained  by  moving 
their cameras to another viewpoint. Two
coordinate systems are available for moving the
camera interactively: scene-centered and
camera-centered. The Azimuth/Elevation command
allows the user to rotate the camera about a
point fixed in the scene, and is intuitive to
use. Camera-centered translations and rotations
(e.g., UV Roll, Move Z) are powerful
photogrammetric tools and use a standard notation.
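A scene-centered motion of the Azimuth/Elevation kind can be sketched as moving the camera over a sphere around a fixed target point. The conventions below (z up, azimuth measured from the +x axis toward +y, elevation measured from the horizontal plane) are assumptions for the example; the text does not state which conventions CME uses.

```python
import math


def orbit_camera(target, distance, azimuth_deg, elevation_deg):
    """Camera position at a given distance from target for an azimuth/elevation pair."""
    az = math.radians(azimuth_deg)
    el = math.radians(elevation_deg)
    tx, ty, tz = target
    return (
        tx + distance * math.cos(el) * math.cos(az),
        ty + distance * math.cos(el) * math.sin(az),
        tz + distance * math.sin(el),
    )


overhead = orbit_camera((0.0, 0.0, 0.0), 10.0, 0.0, 90.0)
side = orbit_camera((0.0, 0.0, 0.0), 10.0, 90.0, 0.0)
print(overhead, side)
```

The camera stays the same distance from the target for every pair of angles, which is what makes this style of control easy to use when inspecting a model.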

Given  multiple  views  of  a  scene,  it  is  often  useful 
to  represent  the  camera  directly  on  the  screen 
as  an  object.  This  screen  representation  must 
be  explicitly  invoked  by  choosing  Camera  from 
the  Create  Object  menu,  then  specifying  the 
source  and  destination  views.  The  lower  right 
pane  of  Figure  7  shows  the  two  cameras  (Over¬ 
head  and  Side-Looking)  as  seen  from  a  third 
view.  Seeing  the  camera  represented  in  another 
view  is  quite  useful  when  moving  the  camera 
around,  especially  when  the  user  is  not  comfort¬ 
able  with  using  the  perspective  camera  model. 

When the camera rotations are complete, the
three panes look exactly as seen in Figure 7.
The  upper  right  pane  contains  a  view  of  the  farm 
scene,  with  a  house  and  silo  in  the  foreground 
and  a  quonset  hut  in  the  background.  The  lower 
left  pane  is  a  view  of  the  same  farm  from  down 
the  road.  Finally,  the  lower  right  pane  shows 
the  scene  from  farther  away  at  ground  level;  the 
two  pyramid-like  objects  on  the  top  and  left  are 
the  cameras  whose  views  are  shown  in  the  first 
two  panes.  The  camera  rays  converge  at  the 
camera’s  focal  point,  and  the  image  plane  is  vis¬ 
ible.  CME’s  sophisticated  support  of  3-D  trans¬ 
formations  allows  multiple  views  to  coexist  in  a 
smoothly  integrated  fashion. 

This  scenario  shows  how  to  create  and  manipu¬ 
late  a  scene  of  3-D  objects,  which  can  be  viewed 
by  multiple  cameras  from  any  perspective.  The 
cameras  can  also  be  viewed  as  objects  them¬ 
selves,  aiding  the  user  in  his  interpretation  of 
the  scene.  Although  ImagCalc  is  used  as  a  ba¬ 
sis  for  the  user  interface,  this  scenario  illustrates 
the  flexibility  of  the  CME  Object  system. 

6  Scene  Rendering 

The  separate  capabilities  of  both  ImagCalc  and 
Object  can  be  used  together  to  form  an  even 
more  powerful  exploitation  environment.  This 
scenario  illustrates  how  to  model  a  group  of 
buildings  and  render  them  using  two  images. 
The  result  is  a  realistic  3-D  model  that  can  be 
used to generate fly-through image sequences.
This scenario will not be discussed in explicit
detail since its operations are considerably more
complex than the others.

Figure  8  illustrates  the  process  of  building  a  wire 
frame model using aerial imagery. Two distinct
images of the same scene are placed in
the upper and lower panes, respectively. Then
a  wire  frame  model  of  the  important  buildings 
is  built  using  the  techniques  illustrated  in  the 
modeling  scenario.  A  second  view  of  the  model 
is  created  and  the  second  camera  is  adjusted  so 
that  the  model  matches  the  orientation  of  the 
buildings  in  the  underlying  image.  Ideally,  the 
image  views  are  different  enough  to  permit  a 
good  fit  of  the  model  to  the  images,  although 
some  adjustment  of  individual  buildings  may  be 
required. 

After  the  model  construction  is  completed,  the 
model  faces  are  rendered  with  photo  textures 
extracted  from  one  of  the  images.  The  CME 
Render  operations  are  used  to  assign  texture  to 
the  model.  The  result,  illustrated  in  Figure  9,  is 
a  3-D  model  with  realistic  texture  and  shading. 
Since  the  scene  is  three-dimensional,  it  can  be 
viewed  from  various  angles  to  support  applica¬ 
tions  such  as  fly-through  mission  rehearsal. 

In  summary,  this  scenario  illustrates  the  power 
of  CME  to  integrate  its  component  programs 
in a seamless fashion. An intuitive, powerful
interface aids the user in generating realistic,
three-dimensional scenes from two-dimensional
imagery.

7  Software  Design 

CME is a powerful environment for developing
new IU algorithms. Some of the features which
make CME such an environment are described
below.

Object-Oriented Design  CME has an
object-oriented design and is currently based on the
Symbolics Flavor System. A portion of the object
hierarchy is shown in Figure 10. The hierarchy
is based on object inheritance, where
objects near the top of the hierarchy are more
general than those at the bottom. A good example
of the use of inheritance is the vector-image
class. The vector-image defines a set
of image layers. Most of the image processing
operations which apply to an image also apply
to the vector-image, such as :zoom, :histogram
and :warp. The programmer can proceed without
considering the existence of multiple layers
in many cases. Some special cases of the class
vector-image are color-image and complex-image.
Many of the operations for these more
specific classes are inherited from the image
component class or inherited from the general
vector-image as a natural consequence of the
object-oriented language.

Figure 4: ImagCalc Windowing

Figure 6: ImagCalc Edge Detection

Figure 7: Object Scene Modeling

Figure 9: Rendered Object Model

Figure 10: The portion of the object hierarchy for images. The power of inheritance is quite clear in
the case of vector images, where an RGB color image is considered as a special case of a set of image
layers. The complex image produced by the FFT is another subclass of the vector image.

Macros  CME also makes extensive use of the
Lisp macro facility to provide an efficient and
extensible programming environment. Especially
powerful is CME's use of iteration macros, so
that functions which operate on all the pixels of
an image or the vertices of a polyhedron can be
written without the programmer coding the
iteration directly. For instance, the macro
image-point-operator allows one to map the same n-ary
function at each point of an image set. This
function is optimized for maximum speed using
techniques with which the average programmer
would not be familiar. For example, a restricted
version of the negate-image function, designed
to operate only on images containing 8-bit pixels,
could have been defined as

(defun negate-image (image)
  ;; Map each 8-bit pixel x to 255 - x.
  (image-point-operator ((image))
    (lambda (x)
      `(- 255 ,x))))

using  image-point-operator,  while  a  simpli¬ 
fied  version  of  a  function  to  add  two  8-bit  images 
would  be 

(defun add-images (image-a image-b)
  ;; Halve the sum with a one-bit right shift so the result stays within 8 bits.
  (image-point-operator ((image-a image-b))
    (lambda (a b)
      `(ash (+ ,a ,b) -1))))
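For readers less familiar with Lisp macros, the same pointwise pattern can be sketched in Python. This is an analogy only: `image_point_operator` here is an ordinary runtime function written for this example, whereas the CME macro expands into optimized iteration code at compile time. Images are assumed to be equal-sized lists of rows of 8-bit integers.

```python
def image_point_operator(func, *images):
    """Apply an n-ary function at every pixel position of n same-sized images."""
    shapes = {(len(img), len(img[0])) for img in images}
    if len(shapes) != 1:
        raise ValueError("all images must have the same dimensions")
    return [[func(*pixels) for pixels in zip(*rows)] for rows in zip(*images)]


def negate_image(image):
    """8-bit negative: each pixel x becomes 255 - x."""
    return image_point_operator(lambda x: 255 - x, image)


def add_images(image_a, image_b):
    """Average of two 8-bit images; the right shift by one halves the sum."""
    return image_point_operator(lambda a, b: (a + b) >> 1, image_a, image_b)


print(negate_image([[0, 255], [10, 100]]))
print(add_images([[200, 1]], [[100, 2]]))
```

The point of the pattern is the same in both languages: the per-pixel expression is written once, and the traversal of the image is supplied by the shared operator.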

Another useful macro is with-objects-desensitized.
Normally all three-dimensional objects in the CME
system are mouse sensitive, meaning that they respond
to the mouse by changing color and changing
the meanings of mouse clicks. If a programmer
desired to temporarily defeat that feature, for
instance to allow the mouse to become sensitive
to some graphical overlay created independently
of the three-dimensional object system, he could
use the macro in code such as the following:

(defun frob-my-graphics (graphic-element)
  (with-objects-desensitized
    (sensitize graphic-element)
    (frob-my-graphics-internal graphic-element)  ; Do whatever
    (desensitize graphic-element)))

Process History  Another interesting design
feature of CME is the cached function history
mechanism. The relational network is set up
with data types as nodes and functions as edges.
For example, suppose the function :threshold
is applied to image I1, which produces the
1-bit image I2. The resulting network is
[I1 <- {:threshold} <- I2]. The history network
provides two important services.

•  If a data item already exists in the history
network it is not necessary to recompute
it. For example, suppose an image is
bi-linearly interpolated to produce a 2x higher
resolution image. Then later the operator
decides to apply a 2x zoom-out reduction.
Instead of applying another bi-linear mapping
to the high-resolution image, CME
determines that the desired result is already in
the history and simply returns the previous
image. This mechanism improves the
efficiency of interaction for operations which
require a significant amount of computation.

•  Another application of the history mechanism
is to record the sequence of operations
and the parameter settings which are
involved in computing a final result. For
example, constructing an image segmentation
requires on the order of a dozen operations,
including various convolutions and
thresholding operations. It may be the case that
the operator has produced the result by
applying a series of menu operations. The
history mechanism keeps track of the
operations and the data items involved and
produces a corresponding Lisp expression.
The expression is in a form which can be
easily edited to produce a general function
for repeating the complex process. The
result is a combination of the best features
of a graphical user interface and an
interpreted language interface.

8  The Rehost of CME

While the Symbolics Lisp Machine provides an
effective environment for rapid prototyping and
carrying out experiments in image exploitation,
it is clear that in the future most image
understanding (IU) laboratories will be equipped
with Unix workstations. It is also the case that
many IU labs are currently developing their
programs in the C or C++ language. Consequently,
CME is currently being rehosted to run in
Common Lisp on Unix platforms, and an interface is
being developed to provide C and C++ access
to CME functions.

The rehost process has raised many interesting
issues, both in language interfaces and in the
problem of achieving high-performance graphics
and image processing functions in an X Window,
Unix environment.

9  Conclusion

In summary, the RCDE is an exciting experiment
in software engineering which should provide
a new approach to the integration of image
understanding algorithms.
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Abstract 

Although over twenty years of basic and applied
research work is available for understanding
images in the visible spectrum, computational
vision algorithms developed for application to
visual imaging are usually not directly applicable
to radar imagery because of the fundamental
differences between the physics of synthetic
aperture radar (SAR) image formation and that of
conventional images at visible wavelengths. Yet
many of the paradigms that have evolved in
image understanding (IU) and Computer Vision
(CV) research are suitable for radar imagery.

In this paper we summarize some of the
existing IU and CV approaches to understanding
of SAR images and discuss possible future
research directions. Specific attention is given to
problems such as segmentation, terrain
estimation, change detection and target detection.

1  Introduction 

Algorithms developed for application to visual imagery
are usually not directly applicable to radar imagery.
However, many of the approaches that have evolved in IU
and CV research are suitable for radar imagery.
Translation of paradigms and methodologies into algorithms
for a specific application requires a model, either
explicitly in a formal derivation or implicitly in the
experiences of the algorithm designer. Models used in IU and
CV are inappropriate for SAR imaging because of the
fundamental differences between the physics of SAR
image formation and that of conventional images at
visible wavelengths. The purpose of this paper is to help
bridge the gap between IU and SAR researchers by
giving several examples of the applicability of some of the
existing IU methods for SAR imagery. Earlier attempts
at transferring IU methodology for the interpretation of
SAR imagery are documented in [63].

We begin with a discussion of different types of SAR
data that are available. Two models for explaining the
SAR data are then presented. The first model uses the
physics of the SAR imaging and processing system to
characterize the statistics of speckle. Then the observed
complex, multilook intensity, multifrequency and
multipolarimetric SAR data are modeled using appropriate
probability distributions. The second model relates the
surface elevation, the nature of the surface, the viewing
angle and the observed intensity. Such a computational
vision model is useful when one is interested in
estimating terrain elevation from the given SAR intensity data
by using an algorithm modeled after existing shape from
shading algorithms.

Several examples illustrating the applicability of IU
paradigms are then given. Specifically, we discuss
preprocessing, segmentation of single-look single-frequency
and multi-frequency complex data, terrain estimation,
and object recognition. Several possible future research
efforts are also outlined.

2  Background  Information 

2.1  Physics  Based  Models  for  SAR  Complex 
Data 

In  this  section  the  basics  of  the  SAR  system,  configura¬ 
tion,  imaging  process,  and  digital  post-processing  oper¬ 
ations  are  presented.  More  details  are  also  available  in 
numerous  valuable  references  and  textbooks  [17,  70,  73, 
76,  75,  74].  In  addition,  the  types  and  specifics  of  SAR 
data  generated  by  current  SAR  systems  and  processors 
are  presented  to  help  the  reader  follow  the  adaptation 
of  the  segmentation  technique  to  different  types  of  SAR 
data. 

2.2  SAR  imaging  and  processing  systems 

SAR imaging. Figure 1 shows the typical imaging
configuration of a strip map SAR instrument carried aboard
an aircraft flying at an altitude H, along a straight line
called the along-track or azimuth direction, at a constant
ground speed v, and imaging a surface strip S.

The cross-track direction of imaging is referred to as
the slant-range direction. The antenna beam is pointed
to the side of the aircraft, making an angle (the look angle)
with the vertical direction z. The local incidence angle
is the angle between the local slant-range direction
r and the local vertical direction z. The ground-range
axis corresponds to the projection of the line of
sight of the radar (slant range) onto the ground.

Imaging in the range direction results from accurate
time-delay measurements of echoes received from many
successive coded pulses. The achievable slant-range
resolution $R_r$ is determined by the smallest measurable time
difference between the received echoes from two closely
spaced targets on the same range line. If the coded pulse
is a chirp (as is often the case), the slant-range resolution
of the instrument is

$$R_r = \frac{c}{2 B_c} \qquad (1)$$

where $B_c$ is the chirp bandwidth ($B_c = 1/\tau$ where $\tau$ is
the pulse duration), and $c$ is the speed of light. The
amplitude and phase of the returned electromagnetic signal
are measured for each resolution element in slant range.

Figure 1: Imaging configuration of a strip map SAR instrument.

As  the  individual  targets  “move”  through  the  an¬ 
tenna  footprint,  the  phase  of  the  corresponding  echoes  is 
Doppler  shifted  due  to  the  variation  in  the  traveling  dis¬ 
tance  of  the  signal.  In  a  real  aperture  radar,  the  phase 
information  is  not  utilized  and  azimuth  resolution  is  lim¬ 
ited  by  the  size  of  the  antenna  footprint.  In  a  synthetic 
aperture  radar,  fine  resolution  imaging  in  the  azimuth 
direction  is  achieved  by  recording  the  Doppler  phase  his¬ 
tory  in  azimuth  of  each  target  so  that  the  sampled  echoes 
can  then  be  added  on  a  coherent  basis.  The  process  is 
equivalent  to  the  use  of  an  array  of  antennas  or  to  a 
synthetic larger antenna. Spatial resolution in azimuth,
defined by the size of the synthetic antenna footprint, is

$$R_a = \frac{l}{2} \qquad (2)$$

where $l$ is the physical length of the antenna along the
azimuth. Azimuth resolution does not depend on the
altitude of the flying craft. As range resolution is also
independent of the altitude, SAR instruments permit high
resolution  imaging  from  space  with  antennas  reasonably 
small  in  size.  In  addition,  in  the  microwave  region,  as 
the  atmosphere  of  the  planet  is  nearly  transparent,  SAR 
imaging  is  possible  in  all  weather  conditions.  Imaging  is 
also  independent  of  sun  illumination  as  the  SAR  provides 
its  own  source  of  illumination  of  the  surface. 
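A short numerical check makes the altitude independence concrete. The sketch below uses the standard strip-map expressions, slant-range resolution c / (2 B) for chirp bandwidth B and azimuth resolution l / 2 for antenna length l. The 20 MHz bandwidth and 10 m antenna are illustrative values chosen for the example, not parameters of any system discussed in the paper.

```python
SPEED_OF_LIGHT = 299_792_458.0  # metres per second


def slant_range_resolution(chirp_bandwidth_hz):
    """Slant-range resolution in metres for a given chirp bandwidth."""
    return SPEED_OF_LIGHT / (2.0 * chirp_bandwidth_hz)


def azimuth_resolution(antenna_length_m):
    """Strip-map azimuth resolution in metres for a given physical antenna length."""
    return antenna_length_m / 2.0


print(round(slant_range_resolution(20e6), 2), azimuth_resolution(10.0))
```

Neither function takes the platform altitude as an argument, which is the property that allows the same resolution from an aircraft or from orbit.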

SAR  Processing.  Processing  of  the  SAR  raw  data 
consists  of  range  compression  where  the  returned  echoes 
are  cross-correlated  with  a  replica  of  the  transmitted 
pulses,  and  azimuth  compression  which  is  a  similar  corre¬ 
lation  process,  but  range  dependent,  performed  in  the  az¬ 
imuth  direction.  Correlation  is  traditionally  performed 
in the frequency domain via Fourier transforms.
Alternate forms of SAR processing are also possible [17].
Different scanning modes for the SAR (burst mode,
spotlight  mode,  scan  mode)  require  different  processing 
strategies. 
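Range compression by correlation with a replica can be illustrated on a toy signal. Two simplifications are made here and should be read as assumptions: a 7-element Barker code stands in for the chirp, and the correlation is computed directly in the time domain rather than via Fourier transforms as is traditional. The echo is noise-free and contains a single target.

```python
def cross_correlate(echo, replica):
    """Correlation of the echo with the replica at every lag where the replica fits."""
    m = len(replica)
    return [
        sum(echo[lag + k] * replica[k] for k in range(m))
        for lag in range(len(echo) - m + 1)
    ]


replica = [1, 1, 1, -1, -1, 1, -1]  # Barker-7 code, used in place of a chirp
delay = 5                            # target delay in samples
echo = [0] * delay + replica + [0] * 8

compressed = cross_correlate(echo, replica)
peak = max(range(len(compressed)), key=lambda i: compressed[i])
print(peak, compressed[peak])
```

The long transmitted pulse collapses to a single sharp peak at the target delay, with all other lags at magnitude one or zero, which is the effect that range compression relies on.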

Digital SAR processing can be computationally
involved. The data must be corrected for the altitude
and trajectory of the craft carrying the instrument, the
plane rotation, the antenna pattern illumination,
possible ambiguities due to sampling of the signal, and
errors in focusing (compensation of the signal for the
Doppler shift during generation of the synthetic array).
Numerous filtering techniques are used to control
computer storage and arithmetic complexity, and enhance
the image quality. In the case of multifrequency and
polarimetric radar systems, data transmission, reception,
handling, processing, calibration, and storage are also
much more demanding.

SAR  post-processing  operations.  A  number  of 
post-processing  operations  are  performed  to  facilitate 
the  manipulation  and  use  of  the  data  for  scientific  pur¬ 
poses.  The  most  important  one,  often  already  included 
in  the  SAR  processor  itself,  is  multilooking.  Multilook¬ 
ing  is  an  artificial  way  of  reducing  image  speckle  in  SAR 
imagery  and  improving  the  radiometric  resolution  of 
the  data  by  incoherently  averaging  multiple  independent 
SAR  data  samples  of  the  same  scene.  Independent  looks 
are  generated  from  a  single  data  take  by  processing  dif¬ 
ferent  parts  of  the  available  SAR  signal  bandwidth  [61]. 
Each part of the spectrum produces an image, and the
resulting  images  are  incoherently  added  to  produce  one 
multilook  image.  Image  speckle  and  the  data  volume 
are  reduced,  two  important  advantages  for  large  scale 
and  high  data  rate  analysis  of  SAR  data  with  simple  al¬ 
gorithms.  The  drawbacks  are  a  reduction  in  spatial  res¬ 
olution  by  a  factor  proportional  to  the  number  of  looks, 
loss  of  absolute  phase  information,  and  degradation  of 
the  analytical  tractability  of  the  statistics  of  the  data. 
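
The effect of multilooking on speckle can be sketched numerically. The example below assumes a uniform scene whose single look intensities are independent and exponentially distributed with unit mean, and it averages neighboring samples rather than splitting the signal bandwidth; both choices are simplifications for illustration. The function names are hypothetical.

```python
import random
import statistics

def multilook(single_look, n_looks):
    """Incoherently average consecutive groups of n_looks intensity samples."""
    usable = len(single_look) - len(single_look) % n_looks
    return [sum(single_look[i:i + n_looks]) / n_looks
            for i in range(0, usable, n_looks)]

def mean_square_to_variance(samples):
    """Ratio <I>^2 / var(I); close to N for N-look speckle over a uniform scene."""
    return statistics.fmean(samples) ** 2 / statistics.pvariance(samples)

random.seed(1)
# Assumed single look intensities: exponential with unit mean.
single = [random.expovariate(1.0) for _ in range(40000)]
four = multilook(single, 4)

ratio_1 = mean_square_to_variance(single)  # close to 1 for single look data
ratio_4 = mean_square_to_variance(four)    # close to 4 after 4-look averaging
```

The data volume drops by the number of looks, which mirrors the trade-off described above: less speckle and fewer samples, at the cost of spatial resolution.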

Because  of  its  viewing  geometry,  a  SAR  collects  data 
in  the  slant-range  direction.  To  obtain  information 
in  the  ground-range  direction  (better  suited  for  multi¬ 
sensor  coregistration  and  synergistic  utilization  of  the 
data  [48,  66])  the  image  must  be  corrected  from  a  non¬ 
linear  stretch  in  the  range  dimension  and  a  skew  in  the 
azimuth  direction  [18,  19].  This  transformation  is  ac¬ 
companied  by  a  resampling  of  the  data  which  modifies 
the spatial statistics, an important aspect to consider in
the development of automated techniques of analysis of
SAR  data. 

SAR  data  are  also  radiometrically  rectified  and  cali¬ 
brated.  Rectification  accounts  for  the  antenna  pattern  of 
the  illumination  and  errors  in  antenna  pointing.  Calibra¬ 
tion  assures  the  short  term  relative  (within  image),  long 
term  relative  (from  one  image  to  another),  and  absolute 
(independent  of  the  instrument)  radiometric  fidelities  of 
the  radar  measurements.  In  the  case  of  polarimetric 
data,  leakage  between  the  different  channels  must  also 
be  accounted  for  [28,  79]. 

SAR  noise  sources  and  distortions.  Thermal 
noise  from  the  receiver  electronics  is  not  the  only  source 
of  noise  in  SAR.  Quantization  noise,  bit  error  rate  noise, 
ambiguity  noise  and  side-lobe  noise  are  additional  noise 
sources  in  the  signal.  A  convenient  and  common  way 
of  characterizing  the  total  system  noise  power  level  is  to 
determine  its  noise  equivalent  backscatter  cross-section 
($NE\sigma_0$). Typically, the $NE\sigma_0$ is -40 dB in airborne
SAR and -20 dB in spaceborne SAR, whereas the
backscatter cross-section of natural targets lies in the
range +5 dB to -35 dB.

In  most  conventional  passive  sensors,  the  dominant 
noise  source  is  system  noise.  In  a  coherent  imaging  sen¬ 
sor  such  as  SAR,  the  dominant  source  of  signal  fluctu¬ 
ations  is  image  speckle.  Signal  fluctuations  due  to  sys¬ 
tem  noise  are  much  lower  in  amplitude  except  in  the 
case  where  the  backscatter  cross-section  of  the  target 
reaches the $NE\sigma_0$ floor. Image speckle results from the
constructive and destructive interference of multiple returns
from individual scatterers within the same resolution
cell  [33].  In  digital  data,  image  speckle  acts  as  a  multi¬ 
plicative  noise  on  the  radar  signal.  In  single  look  SAR 
data,  the  mean  square  to  variance  ratio  of  the  intensity 
of  the  radar  signal  is  1,  i.e.  the  inherent  signal-to-noise 
ratio  is  0  dB. 

Yet,  there  is  a  certain  danger  in  referring  to  image 
speckle  as  a  noise  source,  although  this  is  widely  done  in 
the  SAR  literature.  In  terms  of  its  physics,  image  speckle 
is  not  noise  but  information.  It  is  in  fact  the  principal 
source  of  information  used  in  SAR  interferometry  [32], 
i.e.  SAR  interferometry  would  be  impossible  if  image 
speckle were systematically removed from the data. Depending
on the application, one must always carefully
examine  whether  image  speckle  should  be  treated  as  per¬ 
turbation  or  information. 

An  additional  difficulty  in  the  interpretability  of  SAR 
imagery  is  the  geometric  distortions  resulting  from  the 
viewing  geometry  of  radars.  In  the  presence  of  ter¬ 
rain,  effects  such  as  layover,  shortening  and  overshad¬ 
owing  [51]  distort  the  geometry  and  complicate  the  in¬ 
ference  of  terrain  height  from  the  data  [26,  27]. 

2.3  Types  and  specifics  of  SAR  data 

Using quadrature filters, the voltage measured at the receiving
antenna of a SAR is decomposed into both a magnitude
$|a|$ and a phase $\phi$ of the returned electromagnetic
signal. The corresponding measurement is represented
as a complex number called the complex amplitude of
the signal

$$a = |a| \exp(j\phi) \qquad (3)$$

and the square magnitude of this complex amplitude is
called the intensity (or power) of the radar return

$$I = |a|^2 \qquad (4)$$

In  addition,  a  SAR  instrument  can  operate  (sometimes 
simultaneously)  at  several  frequencies  and  polarization 
configurations  of  the  antenna,  and  provide  multitempo¬ 
ral  imaging  of  the  same  scene.  Multilooking  may  be  per¬ 
formed  on  these  multidimensional  measurements.  These 
different  imaging  modes  generate  different  types  of  data 
described  next. 

Single  frequency,  single  polarization  SAR  data 
can be of two types: 1) SAR complex data; 2) multilook
SAR  intensity  data. 

SAR  single  look  complex  data.  Except  for  the 
case  of  SAR  interferometric  studies,  no  coherent  spa¬ 
tial  averaging  is  performed  on  these  data  as  it  does  not 
improve  radiometric  resolution  and  degrades  spatial  res¬ 
olution.  SAR  complex  data  therefore  correspond  to  the 
simplest  form  of  processed  information  from  the  radar 
and  also  to  the  best  spatial  resolution  available  from 
the  SAR  imaging  and  processing  systems.  On  the  other 
hand,  image  speckle  dominates  the  statistics  of  the  data. 
At  the  pixel  level,  it  is  characterized  by  a  low  correlation 
coefficient  (typically  less  than  two  pixels),  and  a  vari¬ 
ance  to  mean  square  ratio  of  1.  A  data  display  of  the 
magnitude  of  the  amplitude  appears  noisy  with  speckle 
patterns,  and  the  absolute  phase  of  the  signal  is  nearly 
uniformly  distributed  and  difficult  to  analyze. 

Multilook (N-look) intensity data. After detection
of the complex signal, a single look intensity image
of  the  measurements  is  obtained.  Single  look  intensity 
images  are  not  adapted  for  visual  interpretation  of  the 
SAR  data  because  of  image  speckle.  Instead,  multilook 
intensity  is  the  most  common  SAR  data  format  used 
by  the  research  community  as  it  is  practical  for  visual 
interpretation  and  analysis  of  large  areas  with  simple  al¬ 
gorithms. 

Polarimetric  SAR  complex  data.  The  polariza¬ 
tion  state  of  an  electromagnetic  wave  describes  the  rela¬ 
tive  motion  of  the  vector  representing  the  electrical  field 
when  the  wave  moves  towards  an  observer.  Conventional 
radars  use  the  same  antenna  polarization  configuration 
for  both  transmission  and  reception  of  the  electromag¬ 
netic  signal  and  operate  at  a  fixed  polarization  state. 
In  recent  years,  fully  polarimetric  radars  have  been  de¬ 
veloped  that  decompose  the  signal  into  two  orthogonal 
signals,  one  horizontal  and  the  other  vertical  (Figure  2), 
that are received and processed independently in separate
channels. By combining several horizontal/vertical
polarization  configurations  at  both  the  transmission  and 
reception  ends  of  the  imaging  system  the  complete  po¬ 
larimetric  complex  scattering  matrix  of  each  resolution 
element of the image is measured, i.e.

$$S = \begin{pmatrix} S_{HH} & S_{HV} \\ S_{VH} & S_{VV} \end{pmatrix} \qquad (5)$$

where $S_{XY}$ is the complex amplitude of the signal in
the $Y$ polarization when the transmitted electromagnetic
wave is $X$ polarized, and $X, Y \in \{H, V\}$.

Figure 2: Definition of horizontal (H) and vertical (V)
in a polarimetric image.

The recorded complex scattering matrix of each element can
in turn be used to synthesize any type of polarization
configuration $L$ of the radar antenna using

$$S_L = b_{r,L}^T \, S \, b_{t,L} \qquad (6)$$

where $T$ denotes transposition, and $b_{t,L}$ and $b_{r,L}$ are the
wave orientation matrices of, respectively, the transmitted
wave and the received wave for configuration $L$ [87].
To generate an $L = HH$ complex amplitude,

$$b_{t,L} = b_{r,L} = \begin{pmatrix} 1 & 0 \end{pmatrix}^T$$

For  a  circularly  right  polarized  return. 

Single  look  polarimetric  complex  data.  The  scat¬ 
tering  matrix  of  each  sample  element  of  the  image  is 
stored  in  a  compressed  format  [78].  These  data  are  re¬ 
quired  to  perform  the  accurate  radiometric  multichan¬ 
nel  calibration  of  the  data,  and  correspond  to  the  best 
spatial  resolution  available  from  the  SAR  imaging  and 
processing  systems. 

Multilook  polarimetric  SAR  complex  data. 
Multilook  averaging  of  polarimetric  SAR  data  is  per¬ 
formed  as  follows: 

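$$\langle S_{XY} S_{XY}^* \rangle = \sum_{i=1}^{N} S_{XY,i} \, S_{XY,i}^*$$

$$\langle S_{XX} S_{XY}^* \rangle = \sum_{i=1}^{N} S_{XX,i} \, S_{XY,i}^* \qquad (9)$$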

i.e. it corresponds to the spatial averaging of the scattering
matrices. The data are still complex amplitudes
but the absolute phase information is lost, and only the
relative phase difference between the different polarimetric
channels is preserved. The effect on the intensity
($\langle |S_{XX}|^2 \rangle$) is identical to a multilooking operation
performed in the time domain. The cross product coefficients
($\langle S_{XX} S_{XY}^* \rangle$) are unchanged, i.e. the mean phase
difference  is  preserved  as  well  as  the  correlation  coeffi¬ 
cient  between  the  different  channels.  On  the  other  hand, 
their  variance  is  reduced  by  a  factor  proportional  to  the 
number  of  looks.  Multilook  polarimetric  SAR  complex 
data  are  stored  in  a  standardized  compressed  format 
which  permits  one  to  retrieve  the  multilook  amplitude 
and  the  multilook  phase  difference  of  the  complex  scat¬ 
tering  coefficients  [22]. 
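
A small numerical sketch shows why the relative phase survives multilook averaging while the absolute phase does not. It assumes two perfectly correlated channels, labeled HH and VV here purely for illustration, with a fixed phase offset of 0.7 rad and an amplitude ratio of 0.5, and it normalizes the sums by the number of looks N. All of these choices, and the function name, are assumptions.

```python
import cmath
import random

def multilook_products(ch1, ch2):
    """Average |ch1|^2 and ch1 * conj(ch2) over the N single look samples."""
    n = len(ch1)
    power = sum(abs(a) ** 2 for a in ch1) / n
    cross = sum(a * b.conjugate() for a, b in zip(ch1, ch2)) / n
    return power, cross

rng = random.Random(3)
# Each single look sample carries a random absolute phase.
hh = [complex(rng.gauss(0, 1), rng.gauss(0, 1)) for _ in range(16)]
# Second channel: half the amplitude, lagging the first by 0.7 rad.
vv = [0.5 * s * cmath.exp(-0.7j) for s in hh]

power_hh, cross_hhvv = multilook_products(hh, vv)
phase_difference = cmath.phase(cross_hhvv)  # recovers the 0.7 rad offset
```

The averaged cross product retains the inter-channel phase difference even though the absolute phase of every individual sample is random.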

Multifrequency  SAR  data.  The  frequency  of  the 
radar  electromagnetic  wave  is  a  key  factor  in  the  pene¬ 
tration  depth  of  the  signal.  As  the  wavelength  increases, 
the  signal  penetrates  deeper  into  the  ground,  and  im¬ 
ages  different  layers  of  the  surface  and  near  subsurface 
which  may  be  of  different  physical  structure,  dielectric 
constant,  or  surface  roughness.  Frequency  is  also  a  key 
factor  in  scattering  from  a  rough  surface.  In  the  Bragg 
model, the backscatter cross-section of a rough surface
is  proportional  to  the  roughness  spectrum  of  the  surface 
and  the  fourth  power  of  frequency.  Numerous  examples 
have  already  shown  [76]  that  multifrequency  informa¬ 
tion  helps  separate  different  types  of  natural  surfaces 
and  characterize  their  physical  and  electrical  properties 
better  than  single  frequency  data. 

The  electromagnetic  radar  signal  is  also  characterized 
by  its  frequency  which,  by  convention,  is  designated  by  a 
letter of the alphabet. For instance, P-band ranges from
225 to 390 MHz, L-band from 390 to 1550 MHz, C-band
from 3900 to 6200 MHz, and X-band from 6200 to 10900
MHz. Several airborne SAR instruments are currently
capable  of  near  simultaneous  acquisition  of  SAR  data  at 
several  frequencies.  The  SIR-C  experiment  will  be  the 
first  of  its  kind  to  operate  a  multifrequency  SAR  from 
space,  and  the  upcoming  EOS  SAR  platform  is  planned 
to  operate  at  L-,  C-  and  X-bands. 

2.4  Model  of  the  speckled  complex  amplitudes 

Joint  distribution  of  complex  amplitudes.  Let  us 
assume  that  speckle  is  fully  developed  as  in  [33]  ,  i.e.: 

• The phase $\phi$ and the magnitude $|a|$ of the signal
are  independent  random  variables. 

•  The  contribution  of  each  element  to  the  total  re¬ 
turn  is  independent  of  the  other  scattering  elements 
contained  in  the  resolution  cell. 

• The phase $\phi$ of the signal is uniformly distributed.

•  The  number  of  scattering  elements  per  resolution 
cell  is  much  larger  than  one. 

Under  these  assumptions,  the  complex  amplitude  of  the 
signal  has  circular  complex  Gaussian  statistics 

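$$p(a) = \frac{1}{\pi} \exp\left\{ -\frac{|a|^2}{\langle I \rangle} - \log \langle I \rangle \right\} \qquad (10)$$

where $\langle I \rangle$ is the local mean intensity of the signal.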
The  complex  amplitudes  are  zero  mean. 
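
The fully developed speckle assumptions listed above can be checked with a short simulation. The sketch below sums the returns of a fixed number of equal-strength scatterers with independent, uniformly distributed phases in each resolution cell. The number of scatterers (30), the number of cells, and the function name are arbitrary choices for the example.

```python
import cmath
import math
import random
import statistics

def speckle_sample(n_scatterers, rng):
    """Coherent sum of equal-strength scatterers with independent uniform phases."""
    amp = 1 / math.sqrt(n_scatterers)
    return sum(amp * cmath.exp(1j * rng.uniform(0, 2 * math.pi))
               for _ in range(n_scatterers))

rng = random.Random(7)
cells = [speckle_sample(30, rng) for _ in range(10000)]

mean_amplitude = sum(cells) / len(cells)          # close to 0 (zero-mean amplitudes)
intensity = [abs(a) ** 2 for a in cells]
mean_intensity = statistics.fmean(intensity)      # close to 1 by construction
# Variance to mean square ratio of the intensity, close to 1 for single look data.
contrast = statistics.pvariance(intensity) / mean_intensity ** 2
```

With only a few scatterers per cell, the contrast falls noticeably below 1, which is one way to see that the model breaks down when speckle is not fully developed.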

To include contextual information from neighboring
pixels and improve the segmentation process, the
joint distribution of $M$ complex amplitudes $a_s =
[a_1, \ldots, a_M]^T$ in a neighborhood $N_s$ of pixel site $s$ is
examined instead of the distribution of a single complex
amplitude at pixel site $s$. The segmentation technique
can then exploit the additional information provided by
the neighbors of $s$.

The complex image array is viewed as a set of $P \times Q$
(where $P$ and $Q$ are the numbers of pixels in the vertical
and horizontal directions) conditionally independent
overlapping windows of $M$ elements each, centered at
each pixel site $s$, such that each pixel within that window
has the same local mean intensity level $\langle I \rangle$. The
joint distribution of the $M$ complex amplitudes is

$$p(a_s) = \frac{1}{\pi^M} \exp\left\{ -a_s^{*T} R_{a_s}^{-1} a_s - \ln |R_{a_s}| \right\} \qquad (11)$$

$R_{a_s}$ is the $M \times M$ complex correlation matrix of the $M$
zero-mean complex amplitudes $a_s$ at site $s$ (i.e., $R_{a_s} = \langle a_s a_s^{*T} \rangle$,
where $\langle \cdot \rangle$ denotes ensemble averaging), and $|R_{a_s}|$
denotes the determinant.
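
For a window of M = 2 pixels the joint distribution can be evaluated in closed form, which makes a convenient sanity check. The sketch below assumes the usual $1/\pi^M$ normalization of the circular complex Gaussian and a Hermitian correlation matrix with entries r11, r12, conj(r12), r22. The function name is hypothetical.

```python
import math

def log_joint_pdf_pair(a1, a2, r11, r12, r22):
    """Log of the joint circular Gaussian density for M = 2 complex amplitudes.

    The correlation matrix is [[r11, r12], [conj(r12), r22]] (Hermitian).
    """
    det = r11 * r22 - abs(r12) ** 2
    # Hermitian form a^{*T} R^{-1} a, using the closed-form 2 x 2 inverse.
    quad = (r22 * abs(a1) ** 2 + r11 * abs(a2) ** 2
            - 2 * (a1.conjugate() * r12 * a2).real) / det
    return -2 * math.log(math.pi) - quad - math.log(det)

a1, a2 = 1 + 1j, 0.5 - 2j
mean_intensity = 2.0

# Uncorrelated neighbors: the joint density factors into two single-pixel terms.
log_p_uncorrelated = log_joint_pdf_pair(a1, a2, mean_intensity, 0.0, mean_intensity)
expected = (-2 * math.log(math.pi)
            - (abs(a1) ** 2 + abs(a2) ** 2) / mean_intensity
            - 2 * math.log(mean_intensity))

# Correlated neighbors (r12 = 1.0 is an arbitrary illustrative value).
log_p_correlated = log_joint_pdf_pair(a1, a2, mean_intensity, 1.0, mean_intensity)
```

When the off-diagonal term vanishes, the result reduces to the sum of two single-pixel log densities, as expected for independent amplitudes.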

This  model  is  in  principle  not  applicable  when  speckle 
is  no  longer  fully  developed.  Such  is  the  case  in  areas 
where a single scatterer dominates, the number of scatterers
per resolution cell is small, or the scale of the surface roughness
is no longer much smaller than the wavelength of
the  electromagnetic  signal. 

This  model  also  assumes  that  the  backscatter  cross- 
section  of  the  imaged  surface  is  uniform.  When  this  is 
not  the  case,  (11)  must  be  convolved  with  the  distribu¬ 
tion  of  the  backscatter  cross-section  of  the  surface.  Be¬ 
cause  of  the  presence  of  image  speckle,  this  distribution  is 
not  directly  observable  and  is  difficult  to  infer.  A  gamma 
distribution  of  the  backscatter  cross-section  results  in  a 
K-distribution  of  the  speckled  complex  amplitudes  and 
has  been  shown  in  several  applications  to  better  describe 
the  statistics  of  radar  measurements  [40].  The  variabil¬ 
ity  of  the  backscatter  cross-section  of  the  surface,  called 
texture, is quantified using a single parameter $\alpha$ which
measures  the  increase  in  the  variance  of  the  signal  com¬ 
pared  to  that  due  to  image  speckle  alone.  On  the  other 
hand,  texture  has  no  effect  on  the  correlation  properties 
of  the  complex  data. 

Correlation  matrix  of  the  complex  amplitudes. 
Let  us  consider  the  SAR  imaging  and  processing  system 
as  a  coherent  system  which  transforms  a  complex  ampli¬ 
tude  input  into  a  complex  amplitude  output. 

Earlier  studies  [2,  55,  85]  showed  that  the  coherent 
system  can  be  modeled  as  a  linear  shift  invariant  system 
characterized  by  its  two-dimensional  impulse  response 
function  h  provided  that  the  across-track  dimension  is 
slant-range  (as  opposed  to  the  projected  ground-range 
direction  for  which  h  would  be  shift  variant)  and  the 
along-track  direction  is  azimuth  (Figure  1).  As  a  conse¬ 
quence,  for  homogeneous  regions, 

$$R_{a_s} = \langle I \rangle_s \, R_h, \qquad (12)$$

where $R_h$ is the correlation matrix of the system coherent
impulse response $h$, and $\langle I \rangle_s$ the ensemble average
of the intensity of the complex data at pixel site
$s$. Hence, the modeling of $R_{a_s}$ is directly related to the
SAR system coherent impulse response function.


Ideal response from a point target. We assume
for the moment that SAR processing is comprised of
two independent operations: range compression and azimuth
compression. In digital imagery, as pixel elements
are manipulated instead of resolution elements, the components
of the correlation matrix of the ideal coherent
system impulse response are rewritten in terms of pixel
spacing as

$$R_h(m, n) = \mathrm{sinc}\left( m \frac{s_r}{R_r} \right) \mathrm{sinc}\left( n \frac{s_a}{R_a} \right), \qquad (13)$$

where, respectively, $m$ and $n$ are slant-range and azimuth
pixel indexes, $s_r$ and $R_r$ are the sample spacing and spatial
resolution in slant-range, and $s_a$ and $R_a$ are the same
quantities in azimuth. Note that spatial resolution here
is not equivalent to the conventional 3 dB width $r_r$ of
the impulse response; rather,

$$r_r = 0.8845 \, R_r \qquad (14)$$

The  correlation  matrix  Rh  given  in  (13)  has  several  re¬ 
markable  properties  that  have  been  pointed  out  indepen¬ 
dently  by  various  authors  [55,  59,  64]. 

• $R_h$ is also equal to the correlation function of speckle,
as  the  bandwidth  of  speckle  is  determined  by  the 
width  of  the  synthetic  aperture. 

•  Rh  is  not  affected  by  the  presence  of  texture,  i.e. 
the  spatial  coherence  of  the  imaged  surface. 

•  Rh  is  insensitive  to  focussing  errors. 

In  general,  the  system  response  of  a  sensing  instru¬ 
ment  is  difficult  to  estimate  or  characterize.  SAR  com¬ 
plex  data  provide  a  unique  opportunity  where  the  sys¬ 
tem  coherent  impulse  response  function  can  actually  be 
accurately  characterized,  measured  and  modeled. 
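
The ideal correlation matrix of (13) is straightforward to tabulate for a small window. The sketch below assumes the normalized sinc convention, sin(pi x)/(pi x), and a hypothetical system sampled at half the resolution cell in both directions. The function names and parameter values are illustrative only.

```python
import math

def sinc(x):
    """Normalized sinc, sin(pi x) / (pi x); this convention is an assumption."""
    return 1.0 if x == 0 else math.sin(math.pi * x) / (math.pi * x)

def rh(m, n, s_r, res_r, s_a, res_a):
    """Ideal coherent correlation between pixels m range and n azimuth samples apart."""
    return sinc(m * s_r / res_r) * sinc(n * s_a / res_a)

def window_correlation(offsets, s_r, res_r, s_a, res_a):
    """M x M matrix R_h for pixels located at the given (range, azimuth) offsets."""
    return [[rh(mi - mj, ni - nj, s_r, res_r, s_a, res_a)
             for (mj, nj) in offsets] for (mi, ni) in offsets]

# Hypothetical sampling: pixel spacing equal to half the resolution.
params = dict(s_r=1.0, res_r=2.0, s_a=1.0, res_a=2.0)
offsets = [(0, 0), (1, 0), (0, 1), (1, 1)]
R_h = window_correlation(offsets, **params)
adjacent = rh(1, 0, **params)   # sinc(0.5) = 2 / pi
two_apart = rh(2, 0, **params)  # sinc(1), zero up to rounding
```

With this sampling, adjacent pixels are strongly correlated and pixels two samples apart are uncorrelated, which is why small neighborhoods capture most of the speckle correlation.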

2.5  Model  of  the  multifrequency  speckled 
complex  amplitudes 

Speckle is assumed to be fully developed as in Section 2.4.
The complex amplitude of the signal is circular
Gaussian and the joint probability distribution
function of a set of $M$ neighboring complex amplitudes
$a_1 = [a_{1,1}, \ldots, a_{1,M}]^T$ at a frequency $f_1$ is

$$p(a_1) = \frac{1}{\pi^M} \exp\left\{ -a_1^{*T} R_1^{-1} a_1 - \ln |R_1| \right\} \qquad (15)$$

where $R_1$ is the correlation matrix of the complex amplitudes
at frequency $f_1$.

If the same scene is imaged at a different frequency
$f_2$, the resulting complex amplitudes are mathematically
independent. The reason is that the radar signal
is  sensitive  to  scatterers  of  a  different  nature  (different 
roughness  scale,  dielectric  constant,  etc.)  and  that  con¬ 
sequently  image  speckle  is  uncorrelated.  The  complex 
amplitudes  are  therefore  uncorrelated,  and  since  they  are 
circular  Gaussian  they  are  also  independent.  The  joint 
distribution  of  M  complex  amplitudes  at  two  frequencies 
is  therefore  the  product  of  the  joint  distributions  at  each 
frequency, 

$$p(a_1, a_2) = \frac{1}{\pi^{2M}} \exp\left\{ -a_1^{*T} R_1^{-1} a_1 - \ln |R_1| - a_2^{*T} R_2^{-1} a_2 - \ln |R_2| \right\} \qquad (16)$$

The correlation matrices $R_{i,l}$ are related to the correlation
matrices of the system coherent impulse responses
$R_{h_i}$ as in (12),

$$R_{i,l} = \langle I_{i,l} \rangle \, R_{h_i} \qquad (17)$$

where $\langle I_{i,l} \rangle$ is the mean intensity of the signal at frequency
$f_i$ in region $l$. As spatial resolution is frequency
dependent, $R_{h_i}$ depends on frequency. Most current multifrequency
SAR imaging and processing systems are designed
so that spatial resolution is nearly constant across
frequency to facilitate the coregistration and multidimensional
analysis of the multifrequency measurements.

$R_{h_i}$ has the same properties as those encountered in Section 2.4
and is estimated using the same technique.

2.6  Model  of  multilook  SAR  intensity  data 

The  marginal  distribution  of  multilook  SAR  intensity 
data  is  not  accurately  known.  When  the  samples  used 
to  generate  the  multilook  data  are  independent  (which 
may  not  be  the  case),  and  the  backscatter  coefficient  of 
the  surface  is  locally  uniform,  the  multilook  intensities 
are gamma distributed, i.e.

$$p(I) = \frac{N^N I^{N-1}}{\Gamma(N) \, \langle I \rangle^N} \exp\left\{ -\frac{N I}{\langle I \rangle} \right\} \qquad (18)$$

where $\langle I \rangle$ is the local mean intensity of the signal
and $N$ is the equivalent number of looks, here equal to
the number of independent single look samples. If the
backscatter coefficient of the surface is not uniform but
gamma distributed with parameter $\alpha$, the intensity is
K-distributed [40].

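$$p(I) = \frac{2\alpha}{\Gamma(\alpha) \, \langle I \rangle} \left( \frac{\alpha I}{\langle I \rangle} \right)^{(\alpha - 1)/2} K_{\alpha - 1}\left( 2 \sqrt{\alpha I / \langle I \rangle} \right) \qquad (19)$$

where $K_{\alpha - 1}$ is the modified Bessel function of the third
kind of order $\alpha - 1$.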

If  the  samples  are  correlated,  both  (18)  and  (19)  are 
not  mathematically  correct.  The  exact  form  of  the  distri¬ 
bution  of  the  multilook  SAR  intensities,  discussed  in  [33], 
is  complicated.  A  computationally  convenient  alterna¬ 
tive  is  to  adapt  the  equivalent  number  of  looks  N  in 
(18)  or  the  parameter  a  in  (19)  to  correctly  model  the 
larger  than  expected  variance  of  the  signal.  Although 
K-distributions  seem  to  better  describe  the  statistics  of 
SAR  data  in  several  applications,  the  rather  empirical 
gamma  distributions  have  been  shown  to  provide  a  rea¬ 
sonable  model  of  the  statistics  of  SAR  data. 
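
In practice the equivalent number of looks is often adapted to the data by matching moments. The sketch below assumes the standard gamma density for N-look intensity with mean $\langle I \rangle$ and estimates N as the mean square to variance ratio over a region assumed homogeneous. The function names and the sample values are hypothetical.

```python
import math
import statistics

def gamma_log_pdf(intensity, mean_intensity, n_looks):
    """Log density of N-look intensity under the gamma model (standard form assumed)."""
    return (n_looks * math.log(n_looks)
            + (n_looks - 1) * math.log(intensity)
            - math.lgamma(n_looks)
            - n_looks * math.log(mean_intensity)
            - n_looks * intensity / mean_intensity)

def equivalent_number_of_looks(samples):
    """Method-of-moments estimate: mean^2 / variance over a homogeneous region."""
    return statistics.fmean(samples) ** 2 / statistics.pvariance(samples)

# Toy intensities from a region assumed homogeneous (values are made up).
region = [1.0, 2.0, 3.0, 4.0, 5.0]
enl = equivalent_number_of_looks(region)   # mean 3, variance 2, so 4.5

# For N = 1 the gamma model reduces to the exponential density of single look data.
single_look_log_p = gamma_log_pdf(0.7, 2.0, 1)
```

An estimate smaller than the nominal number of looks indicates correlated looks or texture, which is the situation the adapted N is meant to absorb.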

The  correlation  properties  of  SAR  intensity  data  pro¬ 
duced  by  a  partially  coherent  system  which  transforms  a 
complex  amplitude  input  into  an  intensity  output  have 
been  studied  by  several  authors  [55,  59,  64].  As  fo¬ 
cussing,  texture,  and  multilooking  have  a  significant  im¬ 
pact  on  the  shape  of  the  system  incoherent  impulse  re¬ 
sponse,  the  results  are  different  from  those  obtained  in 
the  case  of  a  completely  coherent  system. 

If the scene has stationary statistics and the system is
perfectly focussed, the correlation function of single look
intensity data ($R_I(r) = \langle I(u) I(u + r) \rangle$) is given by
the Siegert relation

$$R_I(r) = \langle I \rangle^2 + \langle I \rangle^2 |R_h(r)|^2$$


where $r = (m, n)$ is the displacement vector. If texture
is present, the correlation function becomes [59]

$$R_I(r) = \langle I \rangle^2 + \langle I \rangle^2 |R_h(r)|^2 + \iint \left[ |h(u)|^2 |h(v)|^2 + h(u) \, h(v) \, h^*(u + r) \, h^*(v - r) \right] K_{\langle I \rangle}(u - v + r) \, du \, dv \qquad (20)$$

where $K_{\langle I \rangle}$, the covariance function of the local
mean  intensity,  measures  the  spatial  variability  of  the 
backscatter  cross-section  of  the  surface.  The  additional 
component  is  composed  of  an  incoherent  term  that  corre¬ 
sponds  to  the  incoherent  imaging  of  the  scene  backscat¬ 
ter  fluctuations  and  of  a  coherent  term  similar  to  the  cor¬ 
relation  function  of  the  coherent  impulse  response.  The 
system  incoherent  impulse  response  function  is  affected 
by  focusing  [64].  If  the  SAR  processor  is  out  of  focus, 
the  system  incoherent  impulse  is  broadened,  resulting  in 
a  blurred  intensity  image,  and  (20)  is  no  longer  correct. 

Assuming  that  the  covariance  matrix  of  the  intensity 
data  Ri_/i)  is  known,  the  joint  distribution  of  M  neigh¬ 
boring  single  look  intensities  is 

p(7i , . . . ,  Im/L,  =  /)  =  |RJ_1  I  exp{-  ^  S.1  U 

'  '  I 

>>» 

where  7o  is  the  modified  zero  order  Bessel  function  of  the 
first  kind,  and  S  =  («•;)(<, ;)€n  =  ^7-{i) 
case  of  M  =  2). 

When  the  number  of  looks  is  larger  than  one,  the  cor¬ 
relation  function  of  the  intensity  is  modulated  by  the 
multilooking  filters  used  to  select  the  independent  sam¬ 
ples.  When  multilooking  is  performed  in  the  time  do¬ 
main  using  neighboring  samples  in  azimuth,  the  system 
incoherent response is nearly a sinc function in azimuth.
In the general case, an exact analytical expression of
the correlation function of multilook intensities is complicated [55].
There  is  little  need  for  exact  modeling  of  the  system 
incoherent  impulse  response  function  in  multilook  SAR 
intensity  data  as  the  spatial  correlation  of  the  intensity 
that  is  due  to  speckle  becomes  rapidly  negligible  as  the 
number  of  looks  increases.  Spatial  correlation  of  SAR 
intensity  data  is  therefore  mostly  due  to  the  spatial  cor¬ 
relation  of  the  backscatter  cross-section  of  the  imaged 
surface.  Even  in  the  presence  of  highly  non-homogeneous 
features  the  uncorrelatedness  assumption  is  adequate. 

The  adoption  of  a  simple  model  for  the  correlation 
function of multilook SAR intensities is also justified by
several  other  considerations: 

•  In  the  context  of  coregistration  and  synergistic  anal¬ 
ysis  of  multisensor  data,  remapping  of  the  multilook 
SAR  intensity  data  onto  a  common  ground  range 
projection  is  a  requirement  [48,  66],  and  the  pro¬ 
cess  will  be  implemented  on  a  number  of  future 
operational  SAR  processors.  As  this  operation  is 
highly non-linear and results in a shift-variant impulse
response function, the analytical modeling of
the correlation function of multilook intensities becomes
difficult and may eventually lead to overly
complicated  segmentation  algorithms. 

•  Even  in  the  case  of  SAR  single  look  complex  data, 
where  neighboring  complex  samples  are  strongly 
correlated,  it  has  been  shown  that  the  uncorrelated 
model  may  yield  the  same  level  of  segmentation  ac¬ 
curacy  as  the  correlated  model  [65]. 

If  the  SAR  complex  data  are  uncorrelated,  single  look 
SAR  intensity  data  are  independent  and  so  are  the  mul¬ 
tilook  SAR  intensity  data.  If  the  SAR  complex  data  are 
correlated, in the limit of an infinite number of looks,
the  SAR  intensity  data  are  uncorrelated  and  Gaussian 
distributed  (central  limit  theorem),  and  therefore  inde¬ 
pendent.  In  the  case  of  a  finite  number  of  looks,  although 
it  is  not  mathematically  correct  for  gamma  distributions, 
one  can  assume  that  the  uncorrelated  multilook  intensity 
data  are  independent  as  well.  The  joint  distribution  of 
the  multilook  intensities  is  approximated  by  the  prod¬ 
uct  of  the  marginal  distributions.  In  the  case  of  Gamma 
marginal  distributions,  the  joint  distribution  of  M  inten¬ 
sities  within  a  neighborhood  of  pixel  site  s  is  of  the 
form 


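$$p(I_1, \ldots, I_M \mid L_s = l) = \exp\{ -M \, U_l(I_1, \ldots, I_M) \} \qquad (22)$$

where the energy function $U_l$ is

$$U_l(I_1, \ldots, I_M) = \frac{1}{M} \sum_{i=1}^{M} \left[ \frac{N I_i}{\langle I_l \rangle} - (N - 1) \log I_i \right] + N \log \langle I_l \rangle + \log \Gamma(N) - N \log N \qquad (23)$$

Each region is characterized by a mean intensity value
$\langle I_l \rangle$. The entire intensity data array is characterized by
its equivalent number of looks $N$.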
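
The energy function lends itself to a simple maximum likelihood labeling of a window. The sketch below assumes that the energy is the negative log of the product of gamma marginals divided by M, and that the class mean intensities are known in advance. The function names, the class means, and the window values are hypothetical.

```python
import math

def energy(window, class_mean, n_looks):
    """Negative log-likelihood per pixel of a window under the gamma model."""
    m = len(window)
    data_term = sum(n_looks * i / class_mean - (n_looks - 1) * math.log(i)
                    for i in window) / m
    return (data_term + n_looks * math.log(class_mean)
            + math.lgamma(n_looks) - n_looks * math.log(n_looks))

def most_likely_class(window, class_means, n_looks):
    """Index of the class whose mean intensity minimizes the energy."""
    return min(range(len(class_means)),
               key=lambda l: energy(window, class_means[l], n_looks))

# Hypothetical 4-look intensities in a window and three candidate class means.
window = [3.6, 4.4, 4.1, 3.9]
class_means = [1.0, 4.0, 16.0]
label = most_likely_class(window, class_means, n_looks=4)  # class with mean 4.0
```

Because the window pixels are treated as independent, the energy depends on the data only through the average intensity and the average log intensity, which keeps the per-pixel cost low.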

2.7  Model  of  polarimetric  SAR  complex  data 

To  be  able  to  completely  characterize  the  polarization 
properties  of  the  backscattered  signals  it  is  necessary 
to  decompose  the  wave  into  two  orthogonal  polarized 
components  and  measure  the  returned  like  and  cross- 
polarized  signals.  In  practice,  radars  have  been  designed 
to  be  able  to  acquire  almost  simultaneously  the  HH,  HV, 
VV  and  VH  returns  of  the  transmitted  wave,  where  the 
first  letter  represents  the  polarization  of  the  signal  at 
the  transmission  and  the  second  letter  at  the  reception. 
With  the  help  of  these  four  quantities  any  type  of  polar¬ 
ization  configuration  can  be  simulated  (elliptic,  circular, 
. . . )  [87,  80]  and  the  polarimetric  properties  of  the  tar¬ 
gets  are  completely  characterized. 

$X_s$ denotes the polarimetric measurement vector at
site $s$, i.e. the vector of the complex measurements (i.e.
amplitude and phase of the electromagnetic response) at
site $s$,

$$X_s = [HH, HV, VV]_s^T \qquad (24)$$

The cross-polarized return $HV$ is the complex amplitude
of the V-polarized response given that the transmitted
signal is H-polarized. The $HH$ and $VV$ amplitudes are
the co-polarized terms. The $VH$ response is not present
in (24) as it is symmetrized with the $HV$ response during
compression and calibration of the data based on
the reciprocity principle. Since the polarimetric complex
amplitudes are circular Gaussian, the conditional
distribution of $X_s$ is [43]

$$p(X_s) = \frac{1}{\pi^3 |C|} \exp\{ -X_s^{*T} C^{-1} X_s \}. \qquad (25)$$

The $3 \times 3$ complex matrix $C = \langle X_s X_s^{*T} \rangle$ is the polarimetric
covariance matrix of the data.

Evaluation of (25) at each pixel location involves prohibitive
calculations of the Hermitian form ($X_s^{*T} C^{-1} X_s$).
In  the  case  of  azimuthally  symmetric  targets  (valid  for 
a  large  variety  of  natural  targets)  the  HV  amplitude  is 
uncorrelated  with  the  HH  and  the  VV  complex  ampli¬ 
tudes  and  the  covariance  matrix  is  simply 

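$$C = \sigma \begin{pmatrix} 1 & 0 & \rho \sqrt{\gamma} \\ 0 & \epsilon & 0 \\ \rho^* \sqrt{\gamma} & 0 & \gamma \end{pmatrix} \qquad (26)$$

where

$$\sigma = \langle |HH|^2 \rangle, \quad \epsilon = \frac{\langle |HV|^2 \rangle}{\langle |HH|^2 \rangle}, \quad \gamma = \frac{\langle |VV|^2 \rangle}{\langle |HH|^2 \rangle}, \quad \rho = \frac{\langle HH \, VV^* \rangle}{\sqrt{\langle |HH|^2 \rangle \langle |VV|^2 \rangle}} \qquad (27)$$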


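$C$ can be inverted analytically, leading to

$$p(X_s) = \frac{1}{\pi^3} \exp\left\{ -\frac{|HV|^2}{\sigma \epsilon} - \frac{1}{\sigma (1 - |\rho|^2)} \left[ |HH|^2 + \frac{|VV|^2}{\gamma} - \frac{2 |\rho| \, |HH| \, |VV|}{\sqrt{\gamma}} \cos(\phi_{HHVV} - \phi_\rho) \right] - \log[\sigma^3 \epsilon \gamma (1 - |\rho|^2)] \right\} \qquad (28)$$

where $\phi_{HHVV} = \phi_{HH} - \phi_{VV}$, and $\phi_{HH}$, $\phi_{VV}$, and $\phi_\rho$
are the phases of $HH$, $VV$, and $\rho$, respectively. The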
evaluation  of  (28)  now  requires  a  much  smaller  number 
of  operations. 
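
The saving from the analytical inversion can be illustrated by evaluating the Hermitian form both ways. The sketch below assumes the azimuthally symmetric covariance, with HV uncorrelated with HH and VV and parameters sigma, epsilon, gamma and rho as defined above, and compares a generic linear solve with the closed-form expression. The helper names and the numerical values are made up for the example.

```python
import cmath
import math

def covariance(sigma, eps, gamma, rho):
    """3 x 3 covariance for azimuthally symmetric targets, ordered [HH, HV, VV]."""
    g = math.sqrt(gamma)
    return [[sigma, 0, sigma * rho * g],
            [0, sigma * eps, 0],
            [sigma * rho.conjugate() * g, 0, sigma * gamma]]

def solve(A, b):
    """Gaussian elimination without pivoting (adequate for a positive definite A)."""
    n = len(b)
    aug = [[complex(v) for v in row] + [complex(b[i])] for i, row in enumerate(A)]
    for i in range(n):
        for r in range(i + 1, n):
            f = aug[r][i] / aug[i][i]
            for c in range(i, n + 1):
                aug[r][c] -= f * aug[i][c]
    x = [0j] * n
    for i in reversed(range(n)):
        s = aug[i][n] - sum(aug[i][c] * x[c] for c in range(i + 1, n))
        x[i] = s / aug[i][i]
    return x

def quad_direct(X, C):
    """Hermitian form X^{*T} C^{-1} X through a generic linear solve."""
    y = solve(C, X)
    return sum(xi.conjugate() * yi for xi, yi in zip(X, y)).real

def quad_analytic(X, sigma, eps, gamma, rho):
    """Same Hermitian form, using the closed-form inverse of the symmetric case."""
    hh, hv, vv = X
    phase = cmath.phase(hh) - cmath.phase(vv) - cmath.phase(rho)
    co = (abs(hh) ** 2 + abs(vv) ** 2 / gamma
          - 2 * abs(rho) * abs(hh) * abs(vv) * math.cos(phase) / math.sqrt(gamma))
    return abs(hv) ** 2 / (sigma * eps) + co / (sigma * (1 - abs(rho) ** 2))

X = [1 + 2j, 0.3 - 0.1j, -0.5 + 1j]
sigma, eps, gamma, rho = 2.0, 0.1, 0.8, 0.4 + 0.3j
direct = quad_direct(X, covariance(sigma, eps, gamma, rho))
analytic = quad_analytic(X, sigma, eps, gamma, rho)
```

Both routes give the same value, but the analytic form needs only a handful of multiplications per pixel and no matrix inversion.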

Several classification techniques have already been
published. Some are based on the selection of simple
features to classify the pixels (to reduce the
dimensionality of the data); others use the fully
polarimetric information. A review of these techniques and
a comparison of the results is given in [53]. An optimal
Bayesian classifier using the fully polarimetric information
was derived in [43]. Refinements of the method have appeared
in [53, 86]. The work in [80] significantly improved the
earlier classification results in [43] by adaptively and
iteratively estimating the a priori probabilities needed for
the implementation of the discriminant function.
Improvements were also obtained [86] by normalizing the
vector X by the total received power,
i.e.,

$$X_n = X / [\,|HH|^2 + |HV|^2 + |VV|^2\,]$$

Figure 3: SAR Imaging Geometry. Panel (a): 3-D View.


2.8  Computational  vision  model 

SAR  Image  Coordinate  System 

An  image  can  be  thought  of  as  the  projection  of  a  3-D 
scene  into  a  2-D  representation.  For  an  image  created  us¬ 
ing  conventional  optics  the  spatial  transformation  from 
3-D  to  2-D  is  given  by  the  perspective  projection.  The 
coordinate  system  for  a  SAR  strip  map  is  very  different, 
as  depicted  in  Figure  3.  For  simplicity,  consider  the  case 
of  a  straight  flight  path  with  the  antenna  main  beam  or¬ 
thogonal  to  the  flight  path  (zero  squint  angle).  Assume 
that  the  x-axis  is  lined  up  with  the  r-axis.  Then  x  may 
be  referred  to  as  ground  range  and  is  related  to  slant 
range  r  by 

$$r^2 = x^2 + (h - z)^2 \tag{29}$$

where  h  is  the  radar  altitude  and  the  origin  is  located  on 
the ground directly below the radar. The second SAR
strip map axis, by design, represents the along-track
distance y. Thus, SAR image coordinates are a projection
of the physical scene, $(x, y, z(x,y))$, into the “slant
plane”, $(r, y)$. One result is foreshortening of the
mountainsides that slope upward as r increases.
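A small numerical sketch of (29) may help make the foreshortening effect concrete. The function names and the sample geometry (altitude, ranges, heights) are illustrative assumptions only.

```python
import math


def slant_range(x, z, h):
    """Slant range r from (29): r^2 = x^2 + (h - z)^2."""
    return math.hypot(x, h - z)


def ground_range(r, z, h):
    """Invert (29) for the ground range x of a point at height z."""
    return math.sqrt(r * r - (h - z) ** 2)


# Two points 100 m apart in ground range, radar altitude 5000 m.
h = 5000.0
flat = slant_range(10100.0, 0.0, h) - slant_range(10000.0, 0.0, h)
# Same ground separation, but the far point is 100 m higher (slope facing the radar).
slope = slant_range(10100.0, 100.0, h) - slant_range(10000.0, 0.0, h)
```

Here `slope` comes out smaller than `flat`: the rising surface occupies fewer slant-range samples than level ground of the same extent, which is the foreshortening described above.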

Clearly  the  transformation  from  Cartesian  coordi¬ 
nates,  (x,  y,  z),  to  SAR  image  coordinates  (r,  y)  is  not 
an  orthographic  projection.  However,  it  is  possible  to 
approximate  the  SAR  coordinates  as  an  orthographic 
projection  of  a  rotated  version  of  the  surface.  Suppose 
that  r  is  large  relative  to  the  image  size  so  that  arcs  of 
constant  range  are  approximately  straight  lines  over  the 
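depression angle subtended by the image. Then we obtain
$u(r, y)$, the surface height relative to the slant plane
defined by $(r, y)$, as

$$\begin{pmatrix} r \\ y \\ u \end{pmatrix} = \begin{pmatrix} \cos\theta & 0 & -\sin\theta \\ 0 & 1 & 0 \\ \sin\theta & 0 & \cos\theta \end{pmatrix} \begin{pmatrix} x \\ y \\ z \end{pmatrix} + \begin{pmatrix} r_0 \\ y_0 \\ u_0 \end{pmatrix} \tag{30}$$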


where $(r_0, y_0, u_0)$ is an arbitrary reference point. Here
the transformation from slant plane surface coordinates
to SAR image coordinates is indeed an orthographic
projection. In the absence of layover, $z(x,y)$ being single
valued implies that $u(r, y)$ is single valued. To evaluate
the reflectance map for surface reconstruction or image
synthesis it is necessary to evaluate the angle of incidence
$\alpha_i$ or its cosine. When expressed in $(r, y, u)$ coordinates


$$\cos\alpha_i = \frac{u_r}{\sqrt{u_r^2 + u_y^2 + 1}} \tag{31}$$

where

$$u_r = \frac{\partial u}{\partial r} = \frac{\partial u_1}{\partial r} + \tan\theta \tag{32}$$

$$u_y = \frac{\partial u}{\partial y} = \frac{\partial u_1}{\partial y} \tag{33}$$

are the partial derivatives of u, and

$$u_1(r,y) = u(r,y) - (r - r_0)\tan\theta \tag{34}$$

is the deramped version of $u(r, y)$. For computer
implementation, $u(r, y)$ is represented in a 2-D array with
constant sample spacing in r and y, as in a SAR strip map.
Standard finite difference approximations are used for the
partial derivatives.
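The finite-difference evaluation can be sketched as follows. This is a minimal illustration, assuming the deramped height $u_1$ is stored as a list of rows indexed by r and then y, and taking $\cos\alpha_i = u_r / \sqrt{u_r^2 + u_y^2 + 1}$ (the form that follows when the illumination travels along the r axis). The function name and the choice of central differences with one-sided differences at the borders are assumptions.

```python
import math


def cos_incidence(u1, dr, dy, theta):
    """Cosine of the incidence angle on a slant-plane grid.

    u1[i][j] is the deramped height at range index i and azimuth index j.
    dr, dy are the sample spacings; theta is the angle used for deramping.
    The grid must have at least two samples along each axis.
    """
    nr, ny = len(u1), len(u1[0])
    ramp = math.tan(theta)
    out = [[0.0] * ny for _ in range(nr)]
    for i in range(nr):
        i0, i1 = max(i - 1, 0), min(i + 1, nr - 1)
        for j in range(ny):
            j0, j1 = max(j - 1, 0), min(j + 1, ny - 1)
            # u_r restores the ramp removed by deramping, as in (32).
            u_r = (u1[i1][j] - u1[i0][j]) / ((i1 - i0) * dr) + ramp
            u_y = (u1[i][j1] - u1[i][j0]) / ((j1 - j0) * dy)
            out[i][j] = u_r / math.sqrt(u_r * u_r + u_y * u_y + 1.0)
    return out
```

For a level surface the deramped height is constant, so every sample returns $\tan\theta / \sqrt{\tan^2\theta + 1} = \sin\theta$, which is a convenient sanity check.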

This slant representation allows relatively simple and
efficient processing techniques to be applied to SAR image
synthesis from terrain elevation data. Efficient surface
reconstruction and image synthesis using the slant
plane representation requires a method for resampling a
digital terrain map (DTM) from ground coordinates to
slant plane coordinates and vice versa. A variation of the
summed area table method reported in the computer graphics
literature [36], which is amenable to high speed
implementation, can be used.


The  Reflectance  Map 

The reflectance map $R(\cdot)$ is the function which relates
the local surface orientation to image intensity. In both
radiometry and photometry, $R(\cdot)$ is a combination of the
surface microstructure, the geometric properties of the
illumination and the geometric properties of image
formation. The reflectance model should be parameterized
to retain the distinction between factors unique to SAR
image formation and factors due to electromagnetic
scattering properties of the surface while explicitly including
the dependence on the surface slopes in both the range
and azimuth directions. For SAR, the reflectance map
$R(\cdot)$ can be expressed as the product of an area factor A
multiplied by the radar cross-section (RCS) $\sigma_0$ per unit
area. Both A and $\sigma_0$ are functions of imaging geometry
and surface orientation, i.e.

$$R(u_r, u_y, \beta) = A(u_r, u_y, \beta)\,\sigma_0(u_r, u_y, \beta). \tag{35}$$

This resulting reflectance map is normally much more
directional than if the same surface were viewed by a
passive imaging system at visible wavelengths.

Two different parameterizations for the area A and RCS
$\sigma_0$ are used in the radar literature. One
parameterization [31] uses the area of a pixel projected onto the
surface times the normalized radar cross section. In this
approach, the area factor is approximated as

A  «  +  (36) 

The second parameterization [15] uses the area
projected onto a plane normal to the illumination. The
illumination area $A_i$ denotes the area of the incident plane
wave that impinges on the projected pixel. For the
rectangular pixel impulse response this gives

$$A_i = \Delta r\,\Delta y\,|u_r| \tag{37}$$


providing  a  very  simple  expression  in  terms  of  the  differ¬ 
ential  geometry  of  the  surface. 

Many models for the variation of $\sigma_0$ as a function of
incidence angle are presented in the literature [4, 6, 54],
based on both empirical studies and modeling of the
various scattering mechanisms that may occur. For
better or worse, radar clutter is often more directional
than implied by earlier work. The generalized Lambert
model $\sigma_0(\alpha_i) = \cos^k(\alpha_i)$, developed as a remedy for this,
has been further generalized based on empirical data.
This model [42] expresses average radar cross section as
a function of angle of incidence by

$$\sigma_0(\alpha_i) = \frac{\cos^k \alpha_i}{\sin^l \alpha_i} \tag{38}$$

It has long been suggested that (38) with k in the range
of 3 to 4 may provide a reasonable model for relatively
smooth surfaces such as asphalt [54], and that $l = 0$, with
$k \in [1,2]$, is appropriate for very rough surfaces. The
other model [5] is derived by a physical optics approach
but evaluated in the zero wavelength limit. These models
define the reflectance function used [4] for extracting
depth information from one or more SAR images. A
typical SAR and the Lambertian reflectance maps are
plotted in Figure 4.
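To show how the pieces combine, the sketch below evaluates a reflectance map as the product of the illumination area (37) and the generalized model (38), with the incidence cosine taken as $u_r / \sqrt{u_r^2 + u_y^2 + 1}$. The function names, the treatment of back-facing slopes as zero return, and the handling of the singular case are assumptions made for illustration.

```python
import math


def sigma0(cos_a, k, l=0.0):
    """Generalized model (38): cos^k(alpha_i) / sin^l(alpha_i).

    Returns 0 for surfaces facing away from the radar (cos_a <= 0).
    """
    if cos_a <= 0.0:
        return 0.0
    if l == 0.0:
        return cos_a ** k  # generalized Lambert model
    sin_a = math.sqrt(max(0.0, 1.0 - cos_a * cos_a))
    if sin_a == 0.0:
        return math.inf  # the model is singular at normal incidence when l > 0
    return cos_a ** k / sin_a ** l


def reflectance(u_r, u_y, dr, dy, k, l=0.0):
    """R = A_i * sigma0, with A_i = dr * dy * |u_r| as in (37)."""
    cos_a = u_r / math.sqrt(u_r * u_r + u_y * u_y + 1.0)
    return dr * dy * abs(u_r) * sigma0(cos_a, k, l)
```

With `l = 0` and `k = 1` this reduces to a Lambertian-like law scaled by the area factor; larger `k` makes the response more directional, which is the behavior the text attributes to radar clutter.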

Albedo of Radar Returns. Cosgriff [15] applied
the  term  albedo  to  radiometry  by  analogy  with  photom¬ 
etry  to  mean  that  portion  of  the  incident  energy  which 
is  reradiated  by  the  surface.  For  convenience,  one  can 
lump  together  all  multiplicative  constants  that  appear 
in  front  of  the  reflectance  map  and  refer  to  that  product 
as  albedo.  Albedo  depends  on  the  electrical  properties 
of  the  surface  materials  and  any  ground  cover.  Variation 
in  the  median  of  measured  radar  albedo  for  typical  clut¬ 
ter  spans  a  wide  dynamic  range  on  the  order  of  60  dB 
where  the  extreme  lows  are  for  water  and  the  extreme 
highs  are  for  cities.  The  issue  of  albedo  variation  is  so 
complex  that  only  very  simple  models  can  be  used. 

Figure 4: A typical SAR image reflectance map and
Lambertian reflectance map.

Observation Noise. The most significant source of
noise for SAR imagery is speckle. Thermal noise from
the radar electronics introduces a noise source with
complex amplitude modeled as white complex Gaussian noise
added to the image. A third source of noise arises from
the voltage accumulated in the pixel sidelobes. The
combined effect is that the noisy intensity can be expressed
as a noise-free image, combined with a bias term
(dependent on thermal and sidelobe noise levels), multiplied by
a noise term

$$I(r,y) = [R(u_r, u_y, \beta) + \sigma_n^2]\, n(r,y) \tag{39}$$

where $\sigma_n^2$ is the combined power of the thermal and
side-lobe noise, and $n(r, y)$ is a unit power random field
distributed as chi-squared with two degrees of freedom.
A simplified model ignoring the effects of noise for
estimation of surface elevation is

$$I(r,y) = R(u_r, u_y, \beta) \tag{40}$$
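The multiplicative noise model described above can be simulated in a few lines. The sketch assumes that "unit power" means unit mean, in which case a chi-squared variable with two degrees of freedom scaled to unit mean is a unit-mean exponential variable. The function name and the symbol `noise_power` for the combined thermal and sidelobe power are illustrative choices.

```python
import random


def noisy_intensity(clean, noise_power, rng):
    """Apply (noise-free image + bias) * n, with n unit-mean exponential.

    clean is a list of rows of noise-free intensities R; noise_power is the
    combined thermal and sidelobe noise power; rng is a random.Random.
    """
    return [[(value + noise_power) * rng.expovariate(1.0) for value in row]
            for row in clean]


rng = random.Random(0)
clean = [[3.0] * 200 for _ in range(200)]
noisy = noisy_intensity(clean, 1.0, rng)
sample_mean = sum(sum(row) for row in noisy) / (200 * 200)
```

The sample mean should sit close to `3.0 + 1.0`, while individual pixels fluctuate with a standard deviation equal to their mean, which is why single-look imagery appears so grainy.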

3  Survey  of  Ongoing  Work 

3.1  Preprocessing 

Kuan  et  al.  [44,  45,  46]  have  developed  smoothing  algo¬ 
rithms  for  reducing  speckle  and  signal  dependent  noise 
in  gray  scale  images.  Their  image  model  is  based  on  the 
physical  process  of  coherent  imaging  and  accounts  for 
spatial  correlation  due  to  speckle,  an  important  charac¬ 
teristic  of  speckle  that  has  long  been  ignored  in  previous 
studies.  Yet,  their  method  ignores  some  of  the  compli¬ 
cations  existing  in  real  SAR  imaging  systems  and  was 
only  tested  on  real  gray  scale  optical  images  corrupted 
by  computer  generated  speckle  noise. 

Detecting  edges  in  SAR  images  has  also  received  some 
attention  in  the  literature.  In  [72],  a  ratio  detector  is 
proposed.  A  modification  to  the  Marr-Hildreth  edge  de¬ 
tector  using  a  ratio  of  averages  is  discussed  in  [11]. 
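For orientation, a generic ratio-of-averages edge measure can be sketched as below. This is an illustrative version only and is not claimed to be the detector of [72] or the modification in [11]; the window shape, the function name, and the convention of returning 1.0 when a window is empty are assumptions.

```python
def ratio_edge_strength(img, i, j, half=2):
    """Ratio of mean intensities on either side of column j, around row i.

    Returns a value in (0, 1]; values near 1 mean no edge, small values
    indicate a vertical edge between columns j-1 and j.
    """
    rows = range(max(i - half, 0), min(i + half + 1, len(img)))
    left = [img[r][c] for r in rows for c in range(max(j - half, 0), j)]
    right = [img[r][c] for r in rows for c in range(j, min(j + half, len(img[0])))]
    if not left or not right:
        return 1.0
    m1 = sum(left) / len(left)
    m2 = sum(right) / len(right)
    if m1 <= 0.0 or m2 <= 0.0:
        return 1.0
    return min(m1 / m2, m2 / m1)
```

Because the measure is a ratio, multiplying the whole image by a constant leaves it unchanged, which suits multiplicative speckle better than a difference of averages would.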

3.2  Segmentation  of  single  look  complex  SAR 
data 

Owing to the modeling of SAR complex data, the
segmentation problem can be formulated as an estimation
problem where the region labels of individual pixels are
inferred from a knowledge of the radar observations and
a priori knowledge of the distribution of the region
labels. Region labels are unobserved pixel attributes that
identify the region to which each pixel belongs. In [65],
a Markov random field [8, 9, 13, 29, 84] has been used
to model the region labels. Modeling of the observations
is based on an understanding of the physics of
the SAR imaging and processing system, and emphasizes
the use of correlative information from neighboring
pixel elements. Using Bayes’ theorem, the conditional
distribution of the radar measurements given the region
labels is combined with the prior distribution of the region
labels to obtain the posterior distribution of the region
labels given the radar measurements. Several possible
optimality criteria [8, 14, 30, 57] that define an optimal
region labelling process are then considered and
implemented on an optimization network [37, 56]. Some
examples of segmentation results are given in Figure 5.
Figure 5(a) is the original single look intensity image of
sea ice (although complex data was used for processing,
intensity images are displayed) acquired in the Beaufort
Sea during the Alaska campaign of March 1988 by the
multifrequency polarimetric NASA-JPL airborne SAR
instrument.

Figure  5(a)  shows  the  logarithm  of  the  magnitude  of 
the  complex-amplitude  array  at  L-band  frequency  (24- 
cm  wavelength)  with  a  horizontal  linear  polarization  of 
the  transmitted  and  received  radar  signal.  A  direct  dis¬ 
play  of  the  intensity  array  could  not  be  used  because  of 
the  large  dynamic  range  of  the  pixel  intensities  and  the 
strong  presence  of  speckle.  The  image  is  1024  x  512  pix¬ 
els  in  size  and  has  3.025-m  pixel  spacing  along  azimuth 
[left  to  right  in  Figure  5(a)]  and  6.662-m  pixel  spacing 
along  slant-range  [top  to  bottom  in  Figure  5(a)].  Spa¬ 
tial  resolution,  estimated  from  the  autocorrelation  coef¬ 
ficient  of  a  100  X  100  pixel  area,  is  14.28m  along  slant 
range  and  4.95m  along  azimuth.  The  mean  intensity 
level of four regions was selected by using an unsupervised
technique [65]. Visual inspection of the SAR image
revealed that these four distinctive regions could be sorted
into four different sea-ice types: (1) sea-ice ridges, visible
as bright and fine lineaments in Fig. 5(a); (2) multiyear
sea ice, a thick ice that survived several summer
melts and has a relatively bright signature; (3) first-year
sea ice, a newly formed ice of varying thicknesses and
intermediate brightness; (4) open water and frazil ice,
which appear dark in the imagery within narrow open
leads. A 3x3 neighborhood window was used during
segmentation to minimize boundary segmentation errors.
Segmentation results using an approximate maximum a
posteriori (MAP) technique [65] are given in Fig. 5(b).
A selection of sample boxes across homogeneous areas
of multiyear sea ice revealed segmentation accuracy of
approximately 97% for the MAP classifiers.
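To illustrate the general idea of approximate MAP labelling with a Markov random field prior, the sketch below uses iterated conditional modes (ICM) with a Potts prior over a 3x3 neighborhood and an exponential likelihood for single-look intensity. This is a stand-in technique chosen for brevity, not the algorithm of [65], which works on complex data and models spatial correlation; the function name, `beta`, and the class means are all hypothetical.

```python
import math


def icm_segment(intensity, class_means, beta=1.0, sweeps=5):
    """Approximate MAP labels by iterated conditional modes.

    Data term: -log of an exponential density with the class mean.
    Prior term: beta times the number of 8-neighbors with a different label.
    """
    nr, nc = len(intensity), len(intensity[0])

    def data_cost(value, mean):
        return value / mean + math.log(mean)

    # Start from the per-pixel maximum likelihood labelling.
    labels = [[min(range(len(class_means)),
                   key=lambda c: data_cost(value, class_means[c]))
               for value in row] for row in intensity]
    for _ in range(sweeps):
        changed = False
        for i in range(nr):
            for j in range(nc):
                best, best_cost = labels[i][j], None
                for c, mean in enumerate(class_means):
                    disagree = 0
                    for di in (-1, 0, 1):
                        for dj in (-1, 0, 1):
                            a, b = i + di, j + dj
                            if (di or dj) and 0 <= a < nr and 0 <= b < nc:
                                disagree += labels[a][b] != c
                    cost = data_cost(intensity[i][j], mean) + beta * disagree
                    if best_cost is None or cost < best_cost:
                        best, best_cost = c, cost
                if best != labels[i][j]:
                    labels[i][j] = best
                    changed = True
        if not changed:
            break
    return labels
```

With `beta = 0` the result is the pixel-wise maximum likelihood labelling; a positive `beta` lets the neighborhood outvote isolated speckle spikes, at the risk of eroding thin structures such as ridges.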

Derin et al. [20] developed a region-segmentation
technique in which the image model consists of two levels: at
the lower level, a first order causal Markov random field
(MRF) describes the speckled complex amplitudes; at
a higher level, a second MRF models region labelling,
i.e. the grouping of pixels into spatially homogeneous
regions. However, speckle is arbitrarily assumed to have
a separable exponential autocorrelation function, which,
as demonstrated in [65], is not appropriate for SAR
complex data.

Figure 5: (a) Original single-look intensity field of SAR
complex data acquired over the Beaufort Sea in Alaska.
In decreasing order of brightness are ice ridges, multiyear
ice, first-year ice, and open water and frazil ice. (b) MAP
segmentation map.

The extension to the multifrequency case is simple.
The complex amplitudes resulting from multiple frequencies
can be modeled as statistically independent, as they
correspond to independent realizations of speckle. Thus if
$f_1$ and $f_2$ are two frequencies of SAR imagery with
amplitudes $a_1$ and $a_2$ and covariance matrices $R_1$ and $R_2$,
then the distribution of SAR complex data corresponding
to the two frequencies can be written as

$$P(a_1, a_2) \propto \exp\{-a_1^{*T} R_1^{-1} a_1 - \ln|R_1| - a_2^{*T} R_2^{-1} a_2 - \ln|R_2|\} \tag{41}$$

where the $R_i$ are modeled as in (12). By using simple
Markov random fields for region models one can obtain
posterior distributions corresponding to multifrequency
data. The results of applying an approximate MAP
segmentation algorithm are given in Figure 6. A single look,
multifrequency log-amplitude image of terrain is shown
in Figure 6(a) and 6(b). The 30 dB bandwidth of the
system coherent impulse response function estimated from a
100 x 100 area is 10.63 meters in range and 4.01 meters
in azimuth at L-band HH, and 10.54 meters in range
and 3.80 meters in azimuth at C-band VV. Based on
the selection of thirteen training areas, the images were
classified into thirteen regions. The MAP segmentation
result is shown in Figure 6(c). The combined classification
accuracy is about 84%. Additional results may be
found in [67].
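Because the frequencies are modeled as independent, their log-likelihoods simply add. The sketch below shows this for the simplest special case, a single pixel whose covariance at each frequency is a scalar power, so that each term is $-|a|^2/s - \ln s$ up to a constant. The function names and the example class powers are hypothetical, and no spatial prior is included.

```python
import math


def log_lik(amplitude, power):
    """Log density of a zero-mean circular Gaussian amplitude, constant dropped."""
    return -abs(amplitude) ** 2 / power - math.log(power)


def joint_log_lik(amplitudes, powers):
    """Independent frequencies: the joint log-likelihood is the sum of the terms."""
    return sum(log_lik(a, p) for a, p in zip(amplitudes, powers))


def classify(amplitudes, class_powers):
    """Pick the class whose per-frequency powers best explain the amplitudes.

    class_powers[c][f] is the expected power of class c at frequency f.
    """
    scores = [joint_log_lik(amplitudes, powers) for powers in class_powers]
    return max(range(len(scores)), key=scores.__getitem__)
```

A class that is ambiguous at one frequency can still be separated if its expected power differs at the other, which is the practical benefit of combining L-band and C-band data.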

3.3  Terrain  estimation 

Topographic  data  can  be  extracted  from  SAR  imagery 
using  at  least  three  kinds  of  approaches,  which  respec¬ 
tively  exploit  radiometric,  geometric,  and  interferomet¬ 
ric  information.  Radiometric  information  is  useful  be¬ 
cause  the  backscattered  power  is  a  function  of  terrain 
surface  slopes.  This  is  the  basis  of  shape  from  shading 
techniques.  Geometric  information  enters  through  the 
relatively  stable  relationship  between  the  scene  geome¬ 
try  and  its  projection  into  a  SAR  image.  This  is  the 
basis  of  stereogrammetry,  using  multiple  images,  and  for 
mensuration  of  discrete  objects  in  a  single  image.  Inter¬ 
ferometry  uses  the  relative  phase  variation  between  the 
complex  amplitudes  of  two  SAR  images  made  from  two 
vertically  separated  antennas  to  obtain  estimates  of  the 
elevation  angle  to  the  terrain  surface. 

It has been suggested that all three approaches are
complementary to some extent [21]. It is well known in
the  computer  vision  community  that  shading  of  smooth 
surfaces  and  stereo  parallax  provide  complementary  in¬ 
formation:  Shading  is  useful  in  areas  with  constant 
albedo  and  fairly  smooth  variations  in  surface  orienta¬ 
tion.  Conversely,  stereo  matching  is  more  effective  in 
areas  with  large  albedo  variations  or  sharp  surface  ori¬ 
entation  changes  that  provide  the  required  intensity  fea¬ 
tures.  For  SAR  imagery,  geometric  and  radiometric  in¬ 
formation  are  not  only  complementary,  they  are  syner¬ 
gistic.  Because  SAR  is  an  active  sensor,  both  geometric 
as  well  as  radiometric  disparities  are  created  when  a  SAR 
stereo  image-pair  is  formed.  This  limits  the  accuracy  of 
radar-stereogrammetry  [50].  The  performance  of  inter¬ 
ferometry  depends  on  the  ability  to  reliably  unwrap  the 
phase  of  noisy  image  data  and  correct  for  a  variety  of 
phase  errors  [52].  The  point  is  that  a  single  informa¬ 
tion  source  usually  does  not  yield  all  of  the  topographic 
information  available  from  SAR  imagery. 

The  use  of  radar  backscatter  power  to  extract  terrain 
surface  orientation,  sometimes  referred  to  as  radarcli- 
nometry  [81],  has  been  considered  before.  Cosgriff,  et 
al.  [15],  while  concentrating  on  clutter  modeling  for  sys¬ 
tem  design  purposes,  discussed  the  possibility  of  radarcli- 
nometry  and  suggested  a  solution  method  using  multiple 
radar  images.  Wildey  [81,  82,  83]  developed  algorithms 
for  reconstructing  surface  topography  from  the  shading 
in  a  single  SAR  image.  Those  algorithms  directly  in¬ 
verted  a  differential  equation  similar  to  (40)  for  SAR 
imagery  subject  to  local  constraints  on  the  relationship 
between $z_x$ and $z_y$. These constraints are similar to,
although more general than, the local sphericity assumption
used by Pentland [60] and also must be inferred from
image  intensity  derivatives. 

Figure 6: Segmentation of multifrequency SAR data. (a)
Original one-look log-amplitude SAR image of Flevoland
at L-band HH. (b) C-band VV. (c) MAP segmentation
result.

Figure 7: Block diagram of SAR SFS approach for
estimation of surface elevation.

Frankot and Chellappa developed a practical
method [27] for estimating the topography of natural
terrain from the radiometric or shading information in
a SAR image. Five essential features are combined in
developing the shape from shading algorithm for SAR
imagery. First, an imperfect fit between the observed
image intensity and the estimated surface is explicitly
accounted for through a mean squared-error term. Second,
a regularization penalty constraint encourages smoothness
in the estimated surface. Third, self consistency of
the estimated surface is enforced through an integrability
constraint. Fourth, the low resolution surface data has
been utilized to replace low frequency surface height
information not available from shading alone and to obtain
estimates of the reflectance model parameters. Finally,
the SAR image coordinates and reflectance models from
Section 2.8 have been incorporated.

Given  the  SAR  image  and  a  companion  digital  terrain 
elevation  model  (DTM)  at  much  lower  resolution,  the 
surface  estimation  procedure  is  as  follows:  The  DTM  is 
registered  to  the  SAR  imagery.  Reflectance  model  pa¬ 
rameters  are  estimated  such  that  the  mean  squared-error 
between  the  SAR  image  and  its  prediction  from  the  low 
resolution  DTM  is  minimized.  The  function  R  is  repre¬ 
sented  by  a  model  with  unknown  parameters,  dependent 
on  both  scene  characteristics  and  sensor  characteristics 
as  discussed  in  Section  2.8.  Given  suitable  estimates 
of those parameters, the SFS algorithm iteratively
estimates the terrain surface slopes $(u_r, u_y)$. At each
iteration, integrability of those surface slopes is enforced
and,  by  the  same  process,  a  surface  height  estimate  is 
constructed. 

This is accomplished using a fast least-squares
procedure that involves a weighted sum of the Fourier
transforms of the estimates for $u_r$ and $u_y$. The Fourier
transform of the DTM is substituted for the low frequency
portion of the surface estimate, which is not adequately
represented by shading information. This difficulty is
fundamental  because  image  intensity  is  a  noisy  function 
of  surface  height  derivatives.  It  is  exacerbated  for  SAR 
imagery  because  of  the  often  highly  directional  nature 
of  R  and  the  presence  of  speckle  noise.  This  approach 
was  tested  using  real  SIR-B  SAR  imagery  and  indepen¬ 
dently  derived  DTMs.  The  resulting  surface  reconstruc¬ 
tions  compared  favorably  with  the  high  resolution  DTM. 
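The integrability step can be sketched as a projection in the Fourier domain: each coefficient of the surface is a weighted sum of the corresponding slope coefficients, and the lowest frequencies are optionally taken from the DTM. The version below is a minimal illustration under several assumptions: unit sample spacing, periodic boundaries, a DTM already resampled to the image grid, a naive $O(N^4)$ DFT written out for self-containment, and hypothetical names (`integrate_slopes`, `cutoff`). It is not the authors' implementation.

```python
import cmath
import math


def dft2(a, inverse=False):
    """Naive 2-D discrete Fourier transform (adequate for small grids)."""
    n, m = len(a), len(a[0])
    sign = 1.0 if inverse else -1.0
    out = [[0j] * m for _ in range(n)]
    for k in range(n):
        for l in range(m):
            s = 0j
            for p in range(n):
                for q in range(m):
                    s += a[p][q] * cmath.exp(sign * 2j * math.pi * (k * p / n + l * q / m))
            out[k][l] = s / (n * m) if inverse else s
    return out


def freq(k, n):
    """Angular frequency (radians per sample) of DFT bin k."""
    return 2.0 * math.pi * (k if k <= n // 2 else k - n) / n


def integrate_slopes(ur, uy, dtm=None, cutoff=0.0):
    """Least-squares integrable surface from slope estimates ur, uy.

    If dtm is given, coefficients with radial frequency <= cutoff
    (including the mean height) are copied from the DTM transform.
    """
    n, m = len(ur), len(ur[0])
    P, Q = dft2(ur), dft2(uy)
    D = dft2(dtm) if dtm is not None else None
    Z = [[0j] * m for _ in range(n)]
    for k in range(n):
        for l in range(m):
            wr, wy = freq(k, n), freq(l, m)
            w2 = wr * wr + wy * wy
            if D is not None and math.sqrt(w2) <= cutoff:
                Z[k][l] = D[k][l]
            elif w2 > 0.0:
                # Weighted sum of the slope transforms.
                Z[k][l] = (-1j * wr * P[k][l] - 1j * wy * Q[k][l]) / w2
    return [[v.real for v in row] for row in dft2(Z, inverse=True)]
```

Without a DTM the mean height is unrecoverable and is set to zero; raising `cutoff` hands more of the slowly varying terrain to the DTM, where shading is least reliable. A production version would use a fast Fourier transform in place of `dft2`.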


This method has been adopted [71] for use in mapping
the Venus terrain from the high resolution SAR image
and the low resolution altimetry data to become available
from the NASA Magellan project. A block diagram
of this method is shown in Figure 7.

Zheng  and  Chellappa  [89]  later  developed  a  method 
for  estimation  of  surface  topography  by  combining  SFS 
and  stereopsis.  In  this  work,  Horn’s  [38]  technique  is  first 
used  to  reconstruct  the  needle  maps  from  opposite-sided 
SAR  stereo  images.  Then  the  needle  maps  are  trans¬ 
formed  to  ground  plane  coordinates  and  a  facet  model 
based  feature  extraction  algorithm  [34]  is  used  to  ex¬ 
tract  topographic  primal  sketches.  Next  stereo  match¬ 
ing  is  done  between  the  feature  points  of  two  images; 
the  disparity  values  and  hence  the  depths  along  the  fea¬ 
ture  points  are  computed.  Subsequently,  an  SOR  tech¬ 
nique  [10]  is  used  to  interpolate  a  smooth  surface  from 
the  sparse  depth  map.  Finally,  the  low  frequency  compo¬ 
nents  of  the  scene  are  extracted  from  the  reconstructed 
surface  and  a  high  resolution  terrain  map  is  obtained  by 
applying  the  Frankot-Chellappa  SFS  algorithm  [27]. 
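The interpolation step can be illustrated with a small successive over-relaxation (SOR) solver that fills a grid from sparse depth samples by driving every free sample toward the average of its neighbors. This is a generic membrane-style sketch and is not claimed to match the formulation in [10]; the function name, the relaxation factor, and the iteration count are assumptions.

```python
def sor_interpolate(known, shape, omega=1.5, iters=500):
    """Interpolate a smooth surface from sparse depths by SOR.

    known maps (row, col) to a fixed depth; shape is (rows, cols).
    Free samples relax toward the mean of their 4-neighbors; omega in
    (0, 2) is the over-relaxation factor. The grid must be larger than 1x1.
    """
    nr, nc = shape
    start = sum(known.values()) / len(known)
    z = [[known.get((i, j), start) for j in range(nc)] for i in range(nr)]
    for _ in range(iters):
        for i in range(nr):
            for j in range(nc):
                if (i, j) in known:
                    continue  # sparse depths from stereo stay fixed
                nbrs = [z[a][b]
                        for a, b in ((i - 1, j), (i + 1, j), (i, j - 1), (i, j + 1))
                        if 0 <= a < nr and 0 <= b < nc]
                avg = sum(nbrs) / len(nbrs)
                z[i][j] += omega * (avg - z[i][j])
    return z
```

With depths fixed along two opposite edges the solver converges to a linear ramp between them; a fixed iteration count is used here for simplicity, where a residual-based stopping test would be more usual.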


The  reflective  scattering  from  SAR  causes  man-made 
objects  to  be  specular  or  mirrorlike  whereas  at  opti¬ 
cal  wavelengths  man-made  objects  predominantly  scat¬ 
ter  diffusely.  The  primary  features  that  are  observed 
in  SAR  signatures  are  point-like  returns  which  predomi¬ 
nantly  arise  from  angled  supports  that  serve  to  direct  en¬ 
ergy  back  towards  the  radar  [88].  The  range  resolution  of 
SAR  is  independent  of  range  to  the  target  because  range 
resolution  is  a  function  of  the  waveform  bandwidth.  Its 
azimuth  resolution  is  also  independent  of  range. 

In  [88],  the  role  of  SAR  sensor  knowledge  in  indexing 
and  pose  estimation  is  illustrated.  The  three  major  steps 
considered  are  feature  extraction,  matching  between  pre¬ 
stored  and  predicted  features,  and  pose  estimation  based 
on  these  correspondences.  The  feature  extraction  stage 
essentially consists of detecting local maxima that are
brighter  than  the  background.  Since  SAR  gives  point¬ 
like  returns  corresponding  to  the  various  scattering  cen¬ 
ters  on  the  object,  the  feature  extraction  step  captures 
much  of  the  information  contained  in  the  SAR  image. 
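A minimal sketch of this kind of feature extraction is given below: a pixel is kept if it exceeds a background threshold and is strictly brighter than all of its neighbors in a 3x3 window. The function name and the use of a single global threshold are assumptions; a practical detector would likely estimate the background locally.

```python
def bright_local_maxima(img, threshold):
    """Return (row, col) of pixels above threshold that are strict 3x3 maxima."""
    nr, nc = len(img), len(img[0])
    peaks = []
    for i in range(nr):
        for j in range(nc):
            value = img[i][j]
            if value <= threshold:
                continue
            neighbors = [img[a][b]
                         for a in range(max(i - 1, 0), min(i + 2, nr))
                         for b in range(max(j - 1, 0), min(j + 2, nc))
                         if (a, b) != (i, j)]
            if all(value > w for w in neighbors):
                peaks.append((i, j))
    return peaks
```

The output is a sparse list of candidate scattering centers, which is the form of input the matching stage operates on.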

A geometric hashing procedure [49] can be used for
establishing a correspondence between the extracted point
patterns and pre-stored patterns. When adapted to
SAR, this technique is equivalent to a 2-D correlation
over a sparse set of translation and rotation values using
binary SAR model templates over a binary SAR image
where the 1's represent the locations of the dominant
scattering centers. This sparse set is the set of all
translations and rotations of the template that align at least
one pair of 1's in the model template and in the binary
data.
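The sparse-correlation view can be sketched for the translation-only case: every pairing of a model point with an image point votes for the shift that aligns them, and the vote count for a shift equals the binary correlation value at that shift. Rotation is omitted here for brevity, and the function name is hypothetical; this is an illustration of the idea rather than the procedure of [49].

```python
from collections import Counter


def best_translation(model_pts, image_pts):
    """Vote over the shifts that align at least one model/image point pair.

    Points are (x, y) integer tuples with no duplicates within a set.
    Returns the winning shift and the number of model points it aligns.
    """
    votes = Counter((ix - mx, iy - my)
                    for (mx, my) in model_pts
                    for (ix, iy) in image_pts)
    shift, count = votes.most_common(1)[0]
    return shift, count
```

Only shifts that align at least one pair are ever considered, so the work grows with the product of the two point counts rather than with the image area. Handling rotation would add an outer loop over a sparse set of candidate angles.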

After correspondence is established, the model
parameters are estimated using a hierarchical extended Kalman
filter [58]. At the top level in the hierarchy, the object
model is represented by its state vector delineating its
3-D location and orientation. At the next level, the
object components are represented as rigid body
transformations referenced to the object. Associated with these
components, the locations of scattering centers and their
persistence over view angle are also stored. Finally, the
transformation from the 3-D world frame to the 2-D
sensor/image plane is represented at the lowest level in the
Kalman filter to predict where and to what accuracy the
model scattering centers are located in the image. This
representation of the model object is a very economical
way to store the model parameter uncertainty during the
estimation process.

The last step in the recognition process consists of
performing a hypothesis test to determine the degree to
which the data matches the constrained model. Since
pose estimation has already occurred, hypothesis testing
occurs in the sensor frame. Extracted feature attributes
are  matched  with  predicted  feature  attributes  using  a 
likelihood  ratio  test  described  in  [58]. 

A  model  based  object  recognition  method  that  com¬ 
bines  a  bottom-up  evidence  accumulation  process  and  a 
top-down  hypothesis  verification  process  has  been  pre¬ 
sented  in  [47]  for  detection  of  a  jet  airplane  in  SAR  im¬ 
ages.  A  SAR  simulator  developed  at  The  Analytic  Sci¬ 
ences  Corporation  was  used  to  generate  SAR  imagery. 
A  relaxation  labeling  algorithm  was  also  developed  for 
the  same  task,  using  both  point-like  and  line-like  fea¬ 
tures  [16]. 

In  related  work  [68],  a  generic  model  based  vision  algo¬ 
rithm  has  been  developed  for  recognizing  shiny  objects 
using  their  specular  features  utilizing  a  solid  model,  a 
SAR  sensor  model,  and  an  image  simulator  for  specular 
features.  All  possible  patterns  of  specular  reflections  are 
pre-compiled  and  aspect  maps  are  generated.  A  coarse- 
to-fine  deformable  template  matching  procedure  is  used 
for  recognition.  Results  using  simulated  SAR  images  are 
presented. 

Related  significant  work  on  target  detection  and  iden¬ 
tification  in  SAR  imagery  can  be  found  in  [24].  Due  to 
lack  of  space  we  are  unable  to  provide  more  details  about 
this  paper.  Issues  about  fusion  of  SAR  and  infrared  sen¬ 
sor  data  are  discussed  in  [12]. 

4  Possible  Research  Directions 

4.1  Terrain  estimation 

Because  vision  is  a  highly  underconstrained  problem,  the 
fusion  of  information  from  different  images  and  from 
different  cues  within  a  single  image  is  important.  For 
example,  it  has  been  suggested  that  SFS  complements 
stereogrammetry  [50].  By  utilizing  radiometric  infor¬ 
mation  one  can  improve  the  reliability,  accuracy,  and 
resolution  of  topography  estimates  available  from  radar- 
stereogrammetry.  For  SAR  imagery,  a  tradeoff  exists 
between two competing effects: the larger the difference
in look angles for the stereo image pair, the less sensitive
the surface reconstruction is to errors in stereo
correspondence; on the other hand, the errors in stereo
correspondence grow as the disparity between the look
angles increases [50]. In many cases the human eye is not
even able to recognize that two SAR images are of the
same terrain if they are made from drastically different
look angles because the change in illumination geometry
causes changes in shading. Hence, the radiometric
information contained in the images is synergistic.

It  is  difficult  to  obtain  reliable  stereo  matches,  espe¬ 
cially  in  the  presence  of  speckle  noise.  In  order  to  obtain 
reliable  matches,  the  resolution  of  the  matches  is  nec¬ 
essarily  low  for  SAR  imagery.  Feature-based  matches 
require  extended  features  and  matching  by  correlation 
requires  large  integration  subareas  [25,  62]  for  reliability. 
The  resolution  of  the  matches  can  be  improved  with¬ 
out  compromising  reliability  by  using  smaller  subareas 
and  then  resolving  ambiguities  with  shading  information. 
The  image  shading  predicted  by  the  stereoscopically  de¬ 
rived  surface  reconstruction  should  approximately  fit  the 
shading  in  the  observed  image.  A  secondary  requirement 
is that the variance of the observed image intensity
exceeds the variance of the stereo-predicted shading
component by at least some threshold, predicted by speckle
characteristics.

Similarly,  the  precision  of  stereo  matches  can  be  im¬ 
proved  by  accounting  for  shading  differences  that  oc¬ 
cur  between  SAR  stereo  image  pairs.  Two  correction 
approaches,  intensity  prediction  and  intensity  compen¬ 
sation,  are  possible.  The  intensity  prediction  method 
extracts  high  frequency  surface  information  using  SFS 
techniques  and  then  forms  a  predicted  image.  Stereo 
matching  of,  say,  the  first  SAR  image  with  its  predic¬ 
tion  from  a  second  SAR  image  allows  the  computation 
of a residual parallax error, used for estimating a
residual surface. The intensity compensation method starts
with  a  stereoscopically  derived  surface  reconstruction 
and  then  predicts  a  local  shading  ratio  between  the  two 
images.  Given  the  surface  reconstruction,  the  aspect  dif¬ 
ference  between  the  two  images,  knowledge  of  the  paral¬ 
lax  errors,  and  the  reflectance  map  it  is  possible  to  com¬ 
pute  the  shading  compensation.  If  additive  noise  terms 
are  low,  the  correction  is  insensitive  to  albedo  variations. 
After  shading  compensation  the  stereo  matching  proce¬ 
dure  is  repeated  with  greater  achievable  accuracy. 

One can also experiment with methods that combine
SFS and stereo methods by explicitly including the surface
height $z(x, y)$ [39, 90] in the SFS equations. Another
method [69] is to generate needle maps for each
image using an appropriate SFS technique. The depth
map  can  then  be  generated  from  these  needle  maps  by  es¬ 
tablishing  the  correspondence  so  that  the  disparity  over 
these  needle  maps  is  minimized.  The  integrability  con¬ 
straint  can  also  be  enforced. 

Radiometric  stereo  can  also  be  extended  to  SAR  im¬ 
agery  and  applied  simultaneously  with  traditional  ge¬ 
ometric  stereo.  In  applying  radiometric  stereo  and  in 
the  above  intensity  prediction  and  correction  methods 
to  SAR  imagery  the  problem  of  surface-height  depen¬ 
dent  registration  errors  arises.  The  local  registrations 
provided  by  the  initial  stereo  correspondences  may  be 
sufficiently  accurate  to  allow  radiometric  stereo,  geomet¬ 
ric  stereo  and  monocular  shading  cues  to  bootstrap  each 
other.  Recent  methods  for  extracting  stereo  depth  maps 
from  visual  images  [3]  can  be  re-evaluated  for  SAR  im¬ 
agery,  to  provide  better  methods  for  utilizing  both  ra¬ 
diometric  and  geometric  information. 

Earlier  work  has  considered  the  estimation  of  surface 
height  when  the  reflectance  properties  of  the  surface  are 


constant.  The  estimation  of  surface  topography  simul¬ 
taneously  with  variations  in  albedo  and  other  surface 
properties,  such  as  roughness,  is  a  much  more  difficult 
problem.  Multi-spectral  Landsat  imagery  has  been  used 
to  segment  albedo  variations  independent  of  surface  to¬ 
pography  [23,  35].  In  a  similar  manner,  one  can  de¬ 
tect  variations  in  albedo  using  multifrequency  radar  im¬ 
agery  [7,  77].  The  recent  analysis  of  multi-polarization 
radar  imagery  [77]  may  provide  insight  into  the  utiliza¬ 
tion  of  phase  data  and  polarization  diversity  for  infer¬ 
ring  surface  structure.  For  example,  it  is  possible  to 
distinguish between surface and volume scattering mechanisms,
and therefore the reflectance map class, using the
phase  differences  between  images  sensed  with  different 
polarizations. 

More  research  is  needed  in  order  to  make  use  of  these 
additional  information  sources.  Models  are  needed  which 
are  general  enough  to  account  for  the  effect  of  sur¬ 
face  roughness,  the  dielectric  properties  of  the  surface, 
polarization  diversity,  and  frequency  diversity,  yet  are 
tractable  enough  to  be  useful  for  image  analysis. 

4.2  Segmentation  of  Multilook  Intensity  and 
Multipolarimetric  SAR  Data 

As  mentioned  in  the  introduction,  it  is  desirable  to  be 
able  to  adapt  SAR  algorithms  to  multilook  intensity  im¬ 
agery.  Kelly  et  al.  [41]  have  developed  an  adaptive  tech¬ 
nique  for  segmentation  of  speckled  images  using  a  hierar¬ 
chical  random  field  model.  However,  the  joint  probabil¬ 
ity  distribution  function  of  a  set  of  M  N-look  intensities 
does  not  have  a  simple  analytical  form.  Even  if  it  were 
possible  to  derive  a  closed  form  expression  of  the  joint 
density  based  on  a  marginal  gamma  distribution  of  the 
intensity,  the  result  would  not  be  relevant.  The  samples 
that  are  used  to  generate  the  N-look  image  are  corre¬ 
lated;  therefore  the  equivalent  number  of  looks  in  terms 
of  reducing  the  variance  of  the  signal  is  not  equal  to 
N  but  slightly  inferior  to  N.  Furthermore,  the  marginal 
distribution of an N-look intensity is not gamma distributed.
By feeding the value of the equivalent number
of looks into the analytical expression of a gamma
distribution,  one  gets  only  an  approximation  to  the  cor¬ 
rect  distribution,  as  noted  in  [33].  To  conclude,  the  com¬ 
putation  of  the  joint  probability  density  function  of  M 
N-look  intensities  appears  to  be  hopelessly  complicated 
if  an  exact  solution  is  desired,  and  quite  unreliable  if  a 
gamma  distribution  assumption  is  used  for  the  marginal 
distributions. 

The  only  valid  results  concern  the  derivation  of  the 
variance  and  of  the  autocorrelation  function  of  the  in¬ 
tensity  field.  The  variance  of  speckle  in  multilook  im¬ 
agery  is  equal  to  the  variance  of  speckle  in  one-look  im¬ 
agery  divided  by  the  equivalent  number  of  looks.  The 
correlation matrix $R_a$ can be modeled based on the
correlation  properties  of  the  complex  amplitudes  used 
to  generate  the  multilook  intensities.  If  we  assume  that 
the  correlation  properties  of  each  of  the  N  intensities  are 
identical  (approximation),  and  there  is  no  overlap  be¬ 
tween  the  look-filters,  the  resulting  correlation  matrix  of 
the sum of these intensities will be essentially the same as
the  correlation  matrix  of  one  intensity  computed  from 


one sub-aperture. A complete and in-depth analysis of
the correlation properties of multilook intensities is also
available in [55].
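The relation between speckle variance and the equivalent number of looks can be checked numerically. The following is a minimal sketch, assuming independent looks with exponentially distributed single-look intensities; the function names and the moment estimator (squared mean over variance in a homogeneous region) are illustrative choices, not taken from the text. With correlated looks, as discussed above, the same estimator would return a value somewhat below N.

```python
import random
import statistics


def multilook_intensity(mean, looks, rng):
    """Average `looks` independent single-look (exponential) intensities."""
    return sum(rng.expovariate(1.0 / mean) for _ in range(looks)) / looks


def equivalent_number_of_looks(samples):
    """Moment estimate over a homogeneous region: ENL = mean^2 / variance."""
    m = statistics.fmean(samples)
    return m * m / statistics.pvariance(samples, mu=m)


rng = random.Random(0)
one_look = [multilook_intensity(2.0, 1, rng) for _ in range(20000)]
four_look = [multilook_intensity(2.0, 4, rng) for _ in range(20000)]

enl_1 = equivalent_number_of_looks(one_look)
enl_4 = equivalent_number_of_looks(four_look)
# The speckle variance drops by roughly the number of (independent) looks.
variance_ratio = statistics.pvariance(one_look) / statistics.pvariance(four_look)
print(round(enl_1, 2), round(enl_4, 2), round(variance_ratio, 2))
```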

In  terms  of  the  joint  probability  distribution  function 
itself,  the  only  applicable  result  is  that  the  statistics  tend 
to  normality  for  large  N.  Although  it  is  known  that  in 
practice  this  is  not  a  solid  assumption  until  N  is  of  the 
order  of  several  tens,  one  can  make  the  approximation 
of normality starting at N = 4. The joint probability
distribution function of a set of M zero-mean intensities
is therefore

and $R_a$ is given in (17).

By  using  the  model  [65]  for  the  region  labelling  pro¬ 
cess,  MAP,  ICM  and  MPM  estimates  of  the  labelling 
process can be computed. The methods are similar to
the complex data case and involve 1/8 as many operations
for the same area coverage.

From the conditional distribution of a single polarimetric
measurement vector (28), one can derive the joint
conditional distribution of a small set of contiguous
polarimetric measurement vectors contained in a window
$N_s$ centered at site s. Because of the presence of the
SAR system impulse response function, part of this additional
information is already contained in $X_s$, but for
computational convenience one can assume that the M
polarimetric measurement vectors $\mathbf{X}_s = [X_1, \ldots, X_M]$
corresponding to the pixel elements of $N_s$ are spatially
uncorrelated. As the polarimetric complex amplitudes
are circular Gaussian, uncorrelated vectors are also
independent, and the conditional distribution of
$\mathbf{X}_s$ is the product of the marginal
distributions. Again, one can view the polarimetric data
array as composed of a set of overlapping windows of M
elements centered at each pixel site s such that each of
these windows is homogeneous (all pixels have the same
region label), and conditionally independent of the other
windows. The joint conditional distribution of $\mathbf{X}_s$ is then
expressed, using a Gibbs representation, as

where the energy function U' is

The operation of computing (45) is equivalent to an M-look
operation performed on polarimetric SAR data [22].
As  (44)  does  not  depend  on  the  absolute  phase  of  each 
component  of  the  polarimetric  measurement,  the  same 
algorithm  can  therefore  be  used  for  both  single  look  and 
multilook  SAR  polarimetric  complex  data. 
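As an illustration of why absolute phase drops out, the sketch below assumes that the M-look operation is the usual average of the outer products of the M measurement vectors in a window; the expressions (44) and (45) themselves are not reproduced here, and the sample values and function names are hypothetical. Rotating each vector by its own absolute phase leaves the averaged covariance unchanged.

```python
import cmath


def outer(x):
    """Outer product x x^H of one complex polarimetric measurement vector."""
    return [[a * b.conjugate() for b in x] for a in x]


def multilook_covariance(vectors):
    """Average the outer products of the M vectors in a window."""
    m = len(vectors)
    dim = len(vectors[0])
    cov = [[0j] * dim for _ in range(dim)]
    for x in vectors:
        for i, row in enumerate(outer(x)):
            for j, value in enumerate(row):
                cov[i][j] += value / m
    return cov


# Hypothetical 3-channel (HH, HV, VV) samples from one small window (M = 4).
window = [
    [1 + 2j, 0.1 - 0.2j, 0.5 + 1j],
    [-0.5 + 1j, 0.3 + 0.1j, -1 + 0.5j],
    [2 - 1j, -0.2 + 0.2j, 1.5 - 0.5j],
    [0.5 + 0.5j, 0.1 + 0.3j, 1 + 1j],
]
cov = multilook_covariance(window)

# Apply a different absolute phase to every measurement vector.
phases = [0.3, 1.7, -2.1, 0.9]
rotated = [[cmath.exp(1j * p) * v for v in x] for x, p in zip(window, phases)]
cov_rotated = multilook_covariance(rotated)
```

The two matrices agree to rounding error, and each is Hermitian, which is the property a covariance-based energy function relies on.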

The  segmentation  technique  of  [43]  suffers  from  the 
shortcoming  of  only  using  a  multispectral  description. 
One  can  modify  the  technique  so  that  region  level  de¬ 
scriptions  can  be  added  to  improve  segmentation. 

The  major  difficulty  is  to  be  able  to  simply  model  the 
terms  of  the  covariance  matrix.  The  estimation  of  the 
polarimetric parameters $\epsilon$, $\rho$, and $\gamma$ has also been
acknowledged as being difficult. A simple model used in
conjunction  with  a  multilevel  Markov  random  field  is  ad¬ 
equate  to  represent  the  labelling  process  and  the  same 
technique  used  for  SAR  complex  data  can  be  used  to 
compute  the  MAP  or  the  MPM  estimate  of  the  labelling 
process.  One  can  also  study  the  effect  of  reducing  the  di¬ 
mension  of  the  polarimetric  measurement  vector  on  seg¬ 
mentation  accuracy. 

Analyzing  multifrequency  polarimetric  data  repre¬ 
sents  a  real  challenge  to  the  experimentalist  because  of 
the  high  dimensionality  of  the  data.  It  is  also  a  chal¬ 
lenge  for  scientists  because  little  has  been  done  in  terms 
of  classifying  multifrequency  polarimetric  data.  Most  of 
the  early  analysis  has  been  through  visual  interpreta¬ 
tion.  If  the  segmentation  of  polarimetric  SAR  data  can 
be  performed  by  the  aforementioned  technique,  it  would 
then  be  simple  to  add  other  frequencies  to  the  data  to 
improve  feature  detection  (since  multifrequency  data  are 
independent)  so  that  the  algorithm  can  be  used  to  seg¬ 
ment  multifrequency  polarimetric  SAR  data. 

4.3  Change  detection  in  SAR  images 
4.3.1  Difference  versus  ratio 

Given a pair of SAR images of the same scene acquired
at two different dates, the question is
whether it is better to use the ratio or the difference of
the SAR intensities to measure the degree of change in
the backscattering properties of the target.

In single look SAR intensity data, the distribution of
the intensity is

$$p(I_1/\langle I_1\rangle) = \exp -\left\{\frac{I_1}{\langle I_1\rangle} + \log\langle I_1\rangle\right\} \qquad (46)$$

at a time $t_1$ for a given homogeneous region. A similar
expression can be written at a different observation time
$t_2$.

If  there  is  no  change  in  the  physical  and  electrical  char¬ 
acteristics  of  the  surface  between  the  two  dates  and  if 
the  imaging  conditions  are  identical,  image  speckle  will 
be  the  same  in  both  images.  If  the  data  are  coregis¬ 
tered  with  sub-pixel  accuracy,  speckle  will  be  exactly 
repeated pixel by pixel as the distribution of scatterers in
each  resolution  cell  will  be  the  same.  This  is  in  fact  the 
fundamental  principle  of  SAR  interferometry  which  has 
been verified as valid even in the case of SAR data acquired
several days apart [32]. In those circumstances,
the difference between the two intensities is independent
of image speckle.

By  contrast,  if  a  change  occurs  in  the  physical  and 
electrical  characteristics  of  the  surface,  i.e.  the  distribu¬ 
tion  of  scatterers  and  the  resulting  backscatter  response 
are  different,  image  speckle  will  be  uncorrelated  between 
the  two  dates.  As  image  speckle  has  circular  Gaussian 
statistics,  we  may  thereby  assume  that,  for  change  de¬ 
tection,  multitemporal  realizations  of  speckle  are  mathe¬ 
matically  independent.  The  joint  distribution  of  the  two 
observations is therefore

$$p(I_1, I_2/\langle I_1\rangle, \langle I_2\rangle) = p(I_1/\langle I_1\rangle)\, p(I_2/\langle I_2\rangle) \qquad (47)$$


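The distribution function of the difference $d = I_2 - I_1$
is derived by integrating the joint distribution over $I_1$,
leading, for $d \ge 0$, to

$$p(d/\langle I_1\rangle, \langle I_2\rangle) = \frac{1}{\langle I_1\rangle + \langle I_2\rangle} \exp\left\{-\frac{d}{\langle I_2\rangle}\right\} \qquad (48)$$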


which is similar to the distribution of the original speckled
intensities. Similarly, letting $r = I_2/I_1$, the distribution
of the ratio is derived after a change in variables by
integrating the joint distribution over $I_1$, leading to

$$p(r/\langle I_1\rangle, \langle I_2\rangle) = \frac{\langle I_2\rangle/\langle I_1\rangle}{\left(r + \langle I_2\rangle/\langle I_1\rangle\right)^2} \qquad (49)$$
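The ratio distribution can be checked by simulation. The sketch below assumes independent single-look (exponential) intensities at the two dates, as in the derivation above; the cumulative form $F(r) = r/(r + \langle I_2\rangle/\langle I_1\rangle)$ is obtained by integrating the ratio density, and the mean values and function names are illustrative.

```python
import random


def ratio_cdf(r, mean_ratio):
    """CDF obtained by integrating the ratio density: r / (r + <I2>/<I1>)."""
    return r / (r + mean_ratio)


rng = random.Random(1)
mean_1, mean_2 = 1.0, 3.0  # hypothetical mean intensities at the two dates
ratios = [
    rng.expovariate(1.0 / mean_2) / rng.expovariate(1.0 / mean_1)
    for _ in range(50000)
]

# Compare the empirical and predicted CDF at a few ratio values.
checks = []
for r in (0.5, 3.0, 10.0):
    empirical = sum(x <= r for x in ratios) / len(ratios)
    checks.append((r, empirical, ratio_cdf(r, mean_2 / mean_1)))

for r, empirical, predicted in checks:
    print(f"r={r}: empirical={empirical:.3f} predicted={predicted:.3f}")
```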


If we assume a threshold $d_0$ to decide whether to
classify a given difference $d$ as class A, characterized
by a difference $d_A = \langle I_A\rangle - \langle I_1\rangle$, or as class B,
characterized by a difference $d_B = \langle I_B\rangle - \langle I_1\rangle$ (i.e.
$d > d_0$ is classified as class B, and $d < d_0$ is classified as
class A), the class A probability of error is given by

$$PE_A = \int_{d_0}^{\infty} p(d/d_A)\,dd = \frac{\langle I_A\rangle/\langle I_1\rangle}{1 + \langle I_A\rangle/\langle I_1\rangle} \exp\{-d_0/\langle I_A\rangle\}$$

and the class B probability of error by

$$PE_B = \int_{0}^{d_0} p(d/d_B)\,dd = \frac{\langle I_B\rangle/\langle I_1\rangle}{1 + \langle I_B\rangle/\langle I_1\rangle} \left(1 - \exp\{-d_0/\langle I_B\rangle\}\right) \qquad (50)$$

Assuming equal a priori probabilities for class A and class
B, the decision threshold is found using

$$p(d_0/d_A) = p(d_0/d_B) \qquad (51)$$

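leading to

$$d_0 = \frac{\langle I_A\rangle \langle I_B\rangle}{\langle I_B\rangle - \langle I_A\rangle} \log\frac{\langle I_1\rangle + \langle I_B\rangle}{\langle I_1\rangle + \langle I_A\rangle} \qquad (52)$$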


The total probability of error is then computed as

$$PE = \frac{1}{2}\left(PE_A + PE_B\right) \qquad (53)$$
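The difference-based decision rule can be evaluated numerically as follows. This is a sketch under stated assumptions: single-look data, a difference density of the exponential form $\exp(-d/\langle I_2\rangle)/(\langle I_1\rangle + \langle I_2\rangle)$ for $d \ge 0$, class B brighter than class A, and a threshold obtained by equating the two class densities at $d_0$. As in the class B error integral above, only non-negative differences are integrated. Function names and the example intensities are hypothetical.

```python
import math


def difference_pdf(d, mean_1, mean_2):
    """Density of d = I2 - I1 for d >= 0 (independent exponential intensities)."""
    return math.exp(-d / mean_2) / (mean_1 + mean_2)


def difference_threshold(mean_1, mean_a, mean_b):
    """Threshold d0 where the class A and class B densities are equal.

    Assumes mean_b > mean_a so that d0 is positive.
    """
    scale = mean_a * mean_b / (mean_b - mean_a)
    return scale * math.log((mean_1 + mean_b) / (mean_1 + mean_a))


def difference_error(mean_1, mean_a, mean_b):
    """Return d0, the two class errors, and their equal-prior average."""
    d0 = difference_threshold(mean_1, mean_a, mean_b)
    # Class A error: integral of the class A density from d0 to infinity.
    pe_a = mean_a / (mean_1 + mean_a) * math.exp(-d0 / mean_a)
    # Class B error: integral of the class B density from 0 to d0.
    pe_b = mean_b / (mean_1 + mean_b) * (1.0 - math.exp(-d0 / mean_b))
    return d0, pe_a, pe_b, 0.5 * (pe_a + pe_b)


d0, pe_a, pe_b, pe = difference_error(mean_1=1.0, mean_a=2.0, mean_b=8.0)
print(f"d0={d0:.3f} PE_A={pe_a:.3f} PE_B={pe_b:.3f} PE={pe:.3f}")
```

Changing `mean_1` while keeping the class contrast fixed changes the result, which is the dependence on absolute level discussed for the difference.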


Figure 8: Probability of error of the difference and of the
ratio  for  single-look  and  four-look  SAR  data. 


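In the case of the ratio, with $r_A = \langle I_A\rangle/\langle I_1\rangle$
and $r_B = \langle I_B\rangle/\langle I_1\rangle$, the decision threshold is

$$r_0 = \frac{\sqrt{\langle I_A\rangle \langle I_B\rangle}}{\langle I_1\rangle} \qquad (54)$$

and the probability of error is

$$PE = \frac{1}{1 + \sqrt{\langle I_A\rangle/\langle I_B\rangle}} \qquad (55)$$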


The probability of error using the ratio only depends
on the ratio of the two possible changes (i.e. $\langle I_A\rangle/\langle I_B\rangle$),
whereas the probability of error using the
difference also depends on the absolute change of each
class (i.e. $\langle I_A\rangle/\langle I_1\rangle$ or $\langle I_B\rangle/\langle I_1\rangle$).
A plot of the two probabilities of error is shown in Figure 8
as a function of the contrast between the possible classes
$\langle I_A\rangle/\langle I_B\rangle$, for $\langle I_1\rangle = 1$, and where
$R = \langle I_B\rangle/\langle I_1\rangle$. The results illustrate the better performance
of the ratio independent of the intensity level $I_1$ of the
region.
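The statement that the ratio error depends only on the class contrast can be illustrated with a small Monte Carlo experiment. The sketch assumes single-look exponential intensities and equal priors; the contrast in the closed form is taken as the larger class mean over the smaller (a convention adopted here so that the error stays below one half), and all names and values are hypothetical.

```python
import math
import random


def ratio_threshold(mean_1, mean_a, mean_b):
    """Decision threshold on r = I2 / I1: sqrt(<I_A><I_B>) / <I_1>."""
    return math.sqrt(mean_a * mean_b) / mean_1


def ratio_error(mean_a, mean_b):
    """Closed-form error of the ratio test; contrast is taken as >= 1."""
    contrast = max(mean_a, mean_b) / min(mean_a, mean_b)
    return 1.0 / (1.0 + math.sqrt(contrast))


def simulated_ratio_error(mean_1, mean_a, mean_b, trials, rng):
    """Monte Carlo error for single-look data, assuming mean_b > mean_a."""
    r0 = ratio_threshold(mean_1, mean_a, mean_b)
    errors = 0
    for _ in range(trials):
        i1 = rng.expovariate(1.0 / mean_1)
        # A class A pixel is misclassified when its ratio exceeds r0.
        errors += rng.expovariate(1.0 / mean_a) / i1 > r0
        i1 = rng.expovariate(1.0 / mean_1)
        # A class B pixel is misclassified when its ratio falls below r0.
        errors += rng.expovariate(1.0 / mean_b) / i1 < r0
    return errors / (2 * trials)


rng = random.Random(2)
predicted = ratio_error(mean_a=2.0, mean_b=8.0)
# Vary the first-date intensity level; the error should not move.
simulated = {
    mean_1: simulated_ratio_error(mean_1, 2.0, 8.0, 40000, rng)
    for mean_1 in (0.5, 1.0, 5.0)
}
for mean_1, value in simulated.items():
    print(f"<I1>={mean_1}: simulated={value:.3f} predicted={predicted:.3f}")
```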

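The same computation can be done using a gamma
distribution of the intensity, leading to a distribution of
the ratio

$$p(r/\langle I_1\rangle, \langle I_2\rangle) = \frac{\Gamma(2N)}{\left(\Gamma(N)\right)^2} \frac{\left(\langle I_2\rangle/\langle I_1\rangle\right)^N r^{N-1}}{\left(r + \langle I_2\rangle/\langle I_1\rangle\right)^{2N}} \qquad (56)$$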

assuming  that  the  number  of  looks  N  remains  the  same. 
The  decision  threshold  is  equal  to  the  decision  threshold 
of the single look case. For N = 4, the corresponding
probability of error is shown in Figure 8 and illustrates
the significant improvement in detection accuracy
obtained  when  multilook  SAR  intensity  data  are  used 
rather  than  single  look  SAR  data. 
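The gain from multilooking can be reproduced without numerical integration. The sketch below relies on an added observation that is not in the text: for gamma-distributed N-look intensities with independent looks, $r/(r + \bar{r})$ follows a Beta(N, N) distribution, so the error at the threshold $r_0 = \sqrt{r_A r_B}$ is a regularized incomplete beta function, which for integer N reduces to a binomial sum. The contrast is again taken as the larger class mean over the smaller; names are illustrative.

```python
import math


def beta_cdf_integer(x, n):
    """Regularized incomplete beta I_x(n, n) for integer n, as a binomial sum."""
    total = 2 * n - 1
    return sum(
        math.comb(total, k) * x**k * (1.0 - x) ** (total - k)
        for k in range(n, total + 1)
    )


def multilook_ratio_error(contrast, looks):
    """Equal-prior error of the ratio test for `looks`-look intensities.

    `contrast` is the larger class mean divided by the smaller (>= 1).
    """
    x = 1.0 / (1.0 + math.sqrt(contrast))
    return beta_cdf_integer(x, looks)


for contrast in (2.0, 4.0, 16.0):
    single = multilook_ratio_error(contrast, 1)
    four = multilook_ratio_error(contrast, 4)
    print(f"contrast={contrast}: 1-look={single:.4f} 4-look={four:.4f}")
```

For one look the expression reduces to the single-look closed form, and for four looks the error is lower at every contrast tried here.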

A Bayes classifier for change detection
Multitemporal SAR data can be segmented into regions
of homogeneous and similar change in radar backscatter.
As before, we view the image array $\Omega$ as composed of a
set of conditionally independent overlapping windows $N_s$
of M elements centered at each pixel s such that all the
pixels contained in the window have the same region label.
The joint distribution function of the M backscatter
ratios $\mathbf{r}_s = [r_1, \ldots, r_M]$ contained in a neighborhood $N_s$
of site s is approximated by the product of the marginal
distributions, i.e.

$$p(\mathbf{r}_s/L_s) = \prod_{i=1}^{M} p(r_i/L_s) \qquad (57)$$

The corresponding likelihood function is

$$\ln[p(\mathbf{r}_s/L_s)] \propto M\, U_1(\mathbf{r}_s/L_s) \qquad (58)$$

where

$$U_1(\mathbf{r}_s/L_s) = \frac{1}{M} \sum_{i=1}^{M} \left[(N-1)\ln r_i - 2N \ln(r_i + r_l)\right] + N \ln r_l \qquad (59)$$

$r_l$ is the ratio of region $L_s = l$, and N is the number of
looks.
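The likelihood term alone already gives a window-level classifier. The following sketch evaluates a $U_1$ of the form $\frac{1}{M}\sum_i [(N-1)\ln r_i - 2N\ln(r_i + r_l)] + N\ln r_l$ for each candidate region ratio and keeps the largest; the class names, region ratios and window values are hypothetical, and no spatial prior is used.

```python
import math


def u1(ratios, region_ratio, looks):
    """Per-sample log-likelihood of the window ratios for one region ratio."""
    m = len(ratios)
    total = sum(
        (looks - 1) * math.log(r) - 2 * looks * math.log(r + region_ratio)
        for r in ratios
    )
    return total / m + looks * math.log(region_ratio)


def most_likely_label(ratios, region_ratios, looks):
    """Pick the label whose region ratio maximizes U_1 (likelihood only)."""
    return max(
        region_ratios,
        key=lambda label: u1(ratios, region_ratios[label], looks),
    )


# Hypothetical region ratios r_l for three change classes.
region_ratios = {"decrease": 0.25, "no change": 1.0, "increase": 4.0}
window = [0.8, 1.3, 0.9, 1.1, 1.6, 0.7, 1.0, 1.2, 0.9]  # M = 9 ratios
label = most_likely_label(window, region_ratios, looks=4)
print(label)
```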

The distribution of the region labels is given as

$$p(L_s/L_r, r \in N_s^0) \propto \exp -\{M\, U_2(L_s/L_r, r \in N_s^0)\} \qquad (60)$$

The logarithm of the posterior distribution of the region
labels is

$$\ln[p(L_s/\mathbf{r}_s, L_r, r \in N_s^0)] \propto U_1(\mathbf{r}_s/L_s) + U_2(L_s/L_r, r \in N_s^0) \qquad (61)$$

Using  the  approximate  MAP  criterion,  optimal  region 
labelling  of  the  image  is  defined  as  minimizing  an  energy 
function 

$$E_{MAP} = \sum_{s} \left( U_1(\mathbf{r}_s/L_s) + U_2(L_s/L_r, r \in N_s^0) \right) \qquad (62)$$

The  optimization  techniques  discussed  in  [65]  again  are 
applicable  to  compute  an  approximate  MAP  or  other 
solutions. 
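One of the approximate optimizers, iterated conditional modes (ICM), can be sketched as follows. Several assumptions are made for illustration only: the label prior is a Potts model over the four nearest neighbours with a hypothetical weight `beta` (the model of [65] is not reproduced), the data term is the per-pixel negative log-likelihood of the gamma ratio model (window size M = 1), and the energy being minimized is the data term plus `beta` times the number of disagreeing neighbours.

```python
import math


def data_energy(ratio, region_ratio, looks):
    """Negative log-likelihood of one ratio under the gamma ratio model."""
    return -(
        (looks - 1) * math.log(ratio)
        + looks * math.log(region_ratio)
        - 2 * looks * math.log(ratio + region_ratio)
    )


def icm(ratios, region_ratios, looks, beta=1.5, sweeps=5):
    """Iterated conditional modes with an assumed Potts prior."""
    rows, cols = len(ratios), len(ratios[0])
    labels = range(len(region_ratios))
    # Initialise with the pixel-wise maximum-likelihood label.
    field = [
        [min(labels, key=lambda lab: data_energy(r, region_ratios[lab], looks))
         for r in row]
        for row in ratios
    ]
    for _ in range(sweeps):
        changed = False
        for i in range(rows):
            for j in range(cols):
                neighbours = [
                    field[a][b]
                    for a, b in ((i - 1, j), (i + 1, j), (i, j - 1), (i, j + 1))
                    if 0 <= a < rows and 0 <= b < cols
                ]

                def local_energy(lab):
                    disagreements = sum(n != lab for n in neighbours)
                    data = data_energy(ratios[i][j], region_ratios[lab], looks)
                    return data + beta * disagreements

                best = min(labels, key=local_energy)
                if best != field[i][j]:
                    field[i][j] = best
                    changed = True
        if not changed:
            break
    return field


# Hypothetical 4-look ratio image: left half near 1, right half near 4,
# with one speckle outlier (5.0) inside the left region.
ratio_image = [
    [1.0, 0.9, 1.2, 3.8, 4.5, 4.1],
    [1.1, 5.0, 0.8, 4.2, 3.6, 4.4],
    [0.9, 1.0, 1.1, 4.0, 4.3, 3.9],
    [1.2, 0.8, 1.0, 3.7, 4.1, 4.6],
]
segmentation = icm(ratio_image, region_ratios=[1.0, 4.0], looks=4)
for row in segmentation:
    print(row)
```

The prior weight trades data fidelity against smoothness: a pixel-wise classifier would label the outlier as changed, while the neighbourhood term absorbs it into the surrounding region.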

The  feature  vector  used  for  clustering  will  be  the  dif¬ 
ference  between  the  logarithm  of  the  components  of  the 
covariance  matrix  of  each  data  sample.  Clustering  tech¬ 
niques  [1]  can  be  used  to  cluster  the  sample  measure¬ 
ments. 

4.3.2  Multitemporal  segmentation  of  SAR  data 
Change  detection  only  addresses  the  relative  change 
in  the  backscatter  characteristics  of  the  surface  between 
two  different  dates.  In  some  circumstances,  it  may  be 
useful  to  consider  the  absolute  values  of  the  backscatter 
returns  and  classify  the  data  into  regions  of  homoge¬ 
neous  change  and  homogeneous  backscatter  character¬ 
istics.  Such  a  strategy  may  prove  useful  for  difficult 
segmentation  problems  where  single  observations  cannot 
separate  different  types  of  natural  surfaces,  e.g.  different 
types  of  terrain. 

In  that  case,  the  joint  distribution  of  the  multitem¬ 
poral  radar  measurements  must  be  used  instead  of  the 
distribution  of  the  backscatter  power  ratio.  The  tech¬ 
nique  is  similar  to  the  one  adopted  for  the  segmentation 
of multifrequency SAR data.

5 Conclusion

We  have  discussed  a  number  of  important  and  interest¬ 
ing  problems  related  to  analysis  and  interpretation  of 
SAR  imagery.  Much  more  research  needs  to  be  done  for 
effective transfer of IU methodology to the SAR domain.

Abstract

Automatic  Target  Recognition  (ATR)  is  an  ex¬ 
tremely  important  capability  for  DoD  applications. 
In  this  paper  we  discuss  the  problems  in  developing 
real-world  ATR  systems  and  present  the  status  of 
technology  for  these  systems.  We  identify  some  im¬ 
age  understanding  problems  that  need  to  be  solved 
in  order  to  enhance  the  effectiveness  of  ATR-based 
weapon  systems.  The  technological  gains  will  also 
lead  to  significant  advances  in  other  areas  of  appli¬ 
cations  of  image  understanding. 

1  Introduction 

Automatic  Target  Recognition  (ATR)  is  the  process 
of  automatic  target  acquisition  and  classification. 
All  the  services  need  ATR  capability  [8]  for  a  vari¬ 
ety  of  missions  and  scenarios  such  as  air-to-ground, 
ground-to-ground,  surface-to-surface,  air-to-air,  etc. 
ATR  systems  that  can  work  in  multiscenarios  do  not 
yet  exist.  The  development  of  automatic  target  rec¬ 
ognizers  that  can  perform  effectively  in  dynamic  en¬ 
vironmental  conditions  will  be  of  great  importance 
[1].

The  generic  ATR  problem  is  to  take  information 
from  one  or  more  sensors,  and  if  necessary,  com¬ 
bine  it  with  a  priori  information  (e.g.,  tactics,  digi¬ 
tal  map  information,  etc.).  A  decision  is  then  made 
about which targets are present. The targets are
possibly prioritized by their tactical importance so
that action can be taken to eliminate them. The
sensors  can  be  any  one  of  a  variety  of  types  -  acous¬ 
tic,  seismic,  visible  or  near  infrared,  laser,  millimeter 
wave,  thermal  imagers,  radar,  etc.  -  or  some  combi¬ 
nation  of  them.  The  weapon  platforms  can  be  air¬ 
craft,  missile,  ship,  or  ground  vehicles.  The  targets 
can  be  fixed  (e.g.,  storage  depots,  bridges,  ships  in 
ports,  etc.)  or  mobile  (e.g.,  tanks,  helicopters,  ships 
at  sea,  etc.). 

The  term  ATR  includes  both  autonomous  and  aided 
recognition  (or  cueing  with  a  “person  in  the  loop”). 
In cueing, the acquisition is done by the targeting
system, but ultimately recognition is done by the
person.  Although  many  researchers  would  like  to 
perform  a  wide  variety  of  missions  autonomously, 
the  services  will  only  automate  critical  operator 
functions reluctantly. There is a built-in bias toward
the flexibility of the human operator (e.g., the
Air Force still relies on manned strategic nuclear
bombers despite excellent land and sea based strategic
missiles). There is more willingness to remove
the operator in the missions where survivability of a
human  is  low.  Soldiers  may  be  moved  further  from 
the  “action,”  but  we  do  not  expect  them  to  relin¬ 
quish  control.  Aided  systems  with  a  “person  in  the 
loop”  will  be  employed  before  autonomous  ones. 

It is clear that ATR is a multidisciplinary area that
requires  diverse  technology  and  expertise  in  sensors, 
processing,  architecture,  implementation,  and  evalu¬ 
ation  of  software  and  hardware  systems.  The  related 
computer  vision  and  pattern  recognition  technology 
and  systems  have  evolved  from  using  the  statistical 
pattern  recognition  approaches  to  model-based  vi¬ 
sion,  to  knowledge-based  systems.  Recently,  adap¬ 
tive  and  learning  systems  focused  on  parts  of  ATR 
problems  are  also  being  developed  in  laboratories. 

In  this  paper,  we  first  present  typical  ATR  appli¬ 
cations  for  various  services  in  Section  2.  This  is 
followed  by  a  discussion  of  technological  problems 
in  developing  real-world  ATR  systems  in  Section  3. 
Section  4  presents  the  current  status  of  technology 
for ATR systems. In Section 5 we identify and discuss
technical research areas related to ATR and image
understanding. Finally, in Section 6 we present
the conclusions of this paper.

2  ATR  Applications 

Several  Army  applications  needing  ATR  technology 
are fire-and-forget anti-tank missiles for the infantry,
smart  minefields,  targeting  for  artillery,  air  defense, 
tank  commander  and  gunner  assistance,  perimeter 
surveillance,  and  attack  helicopter  missions.  The 
attack helicopter application is probably the highest
priority for ATR insertion. It is a good example
of  the  functional  capability  desired.  The  mission 
is  currently  performed  in  the  Army  by  the  Apache 
helicopter  with  a  crew  of  a  pilot  and  weapons  op¬ 
erator.  The  weapons  operator,  by  viewing  through 
an  optical  telescope,  day  TV,  or  thermal  infrared 
imager,  selects  targets  which  are  designated  by  a 
laser.  Hellfire  missiles  fly  to  the  laser  spot  on  the 
target. The time taken by the operator searching
for targets makes this platform vulnerable to enemy
anti-aircraft fire. To shorten this time, in the
new Comanche helicopter, the operator will be assisted
by an ATR process that processes the sensor
information and directs the operator to suspected
targets. 

The Navy’s largest effort on detection and recognition
focuses on undersea vessels using acoustics.
This application is highly classified and
does  not  overlap  significantly  with  surface  applica¬ 
tions.  Finding  surface  ships  is  done  primarily  with 
radar.  There  is  a  need  for  anti-ship  missiles  to  au¬ 
tonomously  attack  the  most  valuable  (dangerous) 
ships  in  enemy  battle  groups.  This  requires  ATR 
capability.  Both  the  Navy  and  the  Air  Force  share 
the  need  for  air-to-air  fighter  identification. 

The  Air  Force  is  the  primary  service  for  conducting 
air  strikes  into  enemy-held  territory.  This  covers 
close  air  support,  attacks  on  second  echelon  forces 
tens of kilometers from the battle, and deeper strikes on
resupply points, power plants, rail centers, or storage depots.
The  mobile  targets  like  advancing  tanks  or  hidden 
targets  like  SCUD’s  are  the  more  challenging  ATR 
applications.  The  fixed  targets  allow  for  sufficient 
time to prepare references and attack profiles that
make  the  ATR  task  easier. 

The  main  distinction  between  tactical  and  strate¬ 
gic  is  in  the  significance  of  the  targets.  In  the  ATR 
world,  it  is  generally  easier  to  automate  the  func¬ 
tions  against  the  strategic  targets  than  tactical  ones 
because  more  resources  can  be  devoted  to  the  prob¬ 
lem.  Also,  in  tactical  situations,  the  background  is 
continuously  changing  because  targets  are  generally 
mobile.  The  underlying  processing  technology  for 
tactical and strategic situations is quite similar.

It  is  to  be  noted  that  the  effectiveness  of  autonomous 
smart  weapons  guided  by  multisensors  that  we  have 
witnessed  recently  during  Desert  Storm  was  almost 
entirely  against  fixed  targets.  This  allowed  signif¬ 
icant  mission  preplanning.  To  achieve  comparable 
results  against  mobile  targets  will  require  a  major 
infusion  of  ATR  technology. 

Because  of  the  wide  variety  of  missions,  sensors,  and 
functions  which  an  ATR  system  may  be  required  to 
deal  with,  there  is  no  single  solution  to  the  algo¬ 
rithm  or  hardware  task.  For  each  application,  the 


information  available  or  attainable  must  be  uniquely 
matched  with  the  appropriate  processing  required 
for  the  functional  needs  of  the  mission. 

3  Problems  in  Developing 
Real-World ATR Systems

The  nature  of  the  ATR  problem  is  characterized 
by  nonrepeatability  of  target  signatures,  competing 
cluttered  objects,  obscured  targets,  low  contrast  (for 
some  sensors),  long  range  (low  resolution),  conflict¬ 
ing  evidence,  natural  variability,  presence  of  camou¬ 
flage,  concealment  and  deception,  and  a  wide  variety 
of  outdoor  scenarios  (geographic  areas,  weather  and 
battlefield  conditions),  etc. 

Some  of  the  problems  associated  with  developing 
ATR  systems  are: 

•  Robust  Algorithms 

The  key  scientific  problem  is  the  absence  of  ro¬ 
bust  image  understanding  algorithms  that  can 
work  in  multiscenarios.  Current  image  under¬ 
standing  algorithms  do  not  provide  the  neces¬ 
sary  consistency,  reliability  and  predictability 
of  results.  For  example,  current  algorithms  re¬ 
sult  in  high  false  alarm  rates  in  images  with 
high  clutter  and  poor  recognition  performance 
in  images  without  well  defined  target  signa¬ 
tures.  They  make  very  little  use  of  a  priori  in¬ 
formation  related  to  and  present  in  the  image, 
and  generally  make  little  use  of  meta-knowledge 
control. 

•  Validation  and  Performance 

In  the  current  ATR  systems,  there  is  an  absence 
of  validation  of  data  and  models.  Although  we 
can  measure  some  performance,  we  lack  metrics 
to  measure  the  input,  i.e.,  useful  techniques  to 
characterize  the  input  data  are  not  available. 
Also  there  are  problems  associated  with  real¬ 
time  performance  evaluation.  There  is  a  need 
for  suitable  multiscenario  databases.  As  a  re¬ 
sult,  current  experience  in  the  ATR  field  is  only 
with limited multisensor databases which are
not  very  representative  of  scenario  variability. 

•  Lack  of  Specific  Mission  Requirements 
Because  the  users  of  this  technology  do  not 
have  a  clear  idea  of  what  capability  is  achiev¬ 
able,  there  is  a  reluctance  to  establish  firm  re¬ 
quirements.  Instead,  all  the  services  have  listed 
ATR  as  a  desired  capability  for  a  variety  of  mis¬ 
sions.  A  panel  under  the  DoD  Working  Group 
on  ATR  Technology  has  listed  these  mission 
needs.  However,  the  recent  major  shift  in  the 
perceived  U.S.  threat  is  likely  to  modify  this. 

• Software/Hardware

There are problems associated with specialized
or  general-purpose  efficient  processor  architec¬ 
tures  with  hardware/software  programmabil¬ 
ity  and  managing  both  numeric  and  symbolic 
computation.  General  purpose  architectures 
are hard to optimize with respect to specialized
processing, and special purpose architectures
have problems with flexibility and programmability.
Also, there are no standard modules,
hardware or software, to allow economy of
scale  to  reduce  costs. 

•  Computational  Power 

Since  the  mission  is  not  specifically  defined, 
there  is  no  simple  answer  to  required  computa¬ 
tional  throughput.  It  is  clearly  a  function  of  the 
bandwidth  of  the  incoming  data,  the  functions 
to  be  performed,  and  the  complexity  of  those 
functions.  In  most  cases,  the  processing  will 
be  expanded  to  the  capability  of  the  available 
hardware.  Today,  this  means  that  algorithms 
can  be  developed  using  billions  of  operations 
per  second  for  applications  requiring  no  more 
than  a  few  small  cards. 

•  Man-machine  Interfaces 

Current  target  cueing  technology  can  provide 
reasonable  help  in  the  field.  However,  the  im¬ 
portant  research  issues  related  with  cueing  are 
reliable  target  acquisition  and  man-machine  in¬ 
terfaces  for  presenting  the  information  to  the 
operator/pilot. Interface concepts for how best to use the
operator and ATR together are just beginning to emerge.
What information to pass, how to pass it, and when to
pass it are some of
the unresolved issues for most applications.

•  Technology  Transfer 

As  a  result  of  the  proprietary  nature  of  algo¬ 
rithms  and  systems,  sharing  of  algorithms,  soft¬ 
ware,  hardware  and  data  is  difficult. 

4  Status  of  Technology  For  ATR 
Systems 

Significant  progress  has  been  made  during  the  last 
ten years from ad hoc techniques like looking for hot
spots  in  infrared  images  by  thresholding  on  contrast 
measures  to  more  scientific  understanding  of  sen¬ 
sors,  algorithms,  architectures,  processors,  systems 
and  associated  technology  for  software/hardware 
implementation  [3]. 

During  this  time,  the  need  for  better  databases  used 
in  training  and  testing  ATR  systems  has  been  real¬ 
ized.  Although  more  and  better  data  will  always 
be  desired,  several  data  sets  have  been  established 
with  ATR  development  in  mind.  In  general,  they 
have  been  characterized  better  than  previous  data 


sets  and  often  have  machine  readable  ground  truth 
or  image  truth  with  them,  rather  than  just  hand 
scribbled  log  books. 

The  processor  technology  has  been  revolutionary.  In 
1980  an  algorithm  often  took  30  minutes  or  more  to 
run  on  a  general  purpose  computer  such  as  a  VAX 
11/750.  To  get  the  times  down  to  a  few  seconds, 
in  order  to  process  a  large  number  of  images,  dedi¬ 
cated  processors  were  required  for  which  algorithms 
could not be changed without modifying the hardware,
a task often requiring several months. Today,
commercial signal processing hardware exists
to  perform  many  of  the  component  modules  needed 
in  ATR  functions.  This  hardware  is  quite  valuable 
in  an  R&D  environment. 

Simple  algorithms  to  achieve  real-time  performance 
have  been  implemented  in  hardware  and  several  such 
systems  exist.  Some  of  these  systems  have  also  been 
field  tested.  These  systems  are  based  purely  on  sta- 
tistical  pattern  recognition  algorithms.  They  em¬ 
ploy  limited  multiframe  processing  analysis  and  pos¬ 
sess  no  countermeasure  capability.  With  inadequate 
training  data,  these  algorithms  alone  could  never 
hope to achieve the robustness required in many military
applications. Detection and classification performance
has been satisfactory only for limited low-clutter,
high-contrast FLIR imagery.

At  present,  in  laboratory  prototypes  or  in  simu¬ 
lation,  ATR  systems  employing  more  complex  al¬ 
gorithms  and  cross-sensor  features  including  some 
model-based ATR are being developed. These systems
use multiple sensors (e.g., FLIR, MMW, LADAR,
SAR) but make limited use of models, a
priori information, countermeasure resistance, and
scene analysis techniques. They have been tested
on very small databases. Recognition performance
has ranged from marginal to satisfactory, and only
for unobstructed medium-contrast FLIR imagery.
Truly model-based multisensor ATR systems do not
yet exist in real-time hardware systems.

There is now a larger set of “tools” in the algorithm
developer's “toolbox.” They include knowledge-based
tools, model-based tools, neural nets, and genetic
techniques. No one of these techniques alone
is likely to be the solution to all ATR
problems, but by applying the most useful techniques
to each piece of the problem, progress is accelerating.

5  Challenging  Image
Understanding  Problems

As  discussed  earlier,  ATR  is  a  system  that  involves 
sensors, algorithms and processors. The area where
image understanding can significantly contribute is
in algorithm development. New, improved, robust
algorithms  will  help  to  increase  the  effectiveness  and 
usage  of  ATR  systems.  They  will  provide  a  better 
understanding  of  complex  interaction  between  input 
data,  models  and  output  results.  In  the  following  we 
relate  ATR  problems  to  image  understanding  and 
identify  potentially  new  research  areas  in  image  un¬ 
derstanding  that  will  help  to  solve  the  ATR  prob¬ 
lems. 

•  Characterization  of  Input  to  an  ATR  Sys¬ 
tem 

An  ATR  system  may  be  viewed  as  a  vision  sys¬ 
tem  that  consists  of  a  variety  of  multisensors 
and  multisource  data.  It  is  desired  to  character¬ 
ize  the  input  to  an  ATR  system  so  that  we  can 
relate  the  inputs  to  the  outputs  of  the  system. 
This  relationship  is  required  for  understanding 
the  behavior  of  the  system  under  a  wide  vari¬ 
ety  of  inputs.  This  analytic  or  parametric  rela¬ 
tionship  in  turn  also  helps  to  provide  prediction 
capability  that  is  essential  for  practical  uses  of 
the  system. 

We  know  a  number  of  measures  (such  as  proba¬ 
bilities  of  detection,  false  alarm,  confusion  ma¬ 
trix,  etc.)  that  we  can  use  to  evaluate  the  out¬ 
put  of  the  system.  However,  we  do  not  know 
precisely  how  to  characterize  the  input.  A  num¬ 
ber  of  simple  information  measures  such  as  edge 
points,  entropy,  uniformity,  and  structural  mea¬ 
sures  have  been  proposed  in  the  past.  However, 
they  are  of  limited  value  in  characterizing  the 
input  objects  and  clutter.  We  need  better  mea¬ 
sures  that  are  realistic  and  may  provide  insight 
into  the  behavior.  We  would  like  to  have  some 
fundamental  results  (like  in  Information  The¬ 
ory)  that  set  the  bounds  on  the  performance 
for  recognition  that  is  based  on  the  information 
content  in  the  sensor  and  other  relevant  data. 
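As a minimal sketch of the simple information measures mentioned above, the following computes gray-level entropy and uniformity from an image histogram. These histogram-based definitions are one common choice and are an assumption here; the text does not specify which formulations were proposed.

```python
import math
from collections import Counter


def gray_level_probabilities(image):
    """Relative frequency of each gray level in a list-of-rows image."""
    pixels = [p for row in image for p in row]
    counts = Counter(pixels)
    return [count / len(pixels) for count in counts.values()]


def entropy_bits(image):
    """Shannon entropy of the gray-level histogram, in bits."""
    return sum(-p * math.log2(p) for p in gray_level_probabilities(image))


def uniformity(image):
    """Sum of squared gray-level probabilities (1.0 for a constant image)."""
    return sum(p * p for p in gray_level_probabilities(image))


flat = [[7, 7], [7, 7]]   # constant patch: zero entropy, maximal uniformity
busy = [[0, 1], [2, 3]]   # four distinct levels: 2 bits of entropy
measures = {
    "flat": (entropy_bits(flat), uniformity(flat)),
    "busy": (entropy_bits(busy), uniformity(busy)),
}
```

Both measures ignore spatial arrangement entirely, which illustrates why such measures are of limited value for characterizing objects and clutter.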

•  Consolidated  Recognition  and  Motion 
Analysis 

In a typical ATR system, tracking is generally a
component of the system. Tracking is normally
done  by  using  a  multimode  tracker  consisting 
of  some  combinations  of  2-D  feature  matching, 
centroid  matching  and  correlation  matching.  In 
the  past,  the  3-D  information  available  in  a  se¬ 
quence  of  images  has  not  been  used  to  improve 
tracking  within  an  ATR  system.  Further,  in 
ATR  systems  and  image  understanding,  motion 
analysis  has  not  been  used  in  close  coopera¬ 
tion  with  recognition  to  improve  the  recogni¬ 
tion  rates  [4,5].  Also,  very  little  work  has  been 
done  where  recognition  helps  to  improve  motion 
analysis. 


The  idea  is  to  consolidate  recognition  and  mo¬ 
tion  analysis. 

(a)  Use  3-D  depth  information  to  improve  tar¬ 
get  tracking  through  occlusion  and  high  clutter 
situations  which  in  turn  will  improve  recogni¬ 
tion  performance. 

(b) Integrate tracking and recognition functionalities,
so that each improves the performance of
the other. As an example, in the long range
detection of aircraft signatures, the IR signature
of an incoming threat target is detected using
a highly stabilized, high sensitivity IR sensor
and by separating true targets from the background.
Such a sensor is called an IRST (InfraRed
Search and Track). Using the IRST
sensor, target identity and range are passively
estimated using the known motion of the sensor
and the assumed motion of detected incoming targets.
The idea is to use something like IRST
for ground or surface targets.

(a) and (b) as described above can be
developed  independently  or  in  conjunction  with 
each  other. 

•  Model-based  ATR 

Model-based  ATR  is  an  extremely  important 
research  area  and  requires  much  advancement. 
In the IU community, mostly visible and some
SAR and laser data have been used [6]. We need
to develop expertise in other sensors (such as
FLIR, SAR, MMW, etc.) which are so vital to
the development of practical systems [9]. Further,
the IU community has mainly dealt with
geometric models of objects and limited (visible,
SAR) sensor models. Many things are desired
here.

Model-based  ATR  involves  developing  not  only 
the  geometric  models  of  the  targets,  but  also 
models  for  sensors,  clutter,  background,  atmo¬ 
spheric  physics  and  countermeasures. 

A capability to generate and predict in real
time the signatures of targets under varying
conditions will be extremely valuable, as will
clutter and background modeling.

Accurate and efficient development of sensor
(e.g. thermal, SAR waveform, multispectral,
laser, etc.) models and object models, and their
use in model-based recognition and tracking,
will be important in enhancing recognition
performance and reducing false alarms. It
will also be desirable to quantify the effect of sensor
improvement on the performance of the
system and to validate the various models used in the
recognition process.


Developing models of algorithm behavior that relate
input data and scene properties to algorithm
behavior will be valuable. Also important are
approximate model matching techniques in a
dynamic environment which may be suited for
explicit control, and fast, parallel implementation.

•  Robust  Algorithms 

“Robustness” as needed for ATR tasks is distinct
from how it is perceived in the IU community.
“Robustness” of an ATR system is the
measure of insensitivity to deviations in the
assumed input conditions to the system. One
way to characterize the “robustness” of the algorithms
could be to relate the input information
to the performance of the algorithms. A
measure of “robustness” could be the number
of false alarms for the same set of targets, but
under different environmental conditions. The
terms “robustness” and “multiscenario” are related.
“Multiscenario” implies “generality”
of algorithms to work under varying environmental
conditions in the outdoor scenario.
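One way to make the false-alarm measure above concrete is sketched below. It assumes false-alarm counts have already been collected for the same target set under several environmental conditions; the condition names and counts are hypothetical, and using the spread between best and worst condition as the summary statistic is an illustrative choice, not a definition from the text.

```python
def false_alarm_spread(false_alarms_by_condition):
    """Difference between the worst and best false-alarm counts.

    A smaller spread suggests less sensitivity to the environmental condition.
    """
    counts = list(false_alarms_by_condition.values())
    return max(counts) - min(counts)


def worst_condition(false_alarms_by_condition):
    """Condition under which the algorithm produced the most false alarms."""
    return max(false_alarms_by_condition, key=false_alarms_by_condition.get)


# Hypothetical counts for one algorithm run on the same set of targets.
counts = {"clear day": 2, "night": 3, "fog": 11}
spread = false_alarm_spread(counts)
worst = worst_condition(counts)
```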

Robust  algorithms  for  segmentation  and  3-D 
object  recognition  are  desired.  Since  a  target 
may  give  rise  to  a  limitless  variety  of  images 
(varying  viewpoint  and  environmental  condi¬ 
tions),  it  is  important  to  use  all  the  available 
knowledge  and  sensor  information  to  accom¬ 
plish  the  system  goals. 

•  Adaptive  Algorithms 

The use of adaptive algorithms that can learn
and adapt to varying environmental conditions
will have one of the greatest impacts on ATR
performance. No matter how sophisticated an
algorithm may be, it will always have some imagery
or conditions that will cause it to break
down unless it has inherent learning capability.
“Adaptiveness” ensures “robustness” as defined
above. “Adaptive algorithms” are indeed
“robust algorithms.” However, adaptive
algorithms may also possess self-calibration and
learning capability, which is not a fundamental
requirement for robust algorithms. Adaptive algorithms
are desired for segmentation, feature
extraction and object recognition [2,7]. Techniques
for rapidly training these subsystems are
also desired.

• Architectures for Integration of Auxiliary
Information and Multisensors

Multisensor integration technology must be
augmented with a priori information to provide
improvement in target detection and classification
performance. Incorporating auxiliary information
such as maps, navigational sensors, meteorological
data, altimeter, etc. into the recognition
and tracking process will be extremely valuable
in enhancing recognition and reducing false
alarms. We need to develop a flexible open system
architecture that will allow integration of
knowledge databases.

(a) Integration - Mission-based, model-based
integration of sensors, relevant data and information.
Multisource information integration
may include intelligence, mission planning,
evidential reasoning, geography, doctrine, in-flight
updates, previous sightings, meteorology,
and models for target, clutter, countermeasures,
sensors, etc. The integration process should
allow for mission specifics so as to follow a suitable
recognition strategy based on available resources.
All this will ensure the effectiveness of
performance.

(b) Knowledge databases - Good target recognition
will not be achieved unless one updates
knowledge databases dynamically so as
to adapt to changing situations. Currently,
the necessary software technology base is far behind,
which makes such an architecture difficult to use.

• Recognition and Guidance of Sensors

There are many mission scenarios where recognition
and guidance of available sensors are
closely related. In the IU community, active vision
and task-based vision paradigms have been
popular for several years. For this technology to
have an impact on the ATR problem, we need to
apply it to outdoor scenarios,
where the problem complexity is significantly
higher than in indoor scenarios.

• Parallel Algorithms and VLSI

Parallel  algorithms  and  mechanisms  for  map¬ 
ping  algorithms  to  specialized  architectures 
need  to  be  developed.  Real-time  digital/analog 
VLSI  implementation  of  algorithms  and  sub- 
systems/modules  useful  in  various  applications 
need to be developed. Much of this can be done
through university/industry collaboration, by developing
and expanding working relationships
between industry and universities with support
from the government.

•  Real-time  Performance  Evaluation 

We  need  to  develop  the  framework  for  building 
vision  systems  so  that  real-time  evaluation  and 
performance  characterization  of  algorithms  can 
be  done  in  laboratory  prototypes  or  in  actual 
field tests. This framework needs to be common
among various sites so that useful pieces
from various researchers can be integrated
together.

The availability of a multisensor, registered
database of images and associated ground truth
to IU researchers will be very valuable. Standard
data sets for various applications can help
focus the research efforts, allow comparisons
and expand the expertise of the IU community
to ATR problems.

•  Man-machine  Interfaces 

Suitable interfaces need to be developed to apply
image understanding and human factors
technology to “person in the loop” situations.

6  Conclusions 

The  development  of  automatic  target  recognition 
systems  that  can  perform  under  dynamic  environ¬ 
mental  conditions  will  be  of  great  practical  signif¬ 
icance  in  many  DoD  applications  such  as  reducing 
workloads of pilots and tank commanders. We hope
that the image understanding problems
discussed in this paper will lead to further discussions
of image understanding research for ATR
applications. All this will result in scientific advancements
in the IU field and more effective and robust
ATR systems in the future. The technology developed
will also be applicable to other areas of IU
applications such as navigation, photointerpretation
and robot vision.

Abstract

Because  of  poor  communications  between 
Image  Understanding  (IU)  and  Automatic 
Target  Recognition  (ATR)  communities,  IU 
research  has  had  a  negligible  impact  on  ATR 
developers.  In  this  paper  we  will  provide 
some possible applications of IU technology
that  can  assist  in  ATR  problem  solutions. 
First,  we  define  the  Current  ATR 
Dilemma  and  what  makes  the  ATR  problem 
a  difficult  challenge.  Following  this 
discussion,  we  highlight  ATR  Research 
Challenges  and  present  applicable  IU 
research. 

1.  THE  CURRENT  ATR  DILEMMA 

Although  Automatic  Target  Recognition 
(ATR)  systems  must  function  in  an  extremely 
noisy  environment,  the  tools,  techniques,  and 
methods  available  to  them  work  only  in  fairly 
sterile  or  well  bounded  domains.  Recently, 
model  based  approaches  have  been  used  to 
improve  performance.  Even  using  these 
methods,  rigidly  constructed  models  do  not 
produce  the  flexibility  or  relaxation  methods 
for  true  multiscenario  recognition. 

In the past, ATR developers believed that the
primary ATR performance limitations were
real-time hardware, sensor performance,
and the availability of data. This assumption
is no longer true. Commercial hardware is
now available. The belief that sensor
resolutions do not allow internal target or
background details to be seen is also untrue.
New  sensors  can  provide  fine  grained  target 
and  environment  information  that  not  only 
make  newer  approaches  feasible  but 
necessary  as  well.  Blob  detection  is  no 
longer  the  key  to  target  recognition;  gathering 
clues  of  target  identity  and  reaching  an 
educated  conclusion  is  now  practical  and 
central  to  future  improvements  in  ATR 
performance.  Data  availability  is  still  an 
issue.  Data  collections  are  expensive  and  are 
often  classified,  which  limits  their 
availability. But enough data exists that, if
needed, a sufficient database can be
developed for meaningful ATR development
programs.

Today,  the  challenge  is  to  show  that 
collaboration  between  the  ATR  developers 
and  the  IU  community  is  mutually  beneficial. 

•  ATR  research  and  development  has  much 
to  gain  from  teaming  with  the  IU 
community.  Many  of  the  promising 
ATR  approaches  have  been  derived 
indirectly  from  the  IU  community 
(graduate  students'  movement  into 
industry). 
• The IU community can gain equally from
exposure to the challenges facing ATR
developers. Just as the ALV application
drove research in the 1980s, the ATR
application can drive research in the
1990s.

These  challenges  are  open  opportunities.  The 
criticality  of  the  goal  was  amply  demonstrated 
in  Desert  Storm.  An  early  success  in  moving 
research  to  a  fieldable  end  product  for  ATRs 
will justify the IU program as few other
arguments  will. 

1.1  ATR  Applications 

ATRs  have  many  uses.  In  this  section  we 
will  describe  their  proposed  applications  and 
the  hostile  environment  in  which  ATRs  are 
being  developed  to  operate. 

All  vision  problems  are  hard,  but  not  all  are 
hard  for  the  same  reason.  When  looking  at 
Vision Research as a scientific endeavor,
domains  can  be  simplified  so  that  the  essence 
of  the  problem  being  investigated  can  be 
isolated  and  studied.  Experimenting  and 
developing  the  means  to  represent  and 
capture  meaningful  information  offers 
unending  opportunities  for  a  researcher — 
even  in  the  simplified  domain  that  is  being 
investigated.  The  results  can  often  be 
appropriated  for  other  applications  where  the 
simplifying  assumptions  made  may  be 
directly  transferred  into  the  domain. 
However,  this  is  not  the  case  in  ATR 
applications. 

If  the  domain  itself  is  unconstrained,  then  the 
research  has  to  focus  not  so  much  on 
representation,  but  on  control.  The 
fundamental  challenge  is  to  determine  how  a 
vision  system  should  function  in  order  to 
handle  the  variability  inherent  in  complex, 
outdoor contexts, not for relatively simple
tasks such as Landmark Recognition, but for
problems such as recognizing concealed
relocatable high valued targets. Until now,
solutions have been constrained by
the lack of adequate hardware, so IU
solutions would have been impractical.
Sensor resolutions have not permitted IU
exploitation. Today hardware and resolution
are no longer problems. So the challenge
now is to develop the techniques that
can exploit features found in ATR
application domains.

ATR applications can be described as:

• Penetrator/Missile — Shoot-to-kill, move-to-kill,
move-to-position, search, and
kill;

• Land Vehicle — RSTA, direct fire, indirect
fire, surveillance, reconnaissance,
perimeter defense;

• Air Vehicle — Cueing, indirect fire,
surveillance;

• Space Borne — Surveillance,
reconnaissance.

While  each  of  these  applications  has  its  own 
set  of  challenges,  all  must  deal  with  a  set  of 
general  conditions: 

• Scene Contents — Stationary targets,
moving targets, one-on-few, one-on-many;

• Target Types — Clutter conditions, likely
ranges;

• Scenario Variables — Weather conditions,
day/night, varying terrain;

• Battlefield Conditions — Countermeasures,
dust/smoke/fires, occlusion.

1.2  Smart  Weapons  Application 

Smart  weapons  will  be  used  to  conduct 
operational  missions  too  difficult  and/or 
dangerous  for  manned  systems  in  battle 
arenas  such  as  Close  Air  Support,  Battlefield 
Air  Interdiction,  Deep  Strike,  Offensive 
Counter  Air,  Defensive  Counter  Air, 
Surveillance/Sigint/Comint  and  Communica¬ 
tions.  Their  primary  mission  is  to: 

1.  Seek  out  enemy  threats 

2.  Select  appropriate  targets 

3.  Report/destroy  them. 

Moreover, smart weapons will have to
perform these tasks in a cluttered
environment with dust and other
environmental factors affecting performance,
where the target or threat is actively
concealed. A smart weapon will have to be able to move
into an area, search it to find the
threats/targets, select an appropriate one if
necessary and then destroy it. Stealth is
important; planning is important; but
sensing, situation awareness, and
searching are critical.

Two kinds of platforms are typically
employed: an Autonomous Air Vehicle for
long hauling and searching, and an Intelligent
Munition for final targeting and delivery.
Typically,  multisensor  suites  are  used 
containing  MMW,  LADAR,  and  FLIR 
sensors.  Multiple  sensors  are  needed 
because  multiple  bandwidths  and  active 
systems  provide  ranging  and  spectral 
information,  under  varying  scene  and 
atmospheric  conditions,  that  increase  the 
probability  that  the  mission  will  succeed. 

2.  RESEARCH  CHALLENGES 

Primarily,  ATR  research  challenges  are 
concerned  with  finding  ways  to  identify  and 
classify  targets  under  various  environmental 
and  scene  conditions.  Different  sensory 
modalities  offer  different  advantages  and 
disadvantages.  The  task  usually  determines 
which  sensor  or  specific  combination  of 
sensors  is  used. 

2.1  Target  Signature  Variations 

Target  signature  variations  in  FLIR  and  radar 
data  arise  from  the  target  being  in  different 
states.  The  different  states  are  produced  by 
varying  conditions  in  the  scene.  For  FLIR 
imagery,  the  influence  of  the  atmospheric 
effects  is  not  well  understood.  Some  of  the 
variables  are: 

•  Vehicle  Orientation  Variables — 

-  Aspect  angle

-  Image  plane  rotation 

-  Depression  angle 

•  Vehicle  Thermal  Variables — 

-  Engine  (running  /  not  running) 

-  Vehicle  motion  (stationary  /  exercised) 

•  Environmental  Variables — 

-  Solar  loading  (nighttime  /  daytime) 

-  Sun  angle 

-  Wind 

-  Obscuration  (fog  /  rain  /  dust  /  smoke) 


2.1.1  FLIR  Signature  Variations 

Heated  objects  emit  light  having  a  spectrum 
characteristic  of  their  temperature.  In  a 
narrow  range  of  wavelengths  in  the  far 
infrared  region  of  the  spectrum,  the 
temperature  of  an  object  can  be  determined  by 
measuring  the  intensity  of  that  light.  Hotter 
objects  emit  higher  intensity  than  cooler 
objects.  This  is  the  basis  for  infrared  sensors. 
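The relationship between temperature and emitted intensity can be sketched with Planck's law for an ideal blackbody. The example below is only illustrative: it assumes unit emissivity, an 8 to 12 micrometer band, and hypothetical temperatures of 290 K for background and 350 K for a hot engine surface, none of which come from the text. Real targets have emissivity below one and are viewed through the atmosphere.

```python
import math

H = 6.62607015e-34   # Planck constant, J s
C = 2.99792458e8     # speed of light, m/s
K_B = 1.380649e-23   # Boltzmann constant, J/K


def spectral_radiance(wavelength_m, temperature_k):
    """Planck's law: blackbody spectral radiance in W / (m^2 sr m)."""
    prefactor = 2.0 * H * C**2 / wavelength_m**5
    exponent = H * C / (wavelength_m * K_B * temperature_k)
    return prefactor / math.expm1(exponent)


def band_radiance(temperature_k, lo_um=8.0, hi_um=12.0, steps=400):
    """Radiance in W / (m^2 sr) integrated over a band by the trapezoidal rule."""
    lo, hi = lo_um * 1e-6, hi_um * 1e-6
    dx = (hi - lo) / steps
    total = 0.5 * (spectral_radiance(lo, temperature_k)
                   + spectral_radiance(hi, temperature_k))
    for i in range(1, steps):
        total += spectral_radiance(lo + i * dx, temperature_k)
    return total * dx


background = band_radiance(290.0)   # assumed ambient background temperature
engine = band_radiance(350.0)       # assumed hot engine surface temperature
contrast_ratio = engine / background
```

Because in-band radiance rises monotonically with temperature, a sensor that measures it can rank surfaces by temperature, which is the property the paragraph above relies on.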

In  Forward  Looking  Infrared  (FLIR) 
imagery,  tactical  vehicles  such  as  tanks, 
APCs,  and  trucks  usually  appear  brighter 
than  their  surrounding  background  due  to  the 
greater  thermal  emission  from  the  targets 
compared  with  the  background.  In  FLIR 
imagery,  the  contrast  is  caused  by  differences 
in  temperatures  rather  than  differences  in 
intensities  of  reflected  light  as  in  visible  band 
imagery. 

A new generation of infrared sensors having
higher spatial resolution and greater thermal
sensitivity has been developed. Second
generation FLIR sensor image quality is
significantly better than that of the previous
generation FLIR sensors. A target’s
structural details are clearly visible at longer
ranges and with higher fidelity. Object
recognition algorithms developed within the
IU community for visible imagery are of
great interest within the ATR community. As
in the case of visible imagery, an object's
appearance varies greatly depending upon
several object and environmental factors.

The  appearances  or  signatures  of  tactical 
vehicles  in  FLIR  are  highly  variable 
depending  on  factors  attributable  to  the 
vehicle,  the  sun,  and  the  atmosphere.  In 
addition  to  thermal  emissions  from  the 
targets,  there  are  also  emissions  from  the 
surrounding  background  caused  by  reradiated 
heat. 

Figure 1 shows examples of three different
targets in two signature states. In the upper
row, all of the targets have their engines
running and are exercised. Their FLIR
signatures are bright at the engines and
exhausts, because of thermal conduction from
engine combustion, and at the wheels and
treads, because of frictional heating.
In the lower row, the engines are not running
and so there is no thermal signature from the
engine. The targets are not exercised so there
is no frictional heating from the wheels or
treads.

Figure 2 shows the effect of aspect angle
variations on a target's appearance in FLIR.
This figure shows six views of an M35
truck. The engine is visible in all of the front-facing
aspect angles but not in the rear-facing
aspects, which makes the truck more difficult to
recognize from the rear. Notice how the position of the
exhaust pipe is a strong clue to the vehicle's
aspect angle.

ATR  research  has  focused  on  solving  the 
recognition  problem  for  all  of  the  different 
vehicle  orientation  and  thermal  variables. 
However, because of the difficulty in solving
this constrained class of the problem, less
emphasis has been placed on solutions
involving variations of the environmental
variables.

FLIR Modeling Software

Infrared  modeling  research  has  been 
conducted  for  many  years  towards 
understanding  the  impacts  on  both  targets  and 
backgrounds  of  the  state  variables  listed 
above.  An  example  of  recent  progress  in 
FLIR modeling for targets is PRISM.
PRISM  incorporates  geometrical  models  of 
vehicles  with  efficient  approximations  to  the 
solutions  of  first  principles  heat  transfer 
equations.  It  is  an  effective  tool  for 
visualizing  target  signature  variations  given 
the  a  priori  values  of  many  target  and 
environmental  variables.  Research  is 
continuing  toward  accurately  predicting  the 
signatures  of  targets  in  simulations  of 
complex  real-world  backgrounds. 


Figure 1. Examples of Signature Variation. a.) M60 tank with hot engine and treads, b.) M60
tank with cold engine and treads, c.) M113 APC with hot engine and hatch, d.) M113
APC with hot engine but cold hatch, e.) M35 truck with hot engine, exhaust, and bed,
f.) M35 truck with hot engine and exhaust but cold bed.

Figure 2. Six Aspect Angles of an M35 Truck Showing its Changing FLIR Signature as a
Function of Aspect.


2.1.2  Radar  Signature  Variations 

In the Millimeter Wavelength (MMW) region
of the electromagnetic spectrum, radar
systems have been designed that have a
narrower beamwidth than the more prevalent
tracking radars that operate in longer
wavelength bands. MMW radars provide higher
spatial resolution, making it
possible to resolve tactical targets at ranges
out to several kilometers. Radar systems
have  recently  been  developed  that  employ 
MMW  technology  and  provide  signal  data  in 
the  forms  of  high  range  resolution  profiles, 
spatial  super  resolution  returns,  polarimetric 
data,  and  Doppler.  In  all  of  these  cases,  the 
radar  pulses  are  reflected  by  targets  and 
clutter. 

Reflections  from  the  discrete  scatterers  that 
make  up  the  high  range  resolution  signal  have 
a  very  narrow  reflection  peak.  The  narrow 
width  of  the  peak  is  caused  by  the  nature  of 
the  reflection,  mainly  specular,  dihedral,  and 
trihedral;  all  of  these  have  very  narrow  peaks. 

Figure 3 shows two plots of MMW radar high range
resolution (HRR) profiles from an M60 tank. The axes
represent the range in meters and the aspect
angle  in  number  of  looks  at  the  target  where 
looks  are  spaced  approximately  0.02  degrees 
apart.  The  aspect  angle  varies  only  by  1.1 
degrees  over  the  plots.  The  prominent  peaks 
in  the  signal  appear  for  only  about  one  look 
or  0.02  degrees  and  then  become  less 
prominent,  yielding  to  other  new  peaks.  This 
behavior  is  consistently  observed  in  HRR 
profile  plots.  Unfortunately,  this  behavior 
makes  the  recognition  problem  extremely 
difficult. 

Figure  4  shows  the  length  feature  as  a 
function  of  target  aspect  angle.  The  length 
feature  is  extracted  from  an  HRR  profile  by 
locating  the  first  and  last  bins  in  the  profile. 
In  this  figure,  each  aspect  bin  is  a  histogram 
of  the  target  length  for  that  aspect  angle 
computed  over  many  looks  at  the  target.  The 
length feature is quite noisy. There appears
to be a dominant length for each aspect that,
when traced over all aspects,
gives the characteristic aspect angle
signature of the target.
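The length feature and its per-aspect histogram can be sketched as follows. The sketch assumes that the "first and last bins" are the first and last range bins whose amplitude exceeds a detection threshold, and that the dominant length is the mode of the histogram over looks; the threshold, bin size, and sample profiles are hypothetical.

```python
from collections import Counter


def length_feature(profile, threshold, bin_size_m):
    """Target length from the first and last range bins above threshold.

    Returns None when no bin exceeds the threshold.
    """
    occupied = [i for i, amplitude in enumerate(profile) if amplitude > threshold]
    if not occupied:
        return None
    return (occupied[-1] - occupied[0] + 1) * bin_size_m


def dominant_length(profiles, threshold, bin_size_m):
    """Most frequent length over many looks at one aspect angle."""
    lengths = [length_feature(p, threshold, bin_size_m) for p in profiles]
    histogram = Counter(value for value in lengths if value is not None)
    if not histogram:
        return None
    return histogram.most_common(1)[0][0]


# Three hypothetical looks at the same aspect; scatterers fade in and out.
looks = [
    [0, 0, 5, 1, 7, 6, 0, 0],   # bins 2..5 above threshold
    [0, 0, 6, 2, 1, 8, 0, 0],   # bins 2..5 above threshold
    [0, 0, 0, 4, 9, 1, 0, 0],   # leading and trailing scatterers missing
]
dominant = dominant_length(looks, threshold=3, bin_size_m=0.5)
```

The third look shows how a faded end scatterer shortens the measured length, which is one source of the noise described above.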

Figure 3. MMW Radar HRR Plots for a Tank. Range axis is in meters while the aspect axis is in
looks  separated  by  approximately  0.02  degrees  in  aspect  for  a  total  aspect  change  of  1.1 
degrees.  Notice  how  irregular  the  peaks  are  from  look  to  look. 



Figure  4.  Histograms  of  the  Length  Feature  Extracted  from  HRR  Profiles  for  a  Tank.  Each 
aspect angle bin is a histogram of the length feature from numerous looks at the target at
the  associated  aspect.  Although  there  is  a  trend  visible  over  all  aspect  angles,  each 
aspect  bin  exhibits  a  large  amount  of  uncertainty. 

Radar Modeling Software

Research for predicting accurate radar
signatures from geometrical models of
vehicles has progressed recently. One
example is the TRAK radar modeling
package developed by Georgia Tech
Research Institute. TRAK couples
multimode, multiband radar modeling with
geometrical vehicle models built using the
MAX geometrical modeler.

2.2  Target  Occlusions 

The most frequently encountered occlusions
occur when terrain between a sensor and
target occludes the target. This type of
occlusion is of concern in ground-to-ground
scenarios (both the sensors and the target are
on the ground). Other types of occlusion that
occur in air-to-ground scenarios are caused
by other man-made or natural objects. Trees,
rocks, buildings, and other targets are
occluding objects. Neither the percentage of
the target that is occluded nor which part is occluded is
known a priori, which complicates
recognition.

2.2.1  Dynamic  Occlusions 

Dynamic  occlusions  occur  whenever  the 
occlusion  changes  over  time.  The  situations 
in which this happens are:

1. A moving target moves behind stationary
objects or another moving target.

2. A moving target occludes a stationary
target.

Dynamic  occlusions  begin  with  an 
unoccluded  target  becoming  progressively 
more  obscured  over  several  frame  times  until 
a  point  of  maximum  occlusion  after  which  the 
target  becomes  progressively  less  obscured 
until  it  is  once  again  unoccluded.  It  is 
reasonable  to  assume  that  since  the  target  will 
reappear  in  a  short  time,  it  is  unnecessary  to 
continually  recognize  the  target  through  the 
occlusion.  Rather,  a  target  tracking  algorithm 
might  be  employed  to  coast  the  target  state 
through  the  occlusion  and  then  continue 
performing  recognition  once  the  occlusion  is 
over. Dynamic occlusion is therefore less serious
than static occlusion.
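
The idea of coasting a target state through an occlusion can be illustrated with a minimal sketch. It assumes a 2-D constant-velocity state and a simple fixed-gain position update; the names `Track` and `step`, the gain value, and the choice to leave velocity unchanged are hypothetical and are not taken from any particular tracker.

```python
from dataclasses import dataclass
from typing import Optional, Tuple


@dataclass
class Track:
    """Hypothetical 2-D target state: position and velocity in image coordinates."""
    x: float
    y: float
    vx: float
    vy: float
    coasted_frames: int = 0  # consecutive frames without a measurement


def step(track: Track, measurement: Optional[Tuple[float, float]],
         dt: float = 1.0, gain: float = 0.5) -> Track:
    """Advance the track by one frame.

    If the target is occluded (measurement is None), the state is coasted on
    the constant-velocity prediction alone. Otherwise the predicted position
    is pulled toward the measurement by a fixed gain (assumed for this sketch).
    Velocity is left unchanged to keep the example short.
    """
    px = track.x + track.vx * dt
    py = track.y + track.vy * dt
    if measurement is None:
        return Track(px, py, track.vx, track.vy, track.coasted_frames + 1)
    mx, my = measurement
    return Track(px + gain * (mx - px), py + gain * (my - py),
                 track.vx, track.vy, 0)
```

Recognition would be skipped while `coasted_frames` is nonzero and resumed at the coasted position once measurements return. A practical system would also bound the number of coasted frames before dropping the track.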

2.2.2  Static  Occlusions 

Static  occlusions  do  not  change  over  time 
because  there  is  no  relative  motion  between 
target  and  occluding  object  or  terrain.  In  this 
situation,  recognition  must  be  performed 
using  only  the  visible  portion  of  the  target. 


This  is  perhaps  one  of  the  most  compelling 
justifications  for  using  detailed  target  models 
to  perform  recognition  tasks.  Since  the  part 
of  the  target  that  is  occluded  cannot  always  be 
predicted,  the  recognition  algorithm  should 
be  able  to  use  whatever  portion  is  unoccluded 
to  recognize  the  target. 

2.3  Sensor  Fusion 

The generic benefits of fusing information
from sensors of different types are well
known: developers can take advantage of the
specific  qualities  of  each  type  of  sensor  to 
achieve  performance  that  is  superior  to  that  of 
each  individual  sensor.  The  challenge  for  the 
ATR  developer  is  to  determine  when  to  fuse 
and  at  what  level  in  the  processing  hierarchy. 

Sensor  fusion  has  not  been  a  primary  focus 
of IU research. However, examples of it
have  recently  been  published  by  researchers 
at  UPENN,  CMU,  and  SRI.  UPENN,  for 
example,  has  completed  some  elegant  work 
on  the  mathematical  foundations  of 
fusion"*’^’^.  Work  at  CMU  in  sensor  fusion 
has  focused  on  fusing  3D  data  from  a 
LADAR  with  color  imagery  for  vision  aided 
navigation  of  the  NAVLAB^>*.  SRI  has  used 
fusion  techniques  to  good  purpose  in  their 
work  on  stereo^. 

In  the  ATR  community,  sensor  fusion  is 
performed  at  several  distinct  levels  in  the 
processing  hierarchy:  the  data  association 
level,  the  information  fusion  level,  and  the 
decision  level. 

At  the  data  association  level,  the  association 
problem  is  due  to  the  vastly  different  spatial 
resolutions  of  radar  and  FLIR  sensors. 
Radar  spatial  resolution  is  approximately  an 
order  of  magnitude  lower  than  that  of  FLIR. 
A  symptom  of  this  problem  occurs  whenever 
targets  are  closely  spaced  such  that  a  single 
radar  return  contains  reflected  energy  from 
multiple  targets  while  the  targets  are  resolved 
by  the  FLIR.  The  association  problem  for 
sensors  having  differing  spatial  resolutions 
has not been dealt with in the IU community.

At  the  information  fusion  level,  information 
from  each  of  the  sensors  has  already  been 
extracted  and  is  to  be  fused  in  order  to  form 
hybrid  information  that  can  aid  in  improving 
the discrimination of recognition algorithms.
Figure  5  is  an  example  of  fusion  at  this  level. 

Two  sets  of  radar  plots  are  shown  along  with 
two  sets  of  FLIR  images.  The  radar  plots 
look  very  similar  and  yet  they  are  correlated 
to  two  very  different  FLIR  images.  On  the 
other  hand  two  similar  FLIR  images  correlate 
to two very different radar plots. In the first
case, supporting radar recognition with FLIR
information would help distinguish the two
targets. In the second case, the radar would help
determine whether the two targets recognized
differed only in aspect angle. The problem is
not how to fuse the information, since the
methods are well known. What is not
understood is when to fuse the information
given the situational context and at what level
(feature level, recognition level, detection
level, etc.).

At  the  decision  level,  the  benefits  from  fusion 
come  from  fusing  interpretations  of  the  data 
from  the  radar  and  FLIR  sensors  using 
associated  confidences.  However,  this  level 
of  fusion  has  traditionally  required 
knowledge of a priori probabilities for the
target classes and class conditional densities
that  are  empirically  derived.  The  way  a 
developer  makes  fusion  algorithms  work  is 
by  tuning  (tweaking)  them  until  they  achieve 
the  desired  results.  Success  is  specific  to  the 
particular  data  set  used  and  the  ability  of  the 
developer  to  make  enough  of  the  right 
parameters  available  to  tune. 
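
As an illustration of why decision-level fusion depends on a priori probabilities and class conditional densities, the following minimal sketch combines per-sensor likelihoods with Bayes' rule. It assumes the radar and FLIR evidence are conditionally independent given the target class; the function name `fuse_decisions`, the class labels, and the numbers in the usage note are hypothetical.

```python
from typing import Dict


def fuse_decisions(priors: Dict[str, float],
                   radar_likelihood: Dict[str, float],
                   flir_likelihood: Dict[str, float]) -> Dict[str, float]:
    """Posterior probability of each target class given both sensors.

    Assumes the two sensors are conditionally independent given the class,
    so the joint likelihood is the product of the per-sensor likelihoods.
    """
    unnormalized = {
        c: priors[c] * radar_likelihood[c] * flir_likelihood[c] for c in priors
    }
    total = sum(unnormalized.values())
    if total == 0.0:
        raise ValueError("no class is supported by the evidence")
    return {c: v / total for c, v in unnormalized.items()}
```

For example, with equal priors for two classes, an ambiguous radar likelihood of 0.4 for both, and FLIR likelihoods of 0.9 and 0.1, the posterior favors the first class at roughly 0.9. The result is only as good as the priors and densities supplied, which is the tuning burden described above.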

ATR  developers  have  been  relying  more 
heavily  on  models  both  to  alleviate  the 
problem  of  not  having  enough  development 
data  available  and  also  to  gain  detailed 
understanding  of  the  problem. 

2.4  Articulated  Targets 

Tactical  targets  consist  of  moving  parts,  some 
of  which  are  large  enough  to  cause  a  change 
in  the  target’s  appearance  when  they  are 
moved. Tank turrets are a simple example;
others are trailers towed by trucks and
missiles mounted on missile launchers. In
the  case  of  articulation,  the  component  of  a 
target  that  could  be  rotated  is  known  a  priori, 
but  the  amount  of  rotation  is  not  known. 


[Figure 5 panels: a.) m35 at 0 degrees aspect looks similar to m113 at 55 degrees aspect
in MMW but not in IR. b.) m60 at 55 degrees aspect looks similar to m2 at 55 degrees
aspect in IR but not in MMW.]

Figure 5. Examples Showing the Potential Benefits of Sensor Fusion. a.) The two MMW radar
HRR plots are ambiguous but the FLIR images of the targets are quite different,
indicating that the FLIR image helps disambiguate the targets. b.) The FLIR images are
ambiguous while their MMW radar HRR plots are quite different, indicating that the
radar data helps disambiguate the targets.




Figure  6  shows  an  example  of  two  tanks  with 
and  without  rotated  turrets.  Tanks  are  often 
seen  in  motion  with  their  turrets  facing  the 
rear.  This  is  done  so  that  the  driver  has 
adequate clearance to escape from the driver’s
compartment in case of an emergency. Tanks
are also sometimes observed in this
configuration while parked.

As  in  situations  involving  occlusions,  it  is 
important  to  make  use  of  partial  target  models 
that  predict  the  possibility  of  articulated 
component  rotation. 


Figure  6.  Two  Tanks  With  and  Without 
Rotated  Turrets. 

3. A CHALLENGE FOR IU

3.1  Target  Signature  Variations 

Target signature variations can be treated as
deriving from different objects that require
different models for their recognition. This
augmented model base approach increases the
amount of search needed for recognition
according to the number of signature
variations. Treating the problem in this
fashion calls for very efficient search
strategies, possibly involving precompilation
of object models into efficient search
strategies. The IU community has been active
in this study for the past several years and
could have a positive impact on ATR research
in this area.

Alternatively, signature variations can be
treated as deriving from instances of the same
object under different illuminations. In this
case, there is a need to exploit the properties of
the object that are invariant to illumination, or
to thermal signature in the case of FLIR. The IU
community has also done research in this area.

Choosing  an  approach  depends  upon  the 
degree  of  ambiguity  between  objects  in  the 
model  base,  the  size  of  the  model  base,  and 
the  search  complexity  and  time  constraints 
placed  on  the  algorithm  by  the  processor. 

3.2  Occlusions 

The  IU  community  has  focused  on  the 
problem  of  occluded  object  recognition. 
Many  approaches  have  been  developed  that 
make  use  of  partial  object  model  matching  to 
occluded  object  data.  Some  examples  are: 

•  The  recent  work  on  perceptual  grouping 
of  image  features  prior  to  model 
matching^^ 

•  Searching  the  interpretation  tree  with  null 
model  matches^  ^ 

• The use of Hough transforms and
geometric hashing techniques to limit the
number of possible object
hypotheses.
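
To make the second item concrete, here is a minimal sketch of an interpretation tree search that allows null matches. It assumes point features and uses only pairwise distance consistency, so it ignores the richer constraints a real recognizer would apply; the function names and the tolerance are hypothetical.

```python
from math import dist
from typing import List, Optional, Sequence, Tuple

Point = Tuple[float, float]
Assignment = List[Optional[int]]  # model index per data feature, None = null match


def consistent(assignment: Assignment, data: Sequence[Point],
               model: Sequence[Point], tol: float) -> bool:
    """Check the newest pairing against every earlier non-null pairing."""
    i = len(assignment) - 1
    m_i = assignment[i]
    if m_i is None:          # a null match is always allowed
        return True
    for j, m_j in enumerate(assignment[:-1]):
        if m_j is None:
            continue
        if m_j == m_i:       # each model feature may be used only once
            return False
        if abs(dist(data[i], data[j]) - dist(model[m_i], model[m_j])) > tol:
            return False
    return True


def interpret(data: Sequence[Point], model: Sequence[Point],
              tol: float = 0.1) -> Assignment:
    """Depth-first search for the interpretation with the most real matches."""
    best: Assignment = []
    best_score = -1

    def search(assignment: Assignment) -> None:
        nonlocal best, best_score
        matched = sum(m is not None for m in assignment)
        remaining = len(data) - len(assignment)
        if matched + remaining <= best_score:
            return           # this branch cannot beat the best leaf so far
        if remaining == 0:
            best, best_score = list(assignment), matched
            return
        for choice in [*range(len(model)), None]:
            assignment.append(choice)
            if consistent(assignment, data, model, tol):
                search(assignment)
            assignment.pop()

    search([])
    return best
```

With the model `[(0, 0), (4, 0), (4, 3)]` and data `[(10, 10), (14, 13), (30, 30)]`, the search pairs the first two data points with model features 0 and 2 and assigns the third, which plays the role of clutter from an occluding object, to the null match. Model feature 1 is left unmatched, as it would be if it were occluded.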

3.3  Sensor  Fusion 

IU can benefit ATR by developing general
methods for determining which information to
fuse (i.e., when to believe one sensor over
another) and when to fuse it (e.g., does it help
to add MMW information if it is raining?). The
overall problem is accounting for context. The
specific problem is determining what context
an ATR is in and, given that context,
determining what can be believed and what
cannot.

3.4  Articulated  Targets 

Articulated  targets  present  a  difficult  problem 
because  of  the  unknown  relative  orientation 
of the separated parts. The problem is made
more complex by the possibility of multiple
articulated  objects  in  the  same  model  base  and 
unconstrained  view  angles.  The  more  recent 
work  on  perceptual  grouping  and  geometric 
hashing  may  prove  useful  in  providing 
solutions  to  this  problem.  Also  needed  are 
geometric  reasoning  approaches  for 
hypothesizing  instances  of  the  known 
articulated  objects  in  particular  relative 
orientations  given  the  partial  descriptions 
from  the  grouped  data. 


4.  CONCLUSION 

An IU program devoted to ATR challenges
promises significant benefit to DoD. The
issues raised in this paper are by no means
exhaustive, representing only a quick look at
the kinds of challenges for and applications
of IU techniques in the ATR domain. A
more  thorough  look  would  no  doubt  produce 
a  longer  and,  perhaps,  more  interesting  list  of 
topics. 
Abstract

Emerging  requirements  for  detailed  spatial  databases  to 
support  a  variety  of  Department  of  Defense  (DoD) 
activities  may  require  substantial  investment  in  basic 
research  to  advance  the  state-of-the-art  in  automated 
cartographic  database  construction  and  maintenance. 
This  paper  provides  a  technical  rationale  for  pursuing  a 
new  initiative  focused  on  the  application  of  image 
understanding  technology  to  automated  cartography  in 
support  of  programs  for  simulation  and  training, 
mission  planning,  autonomous  navigation,  and  the 
analysis  of  intelligence  imagery.^ 

1.  Introduction 

The  goal  of  this  proposed  initiative  is  to  provide  a 
research  focus  in  automated  mapping  and  the 
interpretation  of  remotely  sensed  imagery  to  meet  the 
needs  of  existing  and  emerging  Defense  programs. 
Typically,  each  program  has  a  component  that  relies  on 
the generation and maintenance of accurate spatial data
to  support  decision-making  by  human  military 
personnel  and/or  intelligent  autonomous  agents.  A 
concerted effort focused in this area (akin to the
Strategic Computing Program initiatives) has long-term
relevance to several on-going DoD and Defense
Advanced  Research  Projects  Agency  (DARPA) 
programs.  These  include  major  programs  for  advanced 
distributed  simulation  such  as  Battlefield  Distributed 
Simulation (BDS-D) and Close Combat Tactical
Trainer  (CCTT),  high-fidelity  training  simulators 
represented  by  the  Special  Operations  Forces  Aircrew 
Training  System  (SOF-ATS),  Unmanned  Ground 
Vehicles  (UGV),  and  Research  and  Development  for 
Imagery  Understanding  Systems  (RADIUS).  The 
spatial  data  requirements  for  these  programs  generally 
include  significant  augmentation  of  standard  products 
produced  by  the  Defense  Mapping  Agency  (DMA)  to 
address  critical  issues  of  timeliness,  local  geographic 
intensification,  and  operational  security.  They 
currently  rely  largely  on  manual  and  interactive 
compilation,  which  are  labor-intensive  and  not  well- 
suited  to  demands  of  responsive  database 
intensification  and  maintenance.  Application-specific 
enhancements  are  seldom  shared  with  other  users. 
Moreover,  the  emergence  of  mature  networking 
technology  is  generating  high-level  interest  in 
achieving  interoperability  between  highly 
heterogeneous  systems  that  require  consistent  terrain 
databases  at  widely  varying  levels  of  detail.  Achieving 
and  maintaining  consistency  between  such  diverse 
terrain  databases  is  a  major  research  and  operational 
issue.  Thus,  there  is  a  need  to  advance  the  state-of-the- 
art  in  the  representation,  construction,  intensification, 
and  maintenance  of  spatial  databases  to  support  a 
diverse  and  growing  set  of  DoD  customers. 
1.1. Cartographic Technology ’Backbone’

DARPA  can  take  the  lead  in  developing  a  responsive 
DoD  cartographic  technology  ’backbone’  to  address 
those  areas  where  static  standard  mapping  products 
alone  can  not  provide  either  the  appropriate  level-of- 
detail  or  timeliness.  Currently,  specialized  DoD 
programs  with  requirements  that  can  not  be  fully  met 
by  off-the-shelf  DMA  mapping  products  typically 
pursue  generation  of  program-specific  terrain  databases 
that  are  neither  timely,  inexpensive,  standardized,  nor 
shared.  Col.  David  F.  Maune,  Commander  and 
Director  of  the  U.S.  Army  Topographic  Engineering 
Center  (TEC),  states  the  problem  succinctly: 

"Currently,  the  purchase  price  for  each  simulation 
system  procured  by  the  Department  of  Defense  (DoD) 
includes  the  cost  of  developing  its  own  unique  data 
base,  or  the  cost  of  transforming  data  bases  developed 
by  the  Defense  Mapping  Agency  (DMA)  or  others. 
This  results  in  the  proliferation  of  nonstandard, 
incompatible  data  bases  and  software  which  is  costly 
for  the  government  in  terms  of  both  recurring 
development  and  maintenance  costs.  In  many 
programs,  this  also  means  that  costly  data  bases  are 
only  produced  for  unique  training  scenarios  and  would 
not  be  applicable  for  rehearsals  of  actual  missions  on 
different  pieces  of  real  estate.  We  learned  in 
Operation  Desert  Storm  that  mission  rehearsal 
capabilities  on  the  well-mapped  and  digitized  terrain 
of  the  Fulda  Gap  in  Germany  didn’t  do  much  good  in 
Kuwait." 

Similar  observations  can  be  made  regarding  unique 
databases  developed  for  various  wargaming,  mission 
planning,  command  and  control,  and  intelligence 
applications. 

Several  initiatives  are  underway  to  develop  spatial  data 
exchange  specifications  (e.g.,  DMA’s  Vector  Product 
Format, Air Force Project 2851’s SSDB Interchange
Format  or  SIF)  which  promise  improved  mechanisms 
for  sharing  extracted  spatial  data.  These  provide  a 
framework  for  exploring  the  technical  and 
programmatic  issues  associated  with  generating  and 
maintaining consistent terrain databases in diverse real-time
formats and at varying levels-of-detail.

Our  technology  development  proposal  should  be 
viewed  as  addressing  the  gap  in  the  generation  of 
specialized  and  highly  timely  spatial  data,  not  as 
solving  the  DoD  mapping  problem  in  the  large.  There 
are  several  general  classes  of  high  visibility  DoD 
programs  that  could  benefit  from  an  enhanced 
technical base to support the production and
maintenance  of  cartographic  data.  These  include: 

1.  Simulation,  mission  planning  and  rehearsal, 
training,  and  image  perspective  transformation 
(IPT). 

2.  Advanced  weapon  systems  involving 
autonomous  guidance,  highly  accurate  terminal 
homing,  and  autonomous  target  detection 
(smart  and  brilliant  weapons). 

3.  Unmanned  air  and  ground  vehicles 
(UAV/UGV)  employing  autonomous 
navigation  technologies. 

4.  Aids  for  the  analysis  of  tactical  and  national 
imagery  including  basic  site  mapping  and 
terrain  modeling. 

2.  Technology  Development  Areas 

We  have  identified  four  technology  development  areas 
that  provide  a  rich  set  of  basic  and  applied  research 
problems, represent areas that are currently
underemphasized in the DARPA IU research program, and
will provide leverage for and synergy with ongoing
DoD  development  and  DARPA  research  programs. 
These  areas  are: 

2.1.  Spatial  database  intensification 

Given  a  ’standard’  digital  map  product  such  as  Interim 
Terrain  Data  (ITD)  and  high  resolution  aerial 
photography,  investigate  automated  techniques  to 
improve  the  level-of-detail  by  detection,  delineation, 
and  attribution  of  man-made  objects  such  as  buildings 
and  roads  whose  scale  precluded  their  compilation  into 
the  original  digital  map.  Integrate  the  extracted  spatial 
information  into  the  original  ’standard’  digital  map 
product  format  to  illustrate  the  concept  of  a  seamless 
database  update. 

2.2.  Hierarchical  generalization  of  spatial  data 

Current  trends  in  heterogeneous  spatial  data  to  support 
mission  rehearsal,  simulation,  and  IPT  require  data  at  a 
variety  of  spatial  resolutions  and  accuracies.  Given  a 
standard  digital  map  product,  such  as  ITD,  aggregate 
and  generalize  the  spatial  data  by  well  defined  criteria 
to  generate  a  set  of  corresponding  digital  maps  at 
successively  coarser  levels  of  resolution.  Support  map 
update  by  propagating  changes  in  each  level  of 
representation. Correlated representations of
geographic areas at multiple levels of detail are
required  to  support  integration  of  diverse  models, 
simulations,  and  simulators  in  large  heterogeneous 
distributed  systems. 
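
One simple way to picture hierarchical generalization is a raster pyramid in which each coarser level is derived from the level below it. The sketch below is only an illustration: it assumes a gridded feature map with string labels and uses a majority rule over 2x2 blocks as the generalization criterion, which is an assumption of this example and not a criterion stated in the text. The names `generalize` and `build_pyramid` are hypothetical.

```python
from collections import Counter
from typing import List

Grid = List[List[str]]


def generalize(grid: Grid) -> Grid:
    """Halve the resolution: each 2x2 block becomes its most common label.

    Ties are broken by the label that sorts first, an arbitrary but
    deterministic rule chosen for this sketch. Odd trailing rows or
    columns are dropped.
    """
    rows, cols = len(grid), len(grid[0])
    coarse: Grid = []
    for r in range(0, rows - rows % 2, 2):
        out_row = []
        for c in range(0, cols - cols % 2, 2):
            block = [grid[r][c], grid[r][c + 1],
                     grid[r + 1][c], grid[r + 1][c + 1]]
            counts = Counter(block)
            top = max(counts.values())
            out_row.append(min(lbl for lbl, n in counts.items() if n == top))
        coarse.append(out_row)
    return coarse


def build_pyramid(grid: Grid) -> List[Grid]:
    """Return levels from finest to coarsest, stopping at a single cell."""
    levels = [grid]
    while len(levels[-1]) >= 2 and len(levels[-1][0]) >= 2:
        levels.append(generalize(levels[-1]))
    return levels
```

In this simplified setting, a map update at the finest level can be propagated by rebuilding the coarser levels from the changed level upward. Vector map products would need feature-based generalization rules in place of the block majority rule.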
2.3. Information fusion from multiple data sources

For  the  most  part,  current  research  in  computer  vision 
for  cartographic  feature  extraction  and  photo- 
interpretation  has  relied  on  digitized  aerial 
photography  using  panchromatic  imagery.  The 
mapping community has shown the utility of producing
large-scale photographic map products using Landsat
TM and SPOT satellite imagery. There exists a body
of baseline digital elevation and coarse resolution map
data (DMA DLMS, USGS DLG, etc.). A research
initiative that looks for applications that emphasize the
integration and fusion of multiple data sources would
bring current IU/AI technology to bear in an area
which  heretofore  has  received  little  attention  by  the 
research  community. 

2.4. High resolution multi-spectral image analysis

Over  the  last  twenty  years,  the  remote  sensing 
community  has  focused  its  efforts  on  the  analysis  of 
remotely  sensed  multi-spectral  imagery  such  as 
Landsat  MSS,  Landsat  TM,  and  SPOT  using 
computational  models  derived  from  statistical  analysis. 
As  the  pixel  size  has  shrunk  from  80  meters,  to  30 
meters,  to  10-20  meters,  the  opportunity  has  increased 
to  apply  structural  and  spatial  analysis  techniques 
developed  for  high  resolution  aerial  imagery  to  satellite 
data.  Further,  a  new  generation  of  high  resolution 
airborne multi-spectral scanners (Daedalus, AVIRIS,
MEIS,  etc.)  can  be  flown  to  support  multi-spectral 
pixel  resolutions  well  below  five  meters.  An 
opportunity exists to develop IU technology to support
the  automated  analysis  of  high  resolution  multi- 
spectral  imagery  that  goes  beyond  traditional  statistical 
analysis.  The  variety  of  applications  and  potential 
impact  for  automated  surface  material  classification, 
more  accurate  map  feature  attribution,  and  improved 
thematic  and  land-use  maps  makes  this  an  important 
technology  development  area. 

3.  Project  Approach  and  Technology  Transfer 

A  more  substantive  study  should  be  tasked  to  produce 
detailed  descriptions  of  key  application  problems.  This 
should  be  initiated  as  soon  as  possible.  Once  existing 
image understanding (IU) technology areas have been
matched with particular problem domains, the goal is
to show the applicability of IU/AI techniques to those
problems via image analysis and spatial database
construction tasks. This will require cooperation
between IU researchers, developers, and the end-user
community to assemble the variety of imagery, map,
and terrain data necessary to support detailed
experimentation and evaluation. It is clear that the
current IU technical base in stereo matching,
monocular image analysis, three-dimensional
modeling, information fusion, etc. provides the
technical foundation for this project. What is needed is
a focused and concerted effort for improved
performance across more complex task domains
containing a variety of source imagery and a range of
scene complexities, guided by human-level
performance metrics arising from the requirements of
the end-user community.

Technology  transfer  will  be  accomplished  in  part  by 
milestone  demonstrations  in  government  pilot  plant 
facilities  such  as  the  TEC  Terrain  Analysis  Center,  the 
National  Exploitation  Laboratory  (NEL),  and  various 
DMA  techniques  groups.  Such  demonstrations  serve 
to  show  proof  of  principle  on  representative 
operational  datasets. 

3.1.  Potential  Customers 

As previously discussed, there are many potential
customers within DoD for improved techniques to build
detailed  cartographic  databases  to  support  simulation, 
intelligence  analysis,  mission  rehearsal  and  planning. 
We  focus  on  two  illustrative  programs,  one  within 
DARPA,  and  the  other  within  the  Air  Force. 

DARPA  is  starting  a  diverse,  multi-office  program  in 
Advanced  Simulation  built  on  the  distributed 
simulation  technology  pioneered  in  the  highly 
successful  DARPA/Army  SIMNET  program.  One 
thrust  of  the  program  is  to  expand  the  original  focus  of 
armor  unit  training  to  include  all  service  training 
exercises  at  progressively  higher  echelons.  Proof-of- 
concept  was  initially  demonstrated  by  incorporating 
close  air  support  and  air  defense  artillery  to  augment 
the  armor  maneuver  force  in  the  simulation  network.  It 
was  then  extended  to  include  naval  forces  with 
coordinated  off-shore  naval  gunfire  and  cruise  missile 
capabilities  demonstrated  by  DARPA  and  the  Navy  in 
Battle  Fleet  In-Port  Training  (BFIT)  exercises. 

Advanced  Simulation  surfaces  many  open  issues  in  the 
representation  of  digital  terrain  data  to  provide  levels- 
of-detail  ranging  from  highly  efficient  formats  to 
support  fast-moving  strategic  and  tactical  aircraft  down 
to  fine-grained  terrain  databases  for  dismounted 
infantry.  During  the  course  of  a  battle  simulation  the 
terrain  may  be  modified  due  to  battlefield  activity 
requiring  the  instantiation  of  modified  terrain 
representations  throughout  a  network  of  heterogeneous 
simulators. Issues in temporal representation of micro
terrain  features  may  have  bearing  on  techniques  to 
extract  and  maintain  digital  elevation  models  including 
attributes  for  surface  material  and  cross  country 
mobility  derived  from  stereo  and  multi-spectral 
imagery. 

A  related  program  objective  is  the  integration  of 
traditionally  independent  analytical  models,  wargames, 
manned  and  unmanned  simulators,  and  actual 
equipment^  into  "seamless  simulations"  of  joint  forces 
in the modern battlefield. Advanced Simulation faces
many  tough,  interesting  problems  and  has  joint 
services  interest. 

The Special Operations Forces Aircrew Training
System (SOF-ATS) is a large United States Special
Operations Command (USSOCOM) program being
developed  by  the  Air  Force.  SOF-ATS  will  support 
crew  training  for  the  Air  Force  component  of 
USSOCOM  and  mission  rehearsal  for  both  the  Air 
Force  and  Army  aviation  components  of  USSOCOM, 
to include numerous Special Operations-variant
helicopters and fixed-wing aircraft. SOF-ATS is a
broad-based program ranging from specialized
computer image generation (CIG) hardware to
requirements for very rapid (48 hours) and large
(500,000 nm²) spatial database generation and
maintenance. The spatial database will support the
simulation  of  visual,  radar,  forward  looking  infra-red 
(FLIR),  and  night  vision  devices  to  USSOCOM 
aircrews.  SOF-ATS  includes  the  requirement  to 
consolidate  intelligence,  cartography,  imagery,  and 
meteorology  source  data  into  an  environmental  model 
used  by  its  networked  mission  rehearsal  devices. 

As  with  Advanced  Simulation,  SOF-ATS  raises 
significant  technology  issues  in  the  development  and 
maintenance  of  multi-resolution  databases  with  large 
areas of coverage. Levels-of-detail range from specific
buildings  modeled  at  high  resolution  to  coarse 
resolution  terrain  simulation  over  areas  covering  many 
thousands  of  square  nautical  miles. 

There  are  many  additional  customers  for  spin-offs  from 
a  program  focused  on  the  timely  production  of 
cartographic data. These include the DARPA
RADIUS program whose focus is on the automated
generation and maintenance of detailed site models, the
DARPA Unmanned Ground Vehicle program which
has several scenarios based upon near real-time update
of cartographic databases to support vehicle planning
and navigation, and enhancements to the new DMA
Digital Production System for improved generation of
digital cartographic data at the baseline plant.

^ i.e., Army tanks at the National Training Center, Air Force
fighter aircraft at Red Flag, and Navy fighters at "Strike
University"

Given  the  current  state  of  the  art  it  can  be  anticipated 
that  initial  operational  capabilities  for  spatial  database 
generation  will  be  largely  interactive  and  labor 
intensive.  However,  such  systems  provide  the 
framework  for  insertion  of  emerging  technology.  A 
successful  program  based  upon  automation  will  have  a 
profound  impact  on  productivity,  flexibility,  and 
timeliness  of  cartographic  compilation  in  support  of  a 
variety  of  end-users. 

3.2.  Time  Plan  and  Budget 

In  order  for  a  new  initiative  of  this  type  to  be 
successful,  it  will  require  a  strong  DARPA  liaison 
between  the  image  understanding  community  and  key 
Defense  applications.  It  will  be  important  for  the 
research  community  to  become  quite  familiar  with 
specific  programmatic  shortfalls  in  cartographic  data  at 
a level that may require a serious investment of time
and  resources.  Some  initial  ’windfalls’  can  be 
expected  from  the  current  image  understanding 
technical  base,  but  a  new  program  will  require  a 
significant level of funding ($5-10 million) over two to
three years to accelerate the rate of progress so as to
impact  long-term  programs  such  as  Advanced 
Simulation  and  SOF-ATS.  A  portion  of  this  funding 
can  be  viewed  as  ’in  place’  based  upon  the  DARPA 
programs in basic IU, UGV, and RADIUS, but these
programs  have  their  own  needs  and  goals  and  can  not 
be  relied  upon  to  address  these  more  general  issues. 
Abstract

The Image Understanding Architecture (IUA) effort is now
entering a second phase. The IUA proof-of-concept prototype
has been completed and our experience with both the
hardware and extensive software simulations is guiding the
development of a second generation of the IUA.
Furthermore, the initial research-oriented software
development environment is currently being replaced by a
sophisticated set of application-oriented tools. Thus, the
IUA effort is in the process of making the transition from an
isolated research project to being in a position of
accessibility to the wider community. This article describes
the current status of the effort and some of our plans for the
future. IUA development is taking place at three sites: the
University  of  Massachusetts  at  Amherst,  Hughes  Research 
Laboratories  in  Malibu,  and  Amerinex  Artificial  Intelligence 
Inc.  The  article  is  thus  divided  into  major  sections  that 
describe  the  efforts  taking  place  at  each  site. 

1.  University  of  Massachusetts 

Efforts at the University of Massachusetts have focussed
principally on three areas: the design of the second generation
IUA hardware, development of advanced programming tools,
and algorithm development. The second generation IUA
design is nearly complete and, although we expect a few
aspects to change, our current view of the architecture is
briefly described below. In the area of programming tools,
we will give an overview of the multi-associative
programming model that we have developed for the low
(CAAPP) level of the IUA. We will also discuss some of
the issues involved in building a parallel, intermediate-level
symbolic representation (ISR) database for the ICAP level of
the IUA. We will also summarize an IUA application: that
of deriving dense depth maps from known monocular
motion.

1.1 Second Generation IUA

For the purpose of comparison, we first summarize the
characteristics of the original IUA. The first generation of
the IUA is a proof-of-concept prototype containing 4K
low-level (CAAPP) processors, 64 intermediate-level (ICAP)
processors, and a single high-level (SPA) processor which
also serves as the system host [Weems, 1989]. The CAAPP
processors are bit-serial, each with 320 bits of on-chip
memory and 32K bits of backing-store memory. The ICAP
processors are 16-bit TMS320C25 chips, each with 256K
bytes of private memory, 256K bytes shared with 64
CAAPP processors, and 128K bytes shared with the SPA.
The ICAP processors communicate via a centrally-controlled
bit-serial crossbar switch, using their built-in serial
communication channels. The SPA is any VME-bus
compatible processor, typically a Sun-4.

The Array Control Unit (ACU) for the prototype is a very
simple memory buffer for streaming instructions to the array
at a high rate. The ACU has no processing or branching
capability, and thus all control flow is managed by the host.
This arrangement is adequate for its purpose (testing and
limited demonstrations of the system) but is not effective for
real applications.

Data  is  loaded  into  the  CAAPP  by  writing  to  an  image 
buffer,  which  is  then  shifted  into  all  of  the  processor  chips 
in  parallel,  via  their  nearest  neighbor  mesh.  Output  from  the 
CAAPP  follows  the  reverse  of  this  process.  I/O  with  the 
ICAP  is  performed  by  the  SPA/host,  via  the  dual-ported 
memory between the ICAP and the SPA/host.

Physically, the IUA prototype consists of 16 12U circuit
boards,  plus  additional  boards  for  control,  I/O  and 
communication.  The  boards  are  sparsely  populated  to 
permit  easy  diagnosis  and  rework. 

The second generation IUA retains the basic three-level
structure of the prototype, but the SPA and host will be
separate processors. We expect to use a commercially
available multiprocessor board set for the SPA. The host
will again interface via a VME extender. The new ACU will
be a full-fledged processor, consisting of a microcode engine
with a 128-bit instruction word and two separate arithmetic
units (one for computation, and one for address arithmetic).
The standard configuration for the second generation IUA
will contain 16K low-level processors, 64 intermediate-level
processors, and four high-level processors. Physically, the
processor array will consist of only 8 12U processor boards,
plus some additional boards for control and I/O.

At the CAAPP level, the basic bit-serial architecture will
again be used, but a 32-bit corner-turning register increases
the on-chip memory to 352 bits per processor. The
corner-turning register provides greater flexibility in formatting
values that are to be passed to the ICAP. Image I/O with the
CAAPP still involves writing to a frame buffer, but the data
path to the buffer is now 128 bits wide, permitting a data
rate of 160 MB per second. Once the data is in the frame
buffer, it appears as merely another segment (HCSM) of
backing-store memory (CISM) to the CAAPP processors.
Thus, the time to load or store an image is the same as for
any other backing store fetch. HCSM provides 4K bytes of
storage for each CAAPP processor. CISM has further been
doubled in size to 64K bits (8K bytes) per processor.

The ICAP level has been completely redesigned. It now uses
the TMS320C40 32-bit processor, which contains both
integer and floating-point units, and operates at up to 50
MFLOPS. Each ICAP processor will have 1 MB of private
storage in addition to the ability to access the 2 MB of
memory it shares with 256 CAAPP processors. ICAP
processors are now arranged in groups of four to form a
quadnode (see Figure 1). Each quadnode has a 4 MB local
shared memory which is immediately accessible to the four
processors. The local shared memories of all of the
quadnodes combine, however, to form a distributed shared
memory. Any processor has access to all of the shared
memory, although the latency to access a memory outside of
the local quadnode will be slightly greater than a local
access. In the standard IUA configuration then, there is a 64
MB global shared memory, accessible to all processors.
Access to remote segments of the shared memory is via a
four by four mesh of buses.

Communication in the ICAP also takes place via a set of
message-passing channels. Each processor has six 8-bit
channels together with six DMA controllers. Thus, each
quadnode has a pool of 24 channels. Of these, 8 form a
token ring within the quadnode, 15 are connected directly to
all of the other quadnodes, and the remaining channel is
brought to an external channel for diagnostics or custom
I/O. Thus, each quadnode is directly connected to all other
quadnodes via a DMA channel, as shown in Figure 2.

Figure 1. ICAP Bus Structure

Figure 2. ICAP Communication Structure

Note that in Figure 1, a portion of the architecture is
labelled as optional. This portion of the system can be
omitted in the initial release of the hardware, and then be
added later by replacing a daughterboard. In essence, we have
designated a minimal subset of the system to reduce the
risks in meeting the accelerated development schedule for the
second generation. If the optional components are omitted, it
is possible to build the entire second generation IUA with
off-the-shelf components (except for the custom CAAPP
processor chip; however, that chip is merely a four-times
replication of the original CAAPP chip on a single die, so it
is also a low-risk component).
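
The configuration figures quoted in this section can be cross-checked
against one another. The following compile-time sketch does so; the
constant and function names are ours, introduced only for illustration,
and only the numeric values are taken from the text.

```cpp
#include <cstddef>

// Figures quoted in the text for the standard second-generation IUA.
constexpr std::size_t kCaappProcessors      = 16 * 1024;  // low-level PEs
constexpr std::size_t kIcapProcessors       = 64;         // intermediate level
constexpr std::size_t kIcapPerQuadnode      = 4;
constexpr std::size_t kQuadnodeSharedMB     = 4;          // local shared memory
constexpr std::size_t kChannelsPerProcessor = 6;
constexpr std::size_t kRingChannels         = 8;          // token ring in a quadnode
constexpr std::size_t kExternalChannels     = 1;          // diagnostics or custom I/O

// Derived quantities (helper names are illustrative, not from the hardware).
constexpr std::size_t quadnodes()        { return kIcapProcessors / kIcapPerQuadnode; }
constexpr std::size_t caapp_per_icap()   { return kCaappProcessors / kIcapProcessors; }
constexpr std::size_t global_shared_mb() { return quadnodes() * kQuadnodeSharedMB; }
constexpr std::size_t channel_pool()     { return kIcapPerQuadnode * kChannelsPerProcessor; }
constexpr std::size_t inter_quadnode_channels() { return quadnodes() - 1; }

static_assert(quadnodes() == 16, "sixteen quadnodes, one per node of a 4 x 4 mesh");
static_assert(caapp_per_icap() == 256, "each ICAP shares memory with 256 CAAPP PEs");
static_assert(global_shared_mb() == 64, "64 MB global shared memory");
static_assert(channel_pool() == 24, "24 channels per quadnode");
static_assert(kRingChannels + inter_quadnode_channels() + kExternalChannels
                  == channel_pool(),
              "8 ring + 15 inter-quadnode + 1 external = 24");
```

The last assertion shows why exactly one channel per quadnode is left
over for external use once every other quadnode has a direct link.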

1.2  Multi-associative  Processing  Model 

Computation  on  and  among  data  sets  mapped  to  irregular, 
non-uniform,  aggregates  of  processing  elements  (PEs)  is  an 
important  problem  in  parallel  vision  processing,  arising  in 
segmentation  and  in  support  operations  for  intermediate- 
level  grouping  tasks.  The  difficulty  is  that  the  SIMD 
processors  which  map  so  effectively  onto  pixel-based 
processing  are  restricted  in  these  data  dependent 
computations  by  the  inherent  limitations  of  their  control 
mechanism.  Previously,  we  have  used  associative  processing 
as  a  means  of  applying  parallel  processing  to  non-uniform 
computations  [Weems,  1984].  For  example,  this  approach 
uses  global  feedback  to  process  individual  regions 
efficiently,  but  often  requires  processing  to  take  place  on  one 
region  at  a  time.  In  our  current  work,  we  address  this 
problem  by  introducing  an  additional  level  of  parallelism, 
which  we  call  multi-associativity,  that  provides  a  framework 
for  performing  associative  computation  on  independent  data 
sets  simultaneously. 

A typical vision processing problem to which
multi-associativity can be applied is the characterization of regions
obtained from a connected components algorithm. Some
parameters to be derived may include the number of pixels,
boundary length, and mean and median of various spectral
quantities. However, since these regions are arbitrarily
shaped collections of contiguous processing elements, the
communication patterns are also necessarily non-uniform.
Although we have developed routing algorithms to collect
data using roughly 2d communication steps (where d is the
extent of the largest region) [Herbordt, 1990], we would like
to take advantage of the constraints provided by this problem
to improve that performance.

We  first  look  at  how  the  problem  would  be  solved  using 
traditional  associative  processing.  A  typical  associative 
operation  is  for  the  controller  to  broadcast  a  query  to  the 
array,  and  to  receive  a  response  in  the  form  of  a  count  of  the 
PEs  with  agreeing  tag  bits.  But  associative  processing,  as 
opposed  to  the  familiar  associative  memory  operations,  also 
enables  the  conditional  generation  of  symbolic  tags  based  on 
the  values  of  data,  and  the  use  of  those  tags  to  constrain 
further  processing.  Associative  algorithms  requiring  a 
number  of  steps  proportional  to  the  number  of  tag  bits  have 
been  developed  for  finding  the  maximum  or  minimum 
value,  the  mean  or  median  of  selected  values,  and  others. 
Descriptions  of  these  and  more  complex  algorithms  can  be 
found  in  [Foster,  1976;  Weems,  1984;  and  Weems,  1989]. 
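
As an illustration of why such algorithms take a number of steps
proportional to the number of tag bits, the sketch below simulates a
most-significant-bit-first maximum search on a sequential machine. It
is not CAAPP code: the function names are hypothetical, the global
count is modelled by an ordinary loop, and we assume this classic
formulation is representative of the algorithms cited above.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Sequential stand-in for the array's global count response:
// the number of PEs whose responder flag is set.
std::size_t global_count(const std::vector<bool>& responders) {
    std::size_t n = 0;
    for (bool r : responders) n += r ? 1 : 0;
    return n;
}

// Associative maximum over the PEs flagged in `selected`.
// One broadcast query and one count per bit, most significant bit first.
// Values are assumed to fit in `bits` bits; at least one PE must be selected.
std::uint32_t associative_max(const std::vector<std::uint32_t>& value,
                              const std::vector<bool>& selected,
                              unsigned bits) {
    assert(value.size() == selected.size());
    assert(bits >= 1 && bits <= 32);
    std::vector<bool> candidate = selected;
    assert(global_count(candidate) > 0);
    std::uint32_t result = 0;
    for (unsigned b = bits; b-- > 0;) {
        // Query: which remaining candidates have bit b set?
        std::vector<bool> responders(value.size());
        for (std::size_t pe = 0; pe < value.size(); ++pe)
            responders[pe] = candidate[pe] && ((value[pe] >> b) & 1u);
        if (global_count(responders) > 0) {
            candidate = responders;          // drop candidates with a 0 here
            result |= std::uint32_t{1} << b;
        }
    }
    return result;
}
```

Each loop iteration corresponds to one query and one count on the
array, so a 16-bit field needs 16 such steps regardless of how many
PEs take part.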

Hardware support in the CAAPP for the global count
operation yields performance of approximately 2
microseconds; since tag fields are typically 16-20 bits, these
associative algorithms complete in roughly 120-200
microseconds. Although associative processing enables
computations based on PE attributes and relationships to
other PEs and events, we are often only processing one
region at a time with this approach.

We have developed algorithms for the coterie network
[Weems, 1989] that efficiently simulate, within many
non-uniform aggregates of PEs simultaneously, the associative
operations that the CAAPP supports directly in hardware for
the entire processor array. Most significantly, we have
developed an algorithm to count selected pixels simultaneously
in each region in a number of steps proportional to the length
of the PE ID (O(log N)). Although this response is not in the
microsecond range of the global count, it is significantly
faster than previous O(d) algorithms. The consequence is
that all existing associative algorithms that were previously
run in parallel but region-serially can now be run
region-parallel (on each region simultaneously). For example,
the algorithm to find the mean of some attribute in each region
takes O((log N)**2) steps. Although the elapsed time for a
single region is significantly longer than for the same globally
associative algorithm, the gain can still be substantial, as
often thousands of regions must be processed. We estimate
the break-even point at between 50 and 100 regions.
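
The break-even argument can be written as a simple cost model. The
sketch below is ours and assumes that the region-serial approach costs
one globally associative pass per region, while the region-parallel
approach costs a single, longer pass that is independent of the number
of regions. The function names and the time parameters are
hypothetical; no measured timings are implied.

```cpp
#include <cassert>
#include <cmath>

// Total time to process `regions` regions one after another with a
// globally associative algorithm costing `t_global` per region.
double region_serial_time(double t_global, unsigned regions) {
    return t_global * regions;
}

// Smallest region count at which a single region-parallel pass costing
// `t_multi` is no slower than the region-serial approach.
unsigned break_even_regions(double t_global, double t_multi) {
    assert(t_global > 0.0 && t_multi >= 0.0);
    return static_cast<unsigned>(std::ceil(t_multi / t_global));
}
```

Under this model, a break-even point of 50 to 100 regions corresponds
to a region-parallel pass costing roughly 50 to 100 times as much as
one global pass.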

Other results we have obtained are new multi-associative
algorithms for parallel prefix and convex hull, that is,
algorithms that perform these computations on aggregates of
PEs in parallel and simultaneously. The multi-associative
framework also extends the traditional associative paradigm
by allowing operations on and among aggregates of PEs that
are not defined when processing is always performed
globally. Two consequences are the support of
divide-and-conquer algorithms within aggregates, and communication
among aggregates. The latter operation is especially useful
during the merge phase of segmentation algorithms, when
characteristics of a region can be transferred to neighboring
regions in a single communication step.


1.3  The  Object-Oriented  Store 

A key component of the IUA programming environment is
the intermediate-level symbolic representation (ISR) database
[Brolio, 1989; Draper 1989]. The purpose of this database is
to store the symbolic representations of extracted image events,
groups of events, and instantiated models. As the basis for
this new version of the ISR, we are developing a persistent,
parallel, object-oriented store.

The store will be object-oriented by Wegner's definition
[Wegner 1990], in that it supports objects with classes and
inheritance. Objects are the encapsulation of data with the
procedures or methods that operate on them. Classes group
together objects with a common template. Inheritance uses
overloading of functions or operators in a hierarchy of
classes to express similarity among related classes.

The store's persistence is derived from the fact that objects
may outlive the process that creates them. Thus,
programmers are not required to translate between flat
file-system storage and structured, encapsulated, in-memory
storage.

The parallelism stems from a novel style of parallel
programming, known as POATS, for Parallel Operators
Applied To Structures. POATS provides the speed-up
associated  with  parallel  execution  but  with  less  programmer 
effort  than  traditional  MIMD  methods.  It  allows  the 
programmer  to  concentrate  on  the  data's  structure  and 
operations  on  it,  instead  of  on  the  coordination  of  multiple 
processors.  The  programmer  uses  a  comprehensive  set  of 
data  structures  and  operators  to  specify  transformations  of 
data.  POATS  is  a  similar  but  higher  level  approach  to  that 
of  the  concurrent  aggregates  proposed  in  [Chien  1991]. 

The  operators  in  POATS  can  apply  functions  to  data 
structures  in  parallel.  These  functions  can  themselves  be 
composed  of  parallel  operators  so  that  nested  data  structures 
can be handled. Predefined or programmer-created patterns are
used to specify dependencies among elements of a data
structure. Extensions to the compiler and run-time system
of the host language(s) determine the mapping of existing
processors to the data structures, and how to coordinate the
processors.

POATS  combines  elements  of  data  parallelism  with  MIMD 
processing  to  permit  more  flexibility  in  the  manipulation  of 
complex  data  structures.  The  POATS  model  may  be  thought 
of  as  a  form  of  Single  Program  Multiple  Data  (SPMD) 
parallelism,  except  that  it  does  not  necessarily  enforce 
synchronization  at  the  frequent  intervals  that  are  typical  of 
SPMD programs. For example, several POATS operations
might be active at once, allowing greater utilization of
processing resources.

The  object-oriented  store  will  eventually  become  the  core  of 
the  parallel  ISR  database.  In  addition  to  the  capabilities  of 
the  object-oriented  store,  the  database  will  provide  meta-data 
descriptions  (schemas),  indexing  structures,  a  query 
language,  version  control,  garbage  collection,  recovery,  and 
perhaps protection. Built-in objects will include images and
image  sequences,  the  DARPA  Image  Understanding 
Environment  set  of  standard  objects,  and  libraries  of  reusable 
procedures. 

1.4  Dense  Depth  Map  Application 

An SIMD depth-from-motion algorithm has been
implemented for the IUA using the simulator for the IUA
prototype. Image correspondences are established through
correlation for two temporally separated images. The depth
map is computed from the image displacements and
approximately known motion parameters. The map is then
filtered to eliminate some possibly erroneous isolated depth
values.


The algorithm takes roughly 0.53 seconds to compute on
the IUA. By comparison, a similar algorithm for
correspondence alone takes about 3 minutes on the
Connection Machine, about 10 minutes on a Sequent
Symmetry multiprocessor (12 Intel 80386 processors), and
about 2 hours on a Vaxstation 3100. The majority of the
time is spent on the correspondence, which involves
searching a 41 x 41 window in the second image of the pair.

Qualitatively,  the  algorithm  appears  to  give  good  results, 
clearly  distinguishing  the  depths  of  strong  features. 
Quantitatively, the results are accurate to a range of about 50
feet, for a four foot forward motion of the camera (which has
a  45  degree  field  of  view),  with  a  1.1%  mean  error  in  the 
calculated  depths. 
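
To make the cost of the correspondence step concrete, the sketch below
performs the window search for a single pixel on a sequential machine.
It is an illustration only: the type and function names are
hypothetical, and the sum-of-absolute-differences score and the patch
size are our choices, since the text does not specify the correlation
measure. A radius of 20 gives the 41 x 41 search window, that is, 1681
candidate displacements per pixel.

```cpp
#include <cassert>
#include <cstdlib>
#include <limits>
#include <vector>

struct Image {
    int width, height;
    std::vector<int> pix;                 // row-major grey levels
    int at(int x, int y) const {          // clamp coordinates at the border
        x = x < 0 ? 0 : (x >= width  ? width  - 1 : x);
        y = y < 0 ? 0 : (y >= height ? height - 1 : y);
        return pix[y * width + x];
    }
};

struct Displacement { int dx, dy; };

// Find the displacement within +/-radius that best matches the patch
// around (x, y) in `first` to `second`. The first minimum found is kept.
Displacement best_match(const Image& first, const Image& second,
                        int x, int y, int radius, int patch = 1) {
    assert(radius >= 0 && patch >= 0);
    long best = std::numeric_limits<long>::max();
    Displacement d{0, 0};
    for (int dy = -radius; dy <= radius; ++dy)
        for (int dx = -radius; dx <= radius; ++dx) {
            long sad = 0;                 // sum of absolute differences
            for (int v = -patch; v <= patch; ++v)
                for (int u = -patch; u <= patch; ++u)
                    sad += std::abs(first.at(x + u, y + v) -
                                    second.at(x + dx + u, y + dy + v));
            if (sad < best) { best = sad; d = {dx, dy}; }
        }
    return d;
}
```

On a SIMD array the two outer loops remain sequential, while the
per-pixel work inside them is carried out for all pixels at once.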

2. Amerinex Artificial Intelligence

Efforts at Amerinex Artificial Intelligence (AAI) have
concentrated in two areas: completing the design and
development of a C++ class library for the CAAPP
(initially begun at UMass), and building IU-specific
application development tools for the intermediate (ICAP)
level of the IUA. Unlike the research-oriented software
produced by the University, the effort at AAI is intended to
create production-quality software support for the IUA.

2.1 IUA Software Philosophy

Here we describe the progress in creating the software
environment that will exist on the IUA. The usual goals
were placed on the ensuing environment:

• It should be easy to use.

As an example, the processing elements (PEs) at the
CAAPP level are bit-serial devices requiring 17
operations to add two eight-bit values. It should not be
necessary for the user to write these 17 operations or to
think of the CAAPP as a bit-serial device.

• It should integrate the various levels of the IUA.

The IUA consists of several levels of processors and
multiple communication paths between and within these
levels. It should be possible for users to easily integrate
these levels in their problem solutions.

• It should be efficient.

The incentive to use the combined hardware/software
environment, due to increased performance and
programmer productivity, should be significantly greater
than the incentive to use a standard uniprocessor
environment.

• The environment should be familiar.

The user should not be required to learn new and esoteric
languages. It is difficult enough to utilize the concepts of
parallel processing without having to learn entirely new
syntax.
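
To show what the first goal hides from the user, the sketch below
simulates a bit-serial addition across all PEs and counts the
array-wide one-bit operations issued. The accounting is an assumption
on our part: two operations per bit position (sum and carry) plus one
to store the final carry gives 17 for eight-bit operands, which matches
the figure quoted above, but the actual CAAPP instruction sequence may
differ. All names are hypothetical.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

using BitColumn = std::vector<bool>;      // one bit per PE

struct SerialAddResult {
    std::vector<BitColumn> sum;           // bits + 1 columns, LSB first
    unsigned operations;                  // array-wide one-bit operations
};

// Spread the low `bits` bits of each value across bit columns.
std::vector<BitColumn> to_columns(const std::vector<unsigned>& v,
                                  std::size_t bits) {
    std::vector<BitColumn> cols(bits, BitColumn(v.size()));
    for (std::size_t i = 0; i < bits; ++i)
        for (std::size_t pe = 0; pe < v.size(); ++pe)
            cols[i][pe] = (v[pe] >> i) & 1u;
    return cols;
}

// Reassemble the value held by one PE.
unsigned value_at(const std::vector<BitColumn>& cols, std::size_t pe) {
    unsigned v = 0;
    for (std::size_t i = 0; i < cols.size(); ++i)
        if (cols[i][pe]) v |= 1u << i;
    return v;
}

// Ripple-carry addition, one bit position at a time, on every PE at once.
SerialAddResult bit_serial_add(const std::vector<BitColumn>& a,
                               const std::vector<BitColumn>& b) {
    assert(a.size() == b.size() && !a.empty());
    const std::size_t bits = a.size();
    const std::size_t pes = a[0].size();
    SerialAddResult r{std::vector<BitColumn>(bits + 1, BitColumn(pes)), 0};
    BitColumn carry(pes, false);
    for (std::size_t i = 0; i < bits; ++i) {
        BitColumn next(pes);
        for (std::size_t pe = 0; pe < pes; ++pe) {
            bool x = a[i][pe], y = b[i][pe], c = carry[pe];
            r.sum[i][pe] = x ^ y ^ c;                   // operation 1: sum bit
            next[pe] = (x && y) || (c && (x ^ y));      // operation 2: carry bit
        }
        carry = next;
        r.operations += 2;
    }
    r.sum[bits] = carry;                  // final operation: store carry-out
    r.operations += 1;
    return r;
}
```

A class library can wrap a routine of this kind behind an ordinary
operator+, so that the user never sees the individual bit steps.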

As a general philosophy, we have decided to base the
software on objects and C++ wherever this is feasible. As
a start, and a framework upon which to tie together the
requirements, we developed a Class Library for the IUA. We
therefore utilize C++ and a set of classes to describe
operations performed at the CAAPP level of the IUA. The
programs written using the Class Library look like
conventional C++ programs, except that expressions such as
(a + b) / (c + 5) may refer to data parallel operations on a
SIMD array. These user programs actually run on the Array
Control Unit (ACU) and communicate with the host
processor using messages (requests). The same programs
communicate with the ICAP by invoking processes at that
level. We intend to provide libraries that implement various
communication schemes. Eventually, to obtain greater
object code optimization, we will modify the C++ compiler
to recognize our classes, but these changes will not affect
the language definition.

In the following section we briefly describe the Class
Library for the IUA and how programs written with it
communicate with the other parts of the IUA. We describe
the major process that runs on the ICAP, and explain how
other ICAP processes, both predefined and user defined, are
created and communicate.

2.2 The C++ Class Library for the IUA

By using C++, we avoid defining a new language and
having to validate its syntax and semantics. C++ provides
proven mechanisms for programming that can be used to
control the additional operations needed for the IUA. We
provide these additional operations using classes (or objects)
which are defined using standard C++ and object oriented
concepts.

The base class is the plane, which is understood to be a
two-dimensional grid of elements where each element of the grid
exists at a single (virtual) processing element (PE) of the
CAAPP. We do not use the term array as it already has a
defined meaning in C++ for another construct. In contrast to
planes, the nominal objects in C++ are referred to as scalars.
Standard arithmetic operations may be applied to planes by
applying the operation to each element of the grid that
comprises the plane. Standard arithmetic operations may be
applied between scalars and planes as well by replicating the
scalar for each element of the grid.

Just as scalars are distinguished as being of type int, short,
char, etc., planes are also distinguished as being of type
IntPlane, ShortPlane, CharPlane, etc. These new classes are
derived from the plane class and differ in the number of bits
used to represent each element in the grid.
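
The following sequential sketch illustrates the element-wise and
scalar-replication semantics just described. It is not the Class
Library itself: the template, its members, and the helper function are
our own simplifications, and only the type name IntPlane is taken from
the text.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// A two-dimensional grid with one element per (virtual) PE.
template <typename T>
struct Plane {
    std::size_t rows, cols;
    std::vector<T> data;                       // row-major storage
    Plane(std::size_t r, std::size_t c, T fill = T())
        : rows(r), cols(c), data(r * c, fill) {}
};

// Apply a binary operation to corresponding elements of two planes.
template <typename T, typename Op>
Plane<T> elementwise(const Plane<T>& a, const Plane<T>& b, Op op) {
    assert(a.rows == b.rows && a.cols == b.cols);
    Plane<T> out(a.rows, a.cols);
    for (std::size_t i = 0; i < a.data.size(); ++i)
        out.data[i] = op(a.data[i], b.data[i]);
    return out;
}

template <typename T>
Plane<T> operator+(const Plane<T>& a, const Plane<T>& b) {
    return elementwise(a, b, [](T x, T y) { return static_cast<T>(x + y); });
}

template <typename T>
Plane<T> operator/(const Plane<T>& a, const Plane<T>& b) {
    return elementwise(a, b, [](T x, T y) { return static_cast<T>(x / y); });
}

// A scalar operand is replicated for every element of the grid.
template <typename T>
Plane<T> operator+(const Plane<T>& a, T s) {
    return a + Plane<T>(a.rows, a.cols, s);
}

using IntPlane = Plane<int>;
```

With these definitions an expression such as (a + b) / (c + 5), where
a, b, and c are IntPlane objects, reads exactly like its scalar
counterpart.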


Two levels of control must be provided. The first is "should
an operation be applied to an entire plane" and the second is
"should an operation be applied to a particular element of a
plane". The first type of control is provided by the C++
control statements such as if and for.

The second method of control is provided by the concept of
activity. The activity is specified independently for each
virtual PE. Activity is embodied in three new classes, Select,
SelectNot, and Everywhere, which allow the activity to be
set for each PE using a BitPlane. Activity controls data
transfer within the PE and no other operations. That is, an
element in a destination plane is modified under an
assignment operation if and only if its PE is active. A
particular activation has a scope in the same way that
variables have scope. Activity may be nested, allowing a
cumulative winnowing of the set of active elements.
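
One way to obtain scoped, nestable activity in standard C++ is to tie
it to object lifetime. The sketch below is a sequential illustration
under that assumption, not the library's implementation: the mask
stack, the assign function, and the size argument to Everywhere are
our own simplifications, and only the class names Select and
Everywhere come from the text.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

using Mask = std::vector<bool>;               // one activity bit per PE

// Stack of activity masks; the top entry is the current activity.
std::vector<Mask>& activity_stack() {
    static std::vector<Mask> stack;
    return stack;
}

// With no activation in scope, every PE is active.
bool is_active(std::size_t pe) {
    const std::vector<Mask>& s = activity_stack();
    return s.empty() || s.back()[pe];
}

// Narrow the current activity to the PEs flagged in `mask`
// for the lifetime of this object.
class Select {
public:
    explicit Select(const Mask& mask) {
        Mask next(mask.size());
        for (std::size_t pe = 0; pe < mask.size(); ++pe)
            next[pe] = mask[pe] && is_active(pe);
        activity_stack().push_back(next);
    }
    ~Select() { activity_stack().pop_back(); }
    Select(const Select&) = delete;
    Select& operator=(const Select&) = delete;
};

// Make every PE active for the lifetime of this object,
// regardless of any enclosing selection.
class Everywhere {
public:
    explicit Everywhere(std::size_t pes) {
        activity_stack().push_back(Mask(pes, true));
    }
    ~Everywhere() { activity_stack().pop_back(); }
    Everywhere(const Everywhere&) = delete;
    Everywhere& operator=(const Everywhere&) = delete;
};

// Assignment modifies a destination element only if its PE is active.
void assign(std::vector<int>& dst, const std::vector<int>& src) {
    assert(dst.size() == src.size());
    for (std::size_t pe = 0; pe < dst.size(); ++pe)
        if (is_active(pe)) dst[pe] = src[pe];
}
```

Because each activation is undone by its destructor, leaving a block
restores the enclosing activity automatically, which is the scoping
behaviour described above.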

Other operations are provided for planes. These operations
are applied as C++ methods to particular planes. Examples
include:

•  Any,  which  returns  a  scalar  1  if  any  elements  of  a  BitPlane 
are  1  and  a  0  if  no  elements  of  a  BitPlane  are  1. 

•  Count,  which  returns  a  count  of  the  number  of  elements  of 
a  BitPlane  which  have  the  value  1. 

•  West,  North,  East,  and  South  which  implement  neighbor 
communication  on  the  grid. 

•  Generalized  routing  operations  with  combining. 

In addition to the above operations that exist on many
SIMD mesh parallel processors, the IUA has hardware for
allowing operations to be performed in parallel by regions.
This hardware embodies what we believe are important
capabilities for image understanding applications. There are
four switches at every PE that allow four-way
connectedness. A PE is connected to its neighbor if its
switch, in that direction, is set and its neighbor's switch, in
the opposite direction, is set. Once the switches are set, all
PEs that are connected define a region called a coterie.
Information may be broadcast by some PEs on the circuit
formed by the switches and sampled by every PE also on the
circuit. Thus, if only one PE per coterie places information
on the circuit, it can do a one-to-many broadcast of this
information to all the other PEs that form the coterie. If
more than one PE broadcasts, the message is the logical OR
of the multiple messages sent. For a one-bit message, the
result is thus equivalent to the Any operation being applied
in parallel to all coteries. Coteries are implemented by the
classes CoterieWENS, CoterieWE, and CoterieNS as well as
by methods applied to planes.
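
The connection rule and the wired-OR broadcast can be simulated on a
sequential machine as follows. This is a sketch of the semantics only:
the struct and function names are hypothetical, a flood fill stands in
for the electrical circuit, and links are not wrapped around the grid
edges.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

struct Switches { bool west, east, north, south; };

// Give each PE of a rows x cols grid a coterie id. Two neighbors are
// linked only if both have their switch toward the other one set.
std::vector<int> coteries(const std::vector<Switches>& sw, int rows, int cols) {
    assert(sw.size() == static_cast<std::size_t>(rows) * cols);
    std::vector<int> id(sw.size(), -1);
    int next = 0;
    for (int start = 0; start < rows * cols; ++start) {
        if (id[start] != -1) continue;
        std::vector<int> stack{start};
        id[start] = next;
        while (!stack.empty()) {
            int p = stack.back();
            stack.pop_back();
            int r = p / cols, c = p % cols;
            auto join = [&](int q) {
                if (id[q] == -1) { id[q] = next; stack.push_back(q); }
            };
            if (c > 0        && sw[p].west  && sw[p - 1].east)     join(p - 1);
            if (c + 1 < cols && sw[p].east  && sw[p + 1].west)     join(p + 1);
            if (r > 0        && sw[p].north && sw[p - cols].south) join(p - cols);
            if (r + 1 < rows && sw[p].south && sw[p + cols].north) join(p + cols);
        }
        ++next;
    }
    return id;
}

// Every PE samples the logical OR of the bits driven onto its coterie.
std::vector<bool> coterie_broadcast(const std::vector<int>& id,
                                    const std::vector<bool>& sent) {
    assert(id.size() == sent.size());
    std::vector<bool> wire(id.size(), false);   // indexed by coterie id
    for (std::size_t i = 0; i < id.size(); ++i)
        if (sent[i]) wire[id[i]] = true;
    std::vector<bool> received(id.size());
    for (std::size_t i = 0; i < id.size(); ++i)
        received[i] = wire[id[i]];
    return received;
}
```

With a one-bit message, coterie_broadcast returns for each PE the
result of an Any operation restricted to that PE's own coterie.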

Because the IUA will exist in several geometries that result
in different grid sizes for the CAAPP, it must be possible to
write the programs based on the plane size and not the IUA
size. It must also be possible to run the same size problem
on both large and small instantiations of the IUA with the
only difference being the length of time needed for the
computation. A single program may contain several plane
sizes. Therefore, programs written using the Class Library
specify the size of each plane, and the IUA software maps
this to the actual machine that is being used.

For example, if an IUA has a 128 x 128 grid of physical
PEs at the CAAPP level and the size of a plane is 256 x
256, then the plane must be split into 2 x 2 tiles to fit the
actual IUA. A plane size of 256 x 258 would require 2 x 3
tiles and the tiling factor would be 6.

The size of the plane is fixed at compile time by
automatically converting the specified size to an integral
multiple of the size of the IUA. For example, if the plane
size is 40 x 40 and the IUA size is 32 x 32 PEs, there will
be four tiles and the actual size will be 64 x 64. We make a
distinction between the problem and actual sizes. The
programmer must consider the actual size for mesh
communications with the West(), East(), North(), and
South() methods using the torus connections of the mesh.
Note that the actual size is the size of the virtual processor
array and is not necessarily the same as the physical size.
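
The tiling arithmetic in the two examples above amounts to rounding
each plane dimension up to a multiple of the corresponding array
dimension. A minimal sketch follows; the struct and function names are
ours, not the library's.

```cpp
#include <cassert>

struct Tiling {
    int tiles0, tiles1;       // tiles along the first and second dimension
    int actual0, actual1;     // plane size after rounding up
    int factor() const { return tiles0 * tiles1; }
};

// Round a plane of size plane0 x plane1 up to a whole number of
// array0 x array1 tiles.
Tiling tile_plane(int plane0, int plane1, int array0, int array1) {
    assert(plane0 > 0 && plane1 > 0 && array0 > 0 && array1 > 0);
    Tiling t;
    t.tiles0 = (plane0 + array0 - 1) / array0;   // ceiling division
    t.tiles1 = (plane1 + array1 - 1) / array1;
    t.actual0 = t.tiles0 * array0;
    t.actual1 = t.tiles1 * array1;
    return t;
}
```

For a 256 x 258 plane on a 128 x 128 array this gives 2 x 3 tiles and
a tiling factor of 6; for a 40 x 40 plane on a 32 x 32 array it gives
four tiles and an actual size of 64 x 64, as stated in the text.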

One of the benefits of this class library is that it does not
require the use of an IUA. The class library can be
implemented on other SIMD architectures with more or less
ability to support the operations provided by the library. It
has also been implemented on sequential machines. The
generality of the Class Library for the IUA allows it to form
the basis of a language for specifying a wide range of image
understanding algorithms.

Figure 3 is an example function which calculates the integer
square root for each element of an IntPlane. Note how
similar this function is to one for scalars.

The  function  shown  in  Figure  4  implements  a  simple  edge 
operation  in  the  x-direction,  and  is  an  example  of 
neighborhood  communication. 

The  function  in  Figure  5  forms  regions  based  on  connected 
component  equivalence  classes  and  then  labels  the  regions 
formed  using  the  address  of  one  of  the  PEs  in  each  region;  it 
is  an  example  illustrating  the  use  of  coteries. 


2.3  The  ICAP  and  the  ISR 

Software for the IUA's intermediate (ICAP) level is arranged
hierarchically with each layer providing additional
functionality or an abstraction of the lower levels.

Figure 6 depicts the hierarchy. This section describes our
current designs for each layer. None of the components have
been implemented yet.


ShortPlane
IntSqrt(IntPlane initial)
{ IntPlane guess (initial.Size());
  IntPlane last_guess(initial.Size());
  IntPlane res (initial.Size());
  BitPlane a (initial.Size());
  int iterations = 18;
  int count;

  guess = initial;
  a = (guess != 0);
  count = a.Count();            // number of non-zero elements

  { Select active(a);
    while (iterations--) {
      last_guess = guess;

      { Everywhere active;               // Set 0s to 1s
        { Select active(guess == 0);     // because divide will
          guess = 1;                     // be done everywhere
        }
      }

      res = initial / guess;
      res += last_guess;
      guess = res >> 1;

      // Stop once every non-zero element has converged.
      if (count <= (guess == last_guess).Count()) break;
    }
  }

  return ShortPlane(guess);
}

Figure 3. C++ Class Library Example of Calculating
the Integer Square Root of Every Pixel in an Image


ShortPlane
prewitt_x(UCharPlane image)
{ ShortPlane x(image.Size());

  // Compute the first derivative in the X axis direction
  // with a simple edge operator that applies this mask:
  //   -1 -1 -1
  //    0  0  0
  //    1  1  1

  x = image.South() - image.North();
  return x + x.West() + x.East();
}

Figure 4. C++ Class Library Example of
Neighbor Communication


A  user’s  ICAP  program  consists  of  a  set  of  entry  points;  the 
ACU  causes  the  intermediate  level  to  begin  execution  at  one 
of  these  entry  points  as  part  of  the  execution  of  the  user's 
ACU  program. 

Once begun, the ICAP program performs some complex
operation which may involve communication with other
portions of the IUA, communication among ICAP
processors, maintenance of a shared database, servicing
interrupts, and starting or interacting with additional threads
of control (tasks). The program performs these actions with
the help of the software components in the hierarchy.


// Segment 'equivalence classes' into regions by
// comparing the values of neighboring
// PEs and then label each region.

IntPlane
label_regions(UCharPlane eq_class)
{ Everywhere active;   // Ensure that every PE participates.
  BitPlane west (eq_class.Size());
  BitPlane east (eq_class.Size());
  BitPlane north (eq_class.Size());
  BitPlane south (eq_class.Size());
  BitPlane masters(eq_class.Size());
  IntPlane labels (eq_class.Size());

  // Determine the switch settings for the coteries.
  // Do not wrap regions around the grid edge.

  west  = (eq_class == eq_class.West()) &
          ~eq_class.WestEdge_p();
  north = (eq_class == eq_class.North()) &
          ~eq_class.NorthEdge_p();
  east  = (eq_class == eq_class.East()) &
          ~eq_class.EastEdge_p();
  south = (eq_class == eq_class.South()) &
          ~eq_class.SouthEdge_p();

  // Form the regions.

  { CoterieWENS pattern(west, east, north, south);

    // Select the active PE with the highest
    // address in a region.

    masters = (eq_class.Index()).RegionSelectMax();

    // Label each PE with the address of the master PE.
    labels = (eq_class.Index()).RegionBroadcast(masters);
  }

  return labels;
}

At the bottom of the hierarchy lies an ICAP processing
element. These processing elements are arranged into groups
of four (quadnodes) with each quadnode linked to every other
quadnode via the communication ports of the processing
elements (and by a shared memory structure).

On top of the processing elements we provide basic system
runtime support using SPOX, a real-time multi-tasking
operating system sold by Spectron Microsystems, Santa
Barbara, CA. SPOX is a widely used commercial product
for real-time applications on digital signal processors. It is
expressly designed to be easily ported to architectures such
as the IUA. SPOX provides basic system support functions
such as simple preemptive scheduling, software interrupts,
efficient I/O, management of multiple memory segments,
and other functions, with low overhead. SPOX provides the
basic tools with which we build the higher-level abstractions
that are appropriate to our programming domain, including
custom multi-tasking abstractions, and synchronization and
communication constructs.

The  Task  Management  layer  is  an  interface  to  many  of  the 
SPOX  features  concerning  tasks.  This  layer  provides  both 
basic  routines  to  start  tasks,  change  priorities,  and  check  on 
the  status  of  tasks,  and  more  abstract  routines  such  as  those 
for  executing  functions  as  separate  tasks  (i.e.,  "background 
processes").  In  addition,  this  layer  provides  a  framework 
within  which  interrupts  are  defined  and  attached  to  events, 
and  the  framework  in  which  a  user's  program  defines  its 
entry  points.  Finally,  it  defines  such  abstract  objects  as 
mailboxes,  monitors,  and  other  objects  to  manage 
synchronization. 

The Communication layer defines the basic routines for
communicating with other parts of the IUA. For
communication via the processor ports, it presents a simple
message passing interface; there are routines to construct
messages, have them sent to one or more destinations, and
to receive and dispatch them to the appropriate message
handler. For communication via the shared memories, this
layer presents both a message passing interface where certain
memory locations contain the message queues, and block
read/write functions. Both interfaces provide operations that
also hide much of the complexity of ICAP-CAAPP
communication when running programs with plane sizes
larger than the physical CAAPP array. The communication
layer also defines such abstract objects as global shared
variables.


Figure 5. C++ Class Library Example of a Coterie
Operation (Connected Component Labelling)


Figure 6. ICAP Software Hierarchy

Although the previous layers are sufficient for intermediate
level programming, the interfaces they provide are still fairly
primitive. The Intermediate Symbolic Representation (ISR)
Database layer provides a higher-level framework within
which the ICAP processes can work: that of a shared
database of tokens representing image components at various
levels of granularity. The usefulness of the ISR database as a
framework for image understanding programs has been
demonstrated through its use at the University of
Massachusetts, where research versions are being developed,
and as an integral part of KBVision, a commercial research
tool for image understanding programming that is a product
of Amerinex Artificial Intelligence. The ISR database
implemented in this layer is based on that of KBVision. The
most significant extensions are those to support a distributed
database with limited memory, issues that were not of
concern when developing KBVision's ISR. Section 2.3.1
describes the ISR and our extensions.

The final software component in Figure 6, the Programming
Environment, provides the tools necessary to run and debug
ICAP programs. These tools run on the host machine and
interact with the ACU and shared memories. They provide
support for loading ICAP programs onto the ICAP
processors, loading data into both the shared and local
memories, executing initialization routines, and saving
memory data after the program has run. Despite the support
of the other layers, we expect ICAP programs to be difficult
to debug; we therefore require good debugging tools. These
tools consist of a portion that runs on the host or ACU and
another portion that runs on the ICAP processors. Together
they support ICAP program I/O and possibly an interactive
debugger. When available, the debugger would allow the
setting of breakpoints, program single stepping, and
examining memory and registers. The debugger must also be
able to handle debugging programs running on multiple
ICAP processors simultaneously.

Our  layering  of  the  ICAP  level  software  provides  a  familiar 
programmer’s  model  (the  ISR  database)  for  doing  typical 
processing  while  providing  support  for  more  complex  but 
possibly  more  efficient  management  of  and  communication 
among  the  processes  running  at  the  ICAP  level. 

2.3.1  ISR  Database 

The ISR Database is a repository for information
representing abstract image events, such as lines, regions,
and edges. It provides tools for defining, storing, retrieving,
manipulating, filtering and organizing these events using an
interface derived from the ISR database in KBVision
[Amerinex, 1991]. Within the database a token describes an
image event with a set of features and its spatial location. A
tokenset is a group of similar tokens, such as the tokens
representing lines extracted by a particular algorithm. A
common operation is to find the tokens in a tokenset whose
feature values satisfy some criteria. The result of such an
operation is a tokensubset, a possibly empty subset of the
tokens in the initial tokenset.
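The relationship among tokens, tokensets, and tokensubsets can be sketched in a few lines of C++. This is a minimal illustration, not the KBVision or IUA interface: the type names, the feature map, and the function selectByValue are all assumptions made for the example. The one property taken from the text is that a tokensubset holds references into a tokenset (here, indices) rather than copies of the tokens.

```cpp
#include <cstddef>
#include <map>
#include <string>
#include <vector>

// Hypothetical token: named numeric features plus a bounding-box location.
struct Token {
    std::map<std::string, double> features;
    double minX, minY, maxX, maxY;
};

using Tokenset = std::vector<Token>;            // a group of similar tokens
using Tokensubset = std::vector<std::size_t>;   // indices into a Tokenset

// Return the (possibly empty) subset of tokens whose feature value lies in
// [lo, hi]. Tokens for which the feature is not defined never match.
Tokensubset selectByValue(const Tokenset& set, const std::string& feature,
                          double lo, double hi) {
    Tokensubset result;
    for (std::size_t i = 0; i < set.size(); ++i) {
        const auto it = set[i].features.find(feature);
        if (it != set[i].features.end() && it->second >= lo && it->second <= hi) {
            result.push_back(i);
        }
    }
    return result;
}
```

Because the subset stores only indices, feature data stays in the tokenset and is read through it when needed.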

On the IUA, a tokenset may be distributed over multiple
quadnode IPSMs (memory shared among the ICAP
processors in a single quadnode). In a typical operation,
ICAP processing begins by populating tokensets with
tokens. Since all ICAP processors generate tokens in
parallel based on their portion of the image data, the tokens
naturally distribute among the quadnodes. This distribution
provides a natural partition for parallel computation, but
complicates global processing on the tokenset. Our interface
supports use of the natural parallelism and hides many of the
details of global access.

After  populating  tokensets,  each  ICAP  typically  creates  its 
own  tokensubsets  for  analysis.  This  operation  requires 
communication  among  the  quadnodes  in  order  to  gather 
token  information  from  the  other  quadnodes.  An  ICAP 
broadcasts  a  request  to  every  quadnode  and  constructs  the 
tokensubset  with  the  token  information  that  is  returned. 
Since  tokensubsets  are  unordered,  an  ICAP  can  begin 
processing  a  tokensubset  as  soon  as  data  arrives,  suspending 
when  it  reaches  the  end  of  the  available  data.  In  the  general 
form  of  this  operation,  an  ICAP  broadcasts  to  the  quadnodes 
an  arbitrary,  although  previously  defined,  function  and 
receives  a  list  of  the  results,  which  may  or  may  not  be 
tokens.  The  first  form  gathers  data  for  processing  while  the 
second  distributes  processing  and  gathers  the  results.  The 
general  form  of  the  operation  may  be  more  efficient  than  the 
first  depending  both  on  the  relative  sizes  of  the  input  data 
and  the  results,  and  on  the  data  needs  of  future  operations.  A 
process  that  applies  a  function  to  a  small  number  of  tokens 
or  that  performs  multiple  operations  on  a  set  of  tokens  may 
want  to  gather  the  tokensubset  locally  and  perform  the 
operations.  Broadcasting  the  function  may  be  more  efficient 
for  a  process  that  applies  only  one  function  to  a  large 
number  of  tokens,  returning  a  small  result  (e.g.,  a 
summation). We provide an interface that hides many of the
details  of  the  tokensubset-specific  form.  This  interface  is 
based  on  the  DynamicList  object  that  implements  the  more 
generic  form. 

2.3.2  Tokensubset  Queries 

From the user's perspective, tokensubset creation on the
IUA is similar to that within KBVision; the user provides a
set of tokens and membership criteria and gets a tokensubset
in return. However, when using compound criteria, such as
those based on multiple token features, the KBVision user
creates a tokensubset by successively adding or removing
tokens from tokensubsets, while the IUA user creates a
criteria record describing the compound criteria and
broadcasts only one request.

Our interface allows tokensubset processing to begin before
the  entire  tokensubset  has  been  received.  From  the  user's 
perspective,  the  only  change  is  an  additional  parameter  to  all 
of  the  tokensubset  access  functions  that  specifies  whether  to 
block  or  return  when  the  requested  data  is  not  yet  available. 
Non-blocking  operations  give  a  "data  not  available"  return 
code  if  data  is  unavailable. 

A  tokensubset  is  a  simple  list  of  identifiers  of  tokens  in  a 
tokenset.  A  process  uses  these  identifiers  to  access  the 
tokens'  feature  data  stored  in  the  tokenset.  We  provide  a  form 
of  lazy  evaluation  of  feature  values  to  reduce 
communication.  If  a  process  attempts  to  access  the  feature  of 
a  token  stored  in  a  different  quadnode,  the  system 
communicates  with  the  remote  quadnode  to  get  the 
information.  We  then  cache  this  information  for  future 
reference.  A  user's  program  can  prefetch  feature  data  by 
specifying  a  list  of  features  when  broadcasting  its 
tokensubset  criteria.  The  user's  program  can  also  indicate 
which  features  are  or  are  not  of  interest  by  using  some 
additional directives. Each quadnode records to whom
information is sent so that their caches can be updated or
invalidated as features change.
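The lazy evaluation and caching of remote feature values can be sketched as follows. This is an illustrative model only: the class FeatureCache, its Fetch callback (standing in for a message exchange with the remote quadnode), and the invalidate method are assumptions, not the ISR implementation. It shows the behavior described above: the first access to a remote feature triggers communication, later accesses are served locally, and an invalidation forces a fresh fetch.

```cpp
#include <cstddef>
#include <functional>
#include <map>
#include <string>
#include <utility>

// Hypothetical cache of feature values owned by other quadnodes.
class FeatureCache {
public:
    using Key = std::pair<std::size_t, std::string>;  // (token index, feature name)
    // Stand-in for the request/reply exchange with the owning quadnode.
    using Fetch = std::function<double(std::size_t, const std::string&)>;

    explicit FeatureCache(Fetch fetchRemote) : fetchRemote_(std::move(fetchRemote)) {}

    // Return the feature value, communicating only on a cache miss.
    double get(std::size_t token, const std::string& feature) {
        const Key key{token, feature};
        const auto it = cache_.find(key);
        if (it != cache_.end()) return it->second;
        const double value = fetchRemote_(token, feature);
        cache_.emplace(key, value);
        return value;
    }

    // Called when the owning quadnode reports that the feature has changed.
    void invalidate(std::size_t token, const std::string& feature) {
        cache_.erase(Key{token, feature});
    }

private:
    Fetch fetchRemote_;
    std::map<Key, double> cache_;
};
```

Prefetching, as described in the text, would amount to filling the cache from the reply to the criteria broadcast before the first call to get.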

The IUA has a limited amount of memory for storing a
quadnode's tokens and caching other quadnodes' token data.
Long-lived image understanding processing requires some
form of garbage collection. Initially, the user's program will
be responsible for managing database memory. The user's
program running on the ACU is primarily responsible for
allocating and freeing tokensets. The ACU allocates
tokensets so that they are known to all of the ICAPs. When
some phase of processing is complete, the ACU frees the
tokensets that are no longer needed. Freeing a tokenset frees
the memory associated with the tokens in the tokenset and
frees any cached information about tokens in the tokenset.
Tokensubsets that are local to an individual ICAP must be
freed by that ICAP. Tokensubsets that have been stored as
part of a token feature will be deleted when the token's
tokenset is deleted. ICAPs also have control over their own
quadnode's token data cache through directives indicating
when tokens or token features are no longer needed.

Figure 7 shows some of the tokensubset criteria record
operations. This and the following examples use C++
syntax, although the decision on whether to use C++ or C
has not been made yet. When creating a criteria record, the
user specifies the tokenset whose tokens are candidates for
inclusion within the tokensubset. The user specifies the
tokensubset criteria with calls to the Add and Op operations.
The Add operation specifies one criteria and Op specifies the
boolean combination of multiple criteria. In the arguments to
Add, Operation specifies whether matching tokens should be
added to or removed from the result, and Test specifies which
of a number of tests to use to determine matches. As in
KBVision, the test can be one of the following:


• All: Match all tokens in the tokenset.

• Value: Match tokens whose value in a particular feature is
within some bounds.

• Location: Match tokens with a location feature that
intersects a specified rectangle.

• Undefined: Match tokens for which a particular feature
value is undefined.

• NotComputed: Match tokens for which a particular feature
value has not been computed.

• Criteria: Match tokens which match a previously defined
criteria record.

class TssCriteria
{   ...
    TssCriteria new ( TokenSet );
    void Add ( Operation, Test, TestArgs );
    void Op ( BooleanOp );
    Tokensubset Broadcast ( Destinations, BlockTimeOut );
};


Figure  7.  Tokensubset  Criteria  Operations 


The  choice  of  Test  determines  the  arguments  that  follow. 
All  of  the  tests,  with  the  exception  of  All  and  Criteria, 
require  a  feature  name  and  a  range  of  values  as  the  next 
arguments.  All  takes  no  additional  arguments  and  Criteria 
takes  another  criteria  record  as  the  following  argument. 

The  argument  to  Op  is  one  of  the  boolean  logic  operators: 
And,  Or,  or  Not.  These  operate  on  the  "stack"  of  criteria 
entered  with  the  Add  operation,  allowing  arbitrary  criteria 
combinations.  For  example,  the  following  represents  a 
disjunction  of  two  conjunctions: 

    crit.Add(...);
    crit.Add(...);
    crit.Op(And);
    crit.Add(...);
    crit.Add(...);
    crit.Op(And);
    crit.Op(Or);
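The stack discipline behind Add and Op can be sketched as a small postfix evaluator. This is a hedged illustration rather than the TssCriteria implementation: the class CriteriaStack, its use of predicates in place of the Test and TestArgs arguments, and the Matches method are assumptions, and the Operation argument (insert or remove on match) is omitted for brevity. Each Add pushes one test; And and Or pop two entries and push their combination; Not replaces the top entry.

```cpp
#include <functional>
#include <stdexcept>
#include <utility>
#include <vector>

enum class BooleanOp { And, Or, Not };

// Hypothetical stack-based criteria record over an arbitrary token type.
template <typename Token>
class CriteriaStack {
public:
    using Predicate = std::function<bool(const Token&)>;

    // Push one test onto the stack.
    void Add(Predicate test) { stack_.push_back(std::move(test)); }

    // Combine the top entries of the stack and push the result.
    void Op(BooleanOp op) {
        if (op == BooleanOp::Not) {
            Predicate a = pop();
            stack_.push_back([a](const Token& t) { return !a(t); });
            return;
        }
        Predicate b = pop();
        Predicate a = pop();
        if (op == BooleanOp::And) {
            stack_.push_back([a, b](const Token& t) { return a(t) && b(t); });
        } else {
            stack_.push_back([a, b](const Token& t) { return a(t) || b(t); });
        }
    }

    // A completed record leaves exactly one combined test on the stack.
    bool Matches(const Token& t) const {
        if (stack_.size() != 1) throw std::logic_error("criteria record is incomplete");
        return stack_.back()(t);
    }

private:
    Predicate pop() {
        if (stack_.empty()) throw std::logic_error("criteria stack underflow");
        Predicate p = std::move(stack_.back());
        stack_.pop_back();
        return p;
    }
    std::vector<Predicate> stack_;
};
```

With this model, the sequence Add, Add, Op(And), Add, Add, Op(And), Op(Or) yields the disjunction of two conjunctions, as in the example above.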

The  Broadcast  operation  sends  a  completed  criteria  record  to 
some  subset  of  the  quadnodes,  and  either  waits  for  replies  or 
returns  immediately,  as  specified. 

Tokensubset  intersection  and  union  are  local  operations 
invoked  on  tokensubsets  after  they  have  been  received. 
Thus,  these  operations  cannot  be  specified  in  criteria  records. 

The code in Figure 8 demonstrates the use of tokensubsets
and criteria records. This code creates a tokensubset
containing the tokens in the tokenset Lines that have start or
end points near the start or end points of the line keyLine
and have a contrast (indicated as a floating point number)
similar to that of keyLine. The first lines of code define the
criteria test parameters. The variables dist and conRange
define "near" and "similar to". The line beginning with
TssCriteria creates a new criteria record, crit, for specifying
match criteria for tokens from the tokenset Lines. The first
call to Add specifies a criteria that matches tokens in Lines
whose starting point is near keyLine's starting point. The
following three lines add criteria that match the other
combinations of start and end points. The Or operations
indicate that a token matches this criteria if either of its
points is near either of keyLine's points. The last Add
specifies a criteria that matches tokens with similar contrast.
The And operation that follows this Add specifies that
matching tokens must both be near and of similar contrast.


ISRFLOAT sminX, smaxX, sminY, smaxY;  // area around keyLine's startpoint
ISRFLOAT eminX, emaxX, eminY, emaxY;  // area around keyLine's endpoint
ISRFLOAT minCon, maxCon;              // range of acceptable contrast

sminX = keyLine.StartPoint.x - dist;
smaxX = keyLine.StartPoint.x + dist;
sminY = keyLine.StartPoint.y - dist;
smaxY = keyLine.StartPoint.y + dist;
eminX = keyLine.EndPoint.x - dist;
emaxX = keyLine.EndPoint.x + dist;
eminY = keyLine.EndPoint.y - dist;
emaxY = keyLine.EndPoint.y + dist;
minCon = keyLine.Contrast - conRange;
maxCon = keyLine.Contrast + conRange;

// Create Tokensubset membership criteria record
TssCriteria crit(Lines);  // criteria record for subset of tokens in Lines
crit.Add(InsertWhen, Location, "StartPoint", sminX, smaxX, sminY, smaxY);
crit.Add(InsertWhen, Location, "StartPoint", eminX, emaxX, eminY, emaxY);
crit.Op (Or);
crit.Add(InsertWhen, Location, "EndPoint", sminX, smaxX, sminY, smaxY);
crit.Op (Or);
crit.Add(InsertWhen, Location, "EndPoint", eminX, emaxX, eminY, emaxY);
crit.Op (Or);
crit.Add(InsertWhen, Value, "Contrast", minCon, maxCon);
crit.Op (And);

// Get Tokensubset
tss = crit.Broadcast(AllQuadNodes, NoBlock);
for (TssIndex = 0; tss.IsIndex(TssIndex, Block); TssIndex++)
{   TokenIndex = tss.GetTokenIndex(TssIndex, Block);
    xx = Lines.GetFeature (TokenIndex, "FeatureX", Block);
    < token computation >
}


Figure 8. Example of Tokensubset Usage

After completing the criteria record, we create the
tokensubset by broadcasting the criteria to all quadnodes
(including the local quadnode), returning without waiting for
the replies. The for-loop iterates over the tokens in the
tokensubset. IsIndex returns true if the index is within the
range of the tokensubset. If some replies are pending and the
index is beyond the range of the elements present, this
operation will block until an element appears for that index
position, or all replies have been received and the index is
still out of range. The first line in the body of the loop
determines the index within the tokenset Lines of the
specified element in the tokensubset. The next line uses the
tokenset index to get a particular feature. If the feature is not
available locally this operation blocks until it is received.
Replacing Block with Timeout(time) would allow these
operations to return with status "data not available" if the
data was not available within the specified timeout period.
NoBlock is synonymous with Timeout(0).
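The three outcomes of an indexed access (element present, element not yet available, index out of range) can be modeled without any real concurrency. The sketch below is a single-threaded illustration under assumed names: IncomingSubset, Status, onReply, and the finalReply flag are hypothetical and do not come from the ISR interface. It implements only the NoBlock behavior; a blocking or timed call would wait and retry for as long as the status remains DataNotAvailable.

```cpp
#include <cstddef>
#include <vector>

enum class Status { Ok, DataNotAvailable, OutOfRange };

// Hypothetical model of a tokensubset that fills in as replies arrive.
class IncomingSubset {
public:
    // expectedReplies: number of quadnodes that were sent the request.
    explicit IncomingSubset(int expectedReplies) : pending_(expectedReplies) {}

    // Called by the reply handler. finalReply marks the last message
    // from one quadnode for this request.
    void onReply(const std::vector<std::size_t>& tokenIndices, bool finalReply) {
        elements_.insert(elements_.end(), tokenIndices.begin(), tokenIndices.end());
        if (finalReply && pending_ > 0) --pending_;
    }

    // Non-blocking access: report whether the element exists, may still
    // arrive, or can never arrive because all replies are in.
    Status get(std::size_t index, std::size_t& tokenIndex) const {
        if (index < elements_.size()) {
            tokenIndex = elements_[index];
            return Status::Ok;
        }
        return pending_ > 0 ? Status::DataNotAvailable : Status::OutOfRange;
    }

private:
    std::vector<std::size_t> elements_;
    int pending_;
};
```

Distinguishing "not yet" from "never" is what lets the for-loop in Figure 8 terminate: iteration stops only once every reply has been received and the index is still beyond the last element.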

2.3.3  Dynamic  Lists 

Underlying  the  tokensubset  interface  are  dynamic  lists,  a 
general  request/reply  mechanism.  A  dynamic  list  object 
assists  a  program  with  broadcasting  a  request  to  other 
quadnodes  and  gathering  the  replies.  The  program  specifies  a 
function to run on the remote ICAP processor, and a handler
to append replies onto a list as they arrive. The system
broadcasts this request to the other quadnodes and routes the
replies  to  the  appropriate  handler.  After  sending  the  request, 
the  program  may  begin  processing  the  list  of  results  as  soon 
as  data  begins  to  arrive. 

Figure 9 demonstrates the use of dynamic list objects from
the caller's perspective. This function finds the token with
the greatest contrast. Rather than asking for a tokensubset
containing all of the tokens and then invoking
FindMaximum on the result, this function asks each
quadnode to compute the maximum of its local tokens and
reply with this token. The result is a list of tokens, one
from each quadnode. The function then invokes
FindMaximum on this smaller tokensubset and returns the
result. This function also asks for the Contrast and Length
features of each quadnode's maximum token.

TokenIndex FindMaximumContrast ( tokenset )
Tokenset tokenset;
{   DynamicList dynl;
    dynl.Function ( FindMax );
    dynl.AddArgs ( String, tokenset.Name );   // TokensetName
    dynl.AddArgs ( String, "Contrast" );      // Feature
    dynl.AddReplyFeature ( "Contrast" );      // ReplyFeatures
    dynl.AddReplyFeature ( "Length" );
    dynl.Handler ( TokensubsetHandler );      // Handler for replies

    tss = (Tokensubset) dynl.Broadcast (AllQuadNodes, NoBlock);
    maxEltTss = tss.FindMaximum (Contrast, Block);
    maxEltIndex = tss.GetTokenIndex ( maxEltTss, Block );
    return ( maxEltIndex );
}

Figure  9.  Example  of  Dynamic  List  Usage 

The FindMaximumContrast function first creates a new
dynamic list object. It then specifies the function to call and
its arguments. It specifies the features that should be
contained in any token in the reply, and a function to handle
replies. Since the replies will contain token information, the
function uses the handler for receiving tokensubset data; the
result of the handler will be a tokensubset (i.e., a list of
token indices). The function then broadcasts the request,
computes the maximum of the tokens in the resulting
tokensubset, and returns its tokenset index. The call to
Broadcast uses NoBlock so that the following call to
FindMaximum can begin its processing as soon as data
arrives. This call uses Block so that it waits for all the data
to arrive before returning its final result. Using a timeout
would let the function return the maximum of the data
already received, with a return status of "data not available".
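The two-stage reduction performed by FindMaximumContrast can be sketched independently of the messaging machinery. The code below is an assumption-laden illustration: Candidate, localMax, and globalMax are hypothetical names, and the quadnodes are modeled as in-memory vectors rather than remote processors. The point it shows is that each quadnode returns at most one candidate, so the requester reduces a list no longer than the number of quadnodes instead of the full tokenset.

```cpp
#include <cstddef>
#include <optional>
#include <vector>

// Hypothetical reply payload: a tokenset index plus the feature being maximized.
struct Candidate {
    std::size_t tokenIndex;
    double contrast;
};

// Stage 1 (would run on each quadnode): reduce the local tokens to at most
// one candidate. An empty quadnode contributes nothing.
std::optional<Candidate> localMax(const std::vector<Candidate>& tokens) {
    std::optional<Candidate> best;
    for (const Candidate& c : tokens) {
        if (!best || c.contrast > best->contrast) best = c;
    }
    return best;
}

// Stage 2 (would run on the requesting ICAP): reduce the per-quadnode replies.
std::optional<Candidate> globalMax(const std::vector<std::vector<Candidate>>& quadnodes) {
    std::vector<Candidate> replies;
    for (const auto& q : quadnodes) {
        if (const auto m = localMax(q)) replies.push_back(*m);
    }
    return localMax(replies);
}
```

Because maximum is associative, the same local routine can serve both stages, which mirrors the reuse of FindMaximum on the requester and inside FindMax on the remote quadnodes.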

Figure 10 shows the FindMax function that runs on the
remote quadnodes as a result of the dynamic list request. The
first argument contains information needed when replying to
the request. The system provides this argument when it calls
the function. The remaining two arguments were provided
when initializing the dynamic list object. The function gets
the indicated tokenset and uses a criteria record to create a
tokensubset containing the tokens for which the feature has
a defined value. The argument to Broadcast specifies that
only local tokens should go into the tokensubset. The function
then finds the token with the maximum feature value and
creates and sends a reply using the token's tokenset index.
The AddData operation automatically inserts the Contrast
and Length features into the reply message according to the
dynamic list request. The argument to Send can be either
Complete or Partial. Complete specifies that this is the last
reply from this processor for this request. Partial specifies
that more messages will follow. Partial replies allow the
originating ICAP to begin processing on partial results
without waiting for a long process to complete.


Figure 11 shows the handler for this dynamic list object.
The  system  calls  this  handler  for  each  reply.  The  handler 
returns  a  dynamic  list  which  is  given  back  to  it  with  the 
next  message.  The  second  argument  contains  the  data  in  the 
reply  (i.e.,  the  result  of  the  AddData  operation  in  Figure  10). 
The  handler  calls  cachePartialToken  to  cache  the  token 
feature  data  in  the  local  database,  and  appends  the  tokenset 
index  onto  the  dynamic  list. 


void
FindMax ( requestHeader, tokensetName, feature )
{   ...
    // Get the tokenset based on its name.
    tokenset = GetTokenset ( tokensetName );

    // Create Tokensubset - Match all
    // tokens with feature defined.
    {
        TssCriteria crit(tokenset);
        crit.Add ( InsertUnless, Undefined, feature );
        tss = crit.Broadcast ( Local );
    }

    // Get TokenIndex of token with maximum
    // value in feature.
    maxEltTss = tss.FindMaximum ( feature, Block );
    maxEltIndex = tss.GetTokenIndex ( maxEltTss, Block );

    // Create reply record.
    {
        DynamicListReply reply( requestHeader );
        reply.AddData ( PartialToken, tokenset, maxEltIndex );
        reply.Send ( Complete );
    }
}


Figure  10.  Example  Dynamic  List  Function  --  Return 
Token  With  Maximum  Feature  Value. 


DynamicList
TokensubsetHandler ( dl, msg )
DynamicList dl;
DynamicListMsg msg;
{
    while ( msg ) {
        switch ( msg->type ) {
        case PartialToken:
            cachePartialToken ( msg->partialTokenData );
            dl.Append ( msg->partialTokenData.TokensetIndex );
            break;
        default:
            error ( "unknown message data type" );
        }
        msg++;
    }
    return ( dl );
}


Figure 11. Example Dynamic List Reply Handler -
Tokensubset Handler


Figure 12. ICAP Communication Example
(panels: Local Quadnode, Remote Quadnode)


In  this  example  the  resulting  dynamic  list  is  a  tokensubset 
that  can  be  accessed  with  the  tokensubset  commands 
discussed  above.  In  other  cases,  the  dynamic  list  is  accessed 
through  its  own  commands,  including  indexed  access, 
mapping functions to invoke a function on each element,
status  functions  (e.g.,  current  length  and  replies  pending), 
and  list  deletion. 

2.3.4  Low-level  Communication 

Figure 12 depicts the communication resulting from a
tokensubset query. The communication lines between
quadnodes are distributed among all of the processors in a
quadnode. Thus, to send a message to a particular quadnode,
the message must first be sent to the processor on the
sending quadnode that has the line to the receiving quadnode.
Broadcasting messages to all quadnodes requires giving the
message to all of the processors on the sending quadnode.

At the top of Figure 12 the user's program invokes the
Broadcast operation on a criteria record. This operation is
part of the ISR client library and runs as a subroutine call
within the user's program. It distributes the criteria record to,
and requests service of, each processor on the local quadnode.
The ISR server task running on each processor responds to
the service request by sending the criteria record to each of
its connected quadnodes and preparing to respond to their
replies. In addition, one processor is responsible for
servicing the criteria request locally as if it had come from a
remote processor. As replies arrive, the ISR server task
routes the messages to the appropriate handler and notifies
any blocked tasks that more data is available. On a remote
quadnode, an ISR server task running on the processor with
the link to the originating quadnode receives the request and
either services it directly or forwards it to another processor
in the quadnode. A reply is then sent back to the originating
quadnode. If the remote quadnode forwards the request to
another processor, that processor must forward the reply to
the processor with the link to the originating quadnode,
which then sends the reply. The shaded boxes in the figure
represent the optional forwarding operations.

All communication among ICAP processors behaves in a
manner similar to that of a tokensubset query. Dynamic list
requests are nearly identical. For point-to-point
communication (from one processor to a particular processor
on a remote quadnode) the ISR library routine gives the
request to the single processor with the link to the remote
quadnode, rather than to all processors; the remainder of the
communication is unchanged.
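The point-to-point path described above can be traced with a short routing sketch. Everything specific in it is an assumption made for illustration: the text does not say which processor owns the link to which quadnode, so the linkOwner rule below (remote quadnode number modulo four) is purely hypothetical, as are the Address type and the route function. The sketch shows only the structure of the path: an optional hand-off inside the sending quadnode, one hop across the inter-quadnode link, and optional forwarding inside the receiving quadnode.

```cpp
#include <vector>

// Hypothetical processor address: quadnode number and processor 0..3 within it.
struct Address {
    int quadnode;
    int processor;
};

// ASSUMPTION for illustration only: the link to quadnode q is owned by
// processor q % 4. The real assignment of links is not given in the text.
int linkOwner(int remoteQuadnode) { return remoteQuadnode % 4; }

// Return the sequence of processors a point-to-point message visits.
std::vector<Address> route(Address src, Address dst) {
    std::vector<Address> hops{src};
    if (src.quadnode == dst.quadnode) {
        // Same quadnode: delivered locally (e.g., through shared memory).
        if (src.processor != dst.processor) hops.push_back(dst);
        return hops;
    }
    const int out = linkOwner(dst.quadnode);   // sender-side link owner
    if (out != src.processor) hops.push_back({src.quadnode, out});
    const int in = linkOwner(src.quadnode);    // receiver-side link owner
    hops.push_back({dst.quadnode, in});
    if (in != dst.processor) hops.push_back(dst);  // optional forwarding
    return hops;
}
```

A reply would retrace the same path in reverse, which is why a forwarded request obliges the servicing processor to hand its reply back to the processor that owns the link.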

2.4  Inter-level  Communication 

In this section we describe our plans for other aspects of the
software environment being constructed for the IUA. Please
note that because the CAAPP is a SIMD array of processing
elements whose instructions are generated by the ACU, we
treat the ACU and the CAAPP as being one unit in this
discussion.

2.4.1 Between the Host and ACU/CAAPP

The Host to IUA connection is that of a general purpose
computer with a special purpose attached processor.
Communication consists of requesting that the IUA,
through the ACU, perform some task and return the results
of performing that task back to the host. That is, the IUA is
an allocated device that performs very complex tasks
instantiated in the form of large programs that run on the
ACU. Therefore, the host must have a means of initiating
tasks on the ACU and then communicating with those tasks
as they are running. The host will be executing processes
under some version of Unix. The ACU will be executing
tasks under some real-time executive (such as VxWorks).
Communication will be in the form of messages sent via
sockets on the host side and implemented through library
functions callable from higher level languages. On the ACU
side, this communication will utilize the same semantics
but be implemented via a separate library callable from
programs written using the Class Library for the IUA.

In a developmental environment, the users will interact with
the host and thus with their tasks on the ACU through
normal input and print statements. The library for programs
running on the ACU will implement the standard C++
stream library functions to provide this interactive facility.

2.4.2  Between  the  ACU/CAAPP  and  ICAP 


2.4.4  Between  the  Host  and  ICAP 

In  normal  cases,  we  do  not  expect  that  the  host  will  have  a 
need  to  directly  communicate  with  the  ICA?  level  of  the 
lUA.  While  the  host  has  access  to  some  of  the  same 
memory  that  is  available  to  the  processors  at  the  ICAP 
level,  issues  of  synchronization  make  it  unlikely  that  this 
facility  will  be  us^. 

3.  Hughes  Research  Laboratories 

Efforts  at  Hughes  Research  Laboratories  have  been  directed 
mostly  towards  debugging  the  lUA  prototype  hardware, 
designing  the  ACU  for  the  second  generation  lUA,  the  new 
custom  CAAPP  chip,  and  the  overall  second  generation  lUA 
architecture. 


There are three mechanisms for communication between the
ACU/CAAPP and the ICAP: via the shared memory layers
above and below the ICAP, and via ACU broadcast.

There is a layer of memory, called the CAAPP ICAP Shared
Memory (CISM), that resides between the CAAPP and
ICAP levels of the IUA and is read/write accessible by both.
This memory is used by the ACU/CAAPP to store planes,
which makes those planes available to the ICAP processors
and provides a means of communication between these
levels.

The ACU is capable of broadcasting information to all
ICAP processors simultaneously. Programs to be run at the
ICAP level are loaded and initiated using this mechanism.
Issues of synchronization are handled by library routines
available to programs written using the Class Library for the
IUA. These routines are based on the broadcast mechanism
and the ability to interrupt the ICAP processors.

The ACU also has access to another layer of memory, called
the ICAP-SPA Shared Memory (ISSM), accessible by the
ICAP and SPA processors. ISSM is addressable from the
ACU using a function based upon an ICAP processor's
address. It may be used by the ACU to make requests to a
particular ICAP processor. The same memory may be used
by an ICAP processor to return results or make requests to
the ACU (in conjunction with CISM and the CAAPP
global feedback mechanism).

2.4.3  Between  the  ACU/CAAPP  and  Sensors 

The ACU determines when input images are sent to the
Host-CAAPP Shared Memory (HCSM) and when images
are sent from the HCSM to the outside world. This control
is exercised by the user's program through another library
that contains routines to control the IDTS (Image Data
Transfer System). These routines control the underlying
hardware, accessing it via the VME bus. The host also has
access to the hardware through the VME bus, but issues of
synchronization with the CAAPP require that only the ACU
exercise this control.


The IUA prototype became operational in June of 1991 but,
as with most prototype efforts, several problems were
encountered. Two of the more serious problems were related
to subtle errors in the custom CAAPP chip that were not
detected in the circuit simulations. One of the errors
involves a control line that passes underneath a portion of
the on-chip memory and can cause bits to be lost due to
parasitic effects. This error has since been corrected in the
second generation CAAPP chip. The second problem
involves ground-loop noise due to the spacing of ground
pins in the carrier, and will be alleviated by rerouting all
ground lines to the inner ring of pins in the second
generation chip.

Other problems included resolving interference between Unix
and the software-initiated memory refresh (refresh is now
generated in hardware), compensating for clock skew in the
system, and repairing numerous unreliable solder joints.

A preliminary version of the C++ class library, together
with the IUA prototype simulator, has been used to develop
a missile-tracking related demonstration which has been run
successfully on the hardware. In addition, numerous testing
and diagnostic routines have been run, and further software is
being developed to exhaustively exercise the prototype.

As stated in section 1, a new ACU has been designed that
will include a 128-bit horizontal microengine built from
AMD 29000 series bit-slice logic. The microengine contains
much of the run-time library for the IUA, and is capable of
issuing instructions to the CAAPP and ICAP arrays as
quickly as they can accept them, and with very little
overhead. The instruction issue rate of the ACU is decoupled
from the execution rate, and instructions are actually issued
asynchronously.

The ACU also contains a "macroengine" consisting of a
single-board computer based on a SPARC processor. The
macroengine executes the high-level control portion of the
user's program and issues instructions to an abstract machine
consisting of the microengine and its subroutine library.
Thus, a macroengine command might be to perform floating
point division of one plane by another, and the microengine
will expand this into the appropriate stream of instructions
for the CAAPP array.

Hughes Research Laboratories has also participated in the
design of the second generation IUA architecture, developing
a separate initial proposal from that of UMass. Ideas from
both proposals were combined into the design presented in
section 1 and, as mentioned there, a few of the details of the
design remain to be ironed out, but we expect it to be
completed before the end of 1991.

Other efforts at Hughes include the design of the IDTS
(which is partially dependent on the specification of the
DARPA UGV sensor suite), and advanced packaging for the
IUA to further reduce its size while increasing the number of
processors.

4.  Conclusions 

The Image Understanding Architecture is undergoing
significant change with the development of the second
generation. Both the hardware and the software are being
substantially enhanced. We expect the new system to be
computationally more powerful than the prototype, and to
be much easier to use for vision applications. In particular,
the new IUA is being targeted for use in the DARPA UGV
program, which will present a wide range of applications
challenges, and no doubt lead to further insight into the
means for exploiting the potential parallelism in image
understanding.

the role of perceptual

Abstract 


Recent work in computer vision is based on the assumption
that edge detection precedes grouping and object
recognition. We (like others) argue against such an
assumption and suggest that grouping precede the
computation of discontinuities and most other early
visual tasks.

We also present a perceptual organization scheme that
works without edges. Our scheme includes a new ridge
detector that is non-linear and independent of scale.

Results will be shown of a Connection Machine
implementation of our scheme for perceptual organization
(without edges) using color (the scheme is designed to
work also for brightness and texture).

1  Introduction 

Perceptual organization is a process that establishes
what regions of the image come from one single object
(of interest if possible), generally with very little detailed
knowledge of the particular objects present. Recent work
on computer vision has emphasized the role of edge
detection and discontinuities in perceptual organization
and recognition. This line of work stresses that edge
detection should be done at an early stage on a brightness
representation of the image, and segmentation and other
early vision modules operate later on (see Figure 1 left).
In this paper, we (like others) argue against such an
approach and present a scheme that segments an image
without finding brightness, texture, or color edges (see
Figure 1 right). In our scheme discontinuities are found
as a byproduct of the perceptual organization process.

Segmentation without edges is not new. Previous
approaches fall into two classes. The first class is based on
coloring  or  region  growing  [Hanson  and  Riseman  1978], 
[Horowitz  and  Pavlidis  1974],  [Haralick  and  Shapiro 
1985].  These  schemes  proceed  by  laying  a  few  “seeds”  in 
the  image  and  then  growing  these  until  a  complete  re¬ 
gion  is  found.  The  growing  is  done  using  a  local  thresh¬ 
old  function.  These  growing  schemes  are  limited  in  two 
ways:  First,  the  growing  function  does  not  incorporate 
global  factors  which  results  in  fragmented  and  super¬ 
posed  regions  (see  Figure  2).  Second,  there  is  no  way 
to  incorporate  a  priori  knowledge  on  the  shapes  that 
we are looking for. For example, the Gestalt principles
extensively used in current grouping algorithms have
not been incorporated in existing coloring algorithms.
In  this  paper  we  present  a  non-local  perceptual  orga¬ 
nization  scheme  that  uses  no  edges  and  which  embod¬ 
ies  gestalt  principles  such  as  symmetry,  convexity  and 
proximity. The second class of segmentation schemes
which work without edges are based on volumetric
descriptions of shapes. Examples include: [Badler and
Bajcsy 1978], [Binford 1971], [Brooks, Russel and Binford
1979], [Brooks 1981], [Guzman 1968], [Pentland 1988]
and  [Waltz  1975].  These  schemes  make  strong  assump¬ 
tions  about  the  three-dimensional  primitives  used  to  de¬ 
scribe  the  data.  We  believe  that  this  is  too  strong  a 
commitment  to  be  a  representation  for  late  vision.  It 
may  be  adequate  for  a  restrictive  set  of  recognition  tasks 
but  not  in  general.  Our  scheme  conforms  more  closely 
to  the  data  than  these  schemes  and  does  not  impose  any 
three-dimensional  interpretation  of  the  data. 
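
The seed-and-grow procedure of the first class can be sketched as follows. This is only a minimal illustration, not a reproduction of any of the cited algorithms: it assumes a grayscale image stored as a list of rows, 4-connectivity, and a local threshold on the difference between adjacent pixels. The function name `grow_region` and these choices are ours.

```python
from collections import deque


def grow_region(image, seed, threshold):
    """Grow a 4-connected region from `seed` (row, col).

    A neighbor joins the region when its value differs from the
    adjacent, already accepted pixel by at most `threshold`.
    """
    rows, cols = len(image), len(image[0])
    region = {seed}
    frontier = deque([seed])
    while frontier:
        r, c = frontier.popleft()
        for dr, dc in ((1, 0), (-1, 0), (0, 1), (0, -1)):
            nr, nc = r + dr, c + dc
            inside = 0 <= nr < rows and 0 <= nc < cols
            if inside and (nr, nc) not in region:
                # Purely local test: no global or shape information is used.
                if abs(image[nr][nc] - image[r][c]) <= threshold:
                    region.add((nr, nc))
                    frontier.append((nr, nc))
    return region


step = [[0, 0, 0, 9, 9, 9]]   # sharp discontinuity
ramp = [[0, 1, 2, 3, 4, 5]]   # slow drift, no single large jump
step_region = grow_region(step, (0, 0), 1)
ramp_region = grow_region(ramp, (0, 0), 1)
```

On the step row the region stops at the discontinuity. On the ramp every local test passes, so the region spreads over the whole row although its end values differ by 5, which illustrates why a growing function without global factors can merge or fragment regions.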

The scheme that we will present is an extension of the
brightness-based perceptual organization scheme of
[Subirana-Vilanova 1990]. That scheme is based on a
filter-based ridge detector which has a number of important
problems that we will discuss. These include its dependence
on scale and its sensitivity to curved shapes. This poses a
number of additional constraints to be taken into account.
Our analysis will lead us to a filter that we will show
overcomes most of these problems. The filter is based on a
non-linear combination of two (otherwise linear) filters.
Our scheme is designed to work for brightness, texture and
color but our implementation deals only with color. Color
is an interesting case to study because it is a
three-dimensional property, not one-dimensional like
intensity. Therefore, the extension of brightness-based
schemes to color is non-trivial.

Figure 2: Left: An image of a shirt. Center: Original seeds
for a region growing segmentation algorithm. Right: Final
segmentation obtained using a region growing algorithm.

Figure 3: Left: Model of an edge. Right: Model of a ridge
or box. Are these appropriate?

We  begin  in  the  next  section  by  listing  several  reasons 
for  exploring  non-edge  based  schemes  and  then  present 
our  approach.  Results  of  a  version  of  our  scheme  imple¬ 
mented  on  the  Connection  Machine  will  be  shown. 

2  In  favor  of  regions 

What  is  an  edge?  Unfortunately  there  is  no  agreed 
definition  of  edge.  It  can  be  defined  in  several  related 
ways:  as  a  discontinuity  in  a  certain  property,  as  some¬ 
thing  that  looks  like  a  step  edge  [Canny  1983]  (see  Fig¬ 
ure  3),  or  by  an  algorithm  (e.g.  zero-crossings  [Marr 
and  Hildreth  1980]).  Characterizing  edges  has  proven  to 
be  difficult  especially  near  corners,  junctions  and  when 
there  are  edges  at  multiple  scales,  noise  or  transparent 
surfaces. 

What  is  a  region?  Attempting  to  define  regions  bears 
problems  similar  to  those  encountered  in  the  definition  of 
an  edge.  Roughly  speaking,  it  is  a  collection  of  pixels  in 
an  image  that  share  a  common  property.  In  this  context, 
an  edge  is  the  border  of  a  region.  But  how  can  we  find 
regions  in  images?  We  could  proceed  in  a  similar  way  as 
with  edges,  so  that  a  region  be  defined  (in  one  dimension) 
as  a  structure  that  looks  like  a  box  (see  Figure  3).  But 
this  suffers  from  problems  similar  to  the  ones  mentioned 
for  edge  detection. 

Thus, regions and edges are two closely related concepts.
It is unclear how we should represent the information
contained in an image. As regions? As edges? Furthermore,
independently of our choice, which structures should we
try to recover first? Edges or regions? We believe that
computer vision has overemphasized the early computation
of discontinuities (whether brightness, texture or color
discontinuities). Here are some reasons why exploring the
computation of regions (without edges) may be a promising
approach (see [Subirana-Vilanova and Sung 1991] for a
more comprehensive discussion):

Figure 4: Edges computed at three different scales for
an image of a person. Note that the results are notably
different. Which scale is best?

•  There is psychological evidence that humans can
recognize images with region information better
than line drawings ([Cavanaugh 1991], but see also
[Ryan and Schwartz 1956], [Biederman and Ju
1988]).

•  Representations  which  maintain  some  region  infor¬ 
mation  such  as  the  sign-bit  of  the  zero  crossings 
(instead  of  just  the  zero  crossings  themselves)  are 
useful  for  perceptual  organization. 

•  The performance of most rigid-object schemes is
bounded by the complexity of the feature space
used for exploring possible matches. Additional
region groups should reduce such complexity.

•  Previous research on recognition has focused on
rigid objects. Related grouping research has focused
on finding small sets of features with high
likelihood of coming from the same object. For
non-rigid objects this is not sufficient. Instead, it
is necessary to group most of the features coming
from a single object. We find it hard to believe
that edge-features will be sufficient for bottom-up
grouping in this case.

•  Scale and stability are recognized as important
problems. However, is it the stability and scale of
an edge, or that of a region, that we are interested
in? Our scheme addresses stability in terms of
objects (not edges). In addition, our scheme commits
to one scale corresponding to the object of interest
chosen by our scheme.

3  Regions?  What  Regions? 

In the last section we have set forth an ambitious goal:
develop a perceptual organization scheme that works on
the image itself, without edges. But what constitutes a
good region? What “class” of regions ought to be found?
How can we know if our scheme is performing properly?

Our work is based on the observation that many objects
in nature (or their parts) have a common color or
texture, and are long, wide, symmetric and convex. This
hypothesis is hard to verify formally, but it is at least
true for a collection of common objects [Snodgrass and
Vanderwart 1980] used in psychophysics. And as we will
show, it can be used in our scheme yielding seemingly
useful results. In addition, humans seem to organize
the visual array using these types of principles, as
demonstrated by the Gestalt psychologists. In fact, these
were the starting point for much of the work in computer
vision on perceptual organization for rigid objects. We
use these same principles but in a different way: without
edges and with non-rigid shapes in mind.

4  Color,  Brightness  Or  Texture?

The  perceptual  organization  scheme  presented  in  this  pa¬ 
per  includes  color,  brightness  and  texture.  We  decided 
to  implement  it  on  color  first,  without  texture  or  bright¬ 
ness.  Color  based  perceptual  organization  (without  the 
use  of  other  cues)  is  indeed  possible  for  humans  since 
two  adjacent  untextured  surfaces  viewed  under  isoillu¬ 
minance  can  be  segmented.  (Although  the  human  visual 
system  has  certain  limitations  in  isoilluminant  displays, 
e.g.  [Cavanaugh  1987].)  And,  as  we  will  discuss  later  in 
the  paper,  color  is  also  useful  when  there  are  brightness 
changes. 

Color is a perceived property of a surface that, under
normal conditions, depends mostly upon surface spectral
reflectance and very little on the spectral characteristics
of the light that enters our eyes. It is therefore useful
for describing the material composition of a surface
(independently of its shape and imaging geometry) [Rubin
and Richards 1981]. Lambertian color is indeed uniform
over most untextured physical surfaces, and is stable in
shadows, and under changes in the surface orientation or
the imaging geometry. In general it is more stable than
texture or brightness.

It has long been known that the perceived color (or
intensity) at any given image point depends on the light
reflected from the various parts of the image, and not
only on the light at that point. This is known as the
simultaneous-contrast phenomenon and has been known
at least since E. Mach reported it at the beginning of the
century. [Marr 1982] suggests that such a strategy may
be used because one way of achieving some compensation
for illuminance changes is by looking at differences rather
than absolute values. According to this view, a surface
is yellow because it reflects more “yellow” light than a
blue surface, and not because of the absolute amount
of yellow light reflected (of which the blue surface may
reflect an arbitrary amount depending on the incident
light).

The  exact  algorithm  by  which  humans  compute  per¬ 
ceived  color  is  still  unclear.  Our  scheme  only  requires  a 
rough  estimate  of  color  which  is  used  to  segment  the  im¬ 
age,  see  Figure  1.  We  believe  that  perceived  color  should 
be  computed  at  a  later  stage  by  a  process  similar  to 
the  ones  described  in  [Helson  1938],  [Judd  1940],  [Land 
and  McCann  1971].  This  model  is  in  line  with  the  ones 
presented  in  [Subirana-Vilanova  and  Richards  1991]  and 
[Jepson  and  Richards  1991]  which  suggest  that  percep¬ 
tual  organization  is  a  very  early  process  which  precedes 
most  early  visual  processing. 

In our images, color is entered in the computer as
a “color vector” with three components: the red, green
and blue channels of the video signal. Our scheme works
mostly on color differences between pairs of pixels c
and c_r. The difference that we used is defined in
equation 1 and was taken from [Sung 1991] (⊗ denotes the
vector cross product operation) and responds very
sensitively to color differences between similar colors [Sung

Figure 5: The similarity measure described in Equation 1
is illustrated here for an image of a person. Left: Image.
Center: Similarity measure using as reference color the
color of the pixel located at the intersection of the two
segments shown. Right: Plot of the similarity measure
along the long segment using the same reference color.

This similarity measure is a decreasing function with
respect to the angular color difference. It assigns a
maximum value of 1 to colors that are identical to the
reference “ridge color”, c_r, and a minimum value of 0 to
colors that are orthogonal to c_r in the RGB vector space.
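
Equation 1 itself is not reproduced in the text above, so the following is only a sketch of one measure that satisfies the stated properties, not necessarily the measure of [Sung 1991]. It assumes the similarity is one minus the sine of the angle between the two RGB vectors, obtained from the magnitude of their cross product. The function names are ours.

```python
import math


def cross(a, b):
    """Cross product of two 3-vectors."""
    return (a[1] * b[2] - a[2] * b[1],
            a[2] * b[0] - a[0] * b[2],
            a[0] * b[1] - a[1] * b[0])


def norm(v):
    """Euclidean length of a vector."""
    return math.sqrt(sum(x * x for x in v))


def color_similarity(c, c_ref):
    """Assumed measure: 1 - |c x c_ref| / (|c| |c_ref|) = 1 - sin(angle).

    Returns 1 for colors parallel to the reference and 0 for colors
    orthogonal to it in RGB space.
    """
    denom = norm(c) * norm(c_ref)
    if denom == 0.0:
        return 0.0  # assumption: a black pixel carries no usable color
    return 1.0 - norm(cross(c, c_ref)) / denom
```

Because only the angle enters, scaling either vector (a pure brightness change) leaves the value unchanged. The sine also varies linearly for small angles, which is consistent with the remark that the measure responds sensitively to differences between similar colors.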

The discriminability of this measure can be seen
intuitively by looking at the normalized image (see Figure 5).
The exact nature of this measure is not critical to our
algorithm. What is important is that when two adjacent
objects have different perceived color (in the same
background) this measure is positive¹. Many other measures
have been proposed in the literature and could be
incorporated in our scheme. What most color similarity
measures have in common is that they are based
on vector values and cannot be mapped onto a
one-dimensional field [Judd and Wyszecki 75]². This makes
color perception different from brightness from a
computational point of view since not all the one-dimensional
techniques used in brightness images extend naturally to
higher dimensions.

5  Previous  work  on  perceptual 
organization  without  edges 

The scheme that we present in this paper is an extension
of Curved Inertia Frames (CIF), a brightness-based
segmentation scheme presented in [Subirana-Vilanova
1990], which in turn is an extension of an edge-based
perceptual organization scheme presented in the same
paper. We chose this scheme for two reasons. First, it
is the only existing scheme that can compute global
regions directly on the image without imposing a
three-dimensional representation of the data. Second, we have
been able to overcome a number of problems in the
scheme so that it is useful for a large class of images.

¹ Note that the perceived color similarity among arbitrary
objects in the scene will obviously not correspond to
this measure, especially if we do not take into account the
simultaneous-contrast phenomenon.

² Note that using the three channels, red, green and blue,
independently works for some cases. However, it is possible
to construct cases in which it does not, as when an object
has two discontinuities, one in the red channel only and the
other in one of the other two channels only. In addition, the
perceived similarity is not well captured by the information
contained in the individual channels alone but on the combined

[Subirana-Vilanova 1990]'s scheme proceeds in three
stages. In the first one, it computes two local measures
at each point p for a number of orientations θ: the inertia
value I(p, θ) and the tolerated length T(p, θ). These two
local values are derived from the output of elongated Gabor
filters and are used to associate a saliency measure to
each curve C(t) in the image plane as defined in
equation 2, where the curve is assumed to be parameterized
between 0 and L, I(t) (T(t)) is the inertia value
(tolerated length) at the point with parameter t and with the
orientation of the curve at that point, and ρ and α are
suitable constants.


S_L = \int_0^L I(t) \, \rho^{\ldots} \, dt        (2)


In the second stage, the scheme computes the skeleton
which yields the maximum saliency using the network
introduced by [Shashua and Ullman 1988]. In fact, the
form of equation 2 closely matches what the network can
compute. The inertia value and the tolerated length can
be used in the second stage using other schemes such as
[Kass, Witkin and Terzopoulos 88] and [Zucker, Dobbins
and Iverson 89].

The scheme favors curves which are long, smooth
(according to the associated tolerated length values) and
central to the shape (i.e. which have high inertia values).
This second stage yields the skeleton sketch, a
representation of the potential skeletons in the image. See
[Subirana-Vilanova 1990], [Subirana-Vilanova 1991] for
more details.

In  the  third  stage,  the  scheme  computes  a  succession 
of  individual  curves  (or  skeletons)  and  the  corresponding 
perceptual  groups  by  growing  outward  from  the  skele¬ 
tons. 

The  extension  of  this  scheme  to  color  seems  at  first 
quite  obvious:  use  elongated  “vector”  filters  to  compute 
the  inertia  values  and  the  tolerated  length.  However, 
there  are  a  number  of  important  problems  with  such  an 
approach.  These  will  be  discussed  in  the  next  section. 
One  of  the  contributions  of  this  paper  is  a  new  way  of 
estimating  these  local  parameters  by  defining  a  novel 
one-dimensional  multi-scale  color  ridge  detector  that  ad¬ 
dresses  the  problems  with  previous  approaches. 


6  Problems  in  Finding  Brightness 
Ridges 

The  one-dimensional  version  of  the  problem  that  we  are 
trying  to  solve  is  to  locate  ridges  in  a  signal.  By  ridge 
we  mean  something  that  looks  like  a  pair  of  step  edges 
(see  Figure  3).  A  simple-minded  approach  is  to  find 
the  edges  in  the  image,  and  then  look  for  the  center  of 
the  two  edges.  This  was  the  approach  used  in  [Subirana- 
Vilanova  1990].  Another  possibility  is  to  design  a  filter  to 
detect  such  a  structure  as  in  [Canny  1985],  [Noble  1988]; 
this  was  the  essence  of  the  brightness  based  approach 
used  in  [Subirana-Vilanova  1990]. 

However, there are a number of problems with using
such filters as estimators for ridge detection. These
problems are not particular to [Subirana-Vilanova 1990]'s
scheme, but are linked to the nature of ridges in real
images. Some of these problems are in fact very similar for
color and for brightness images. [Canny 1983] developed
an optimal one-dimensional operator for edge detection
which he then extended to two-dimensional images with
remarkable performance.

Figure 6: Left: Plot with multiple steps. A ridge detector
should detect three ridges. Right: Plot with narrow valleys.
A ridge detector should be able to detect the different lobes
independently of the size of the neighboring lobes.

[Canny 1983] used a similar methodology to find
ridges. His model of a ridge was similar to the one shown
in Figure 3. This is a limited model since ridges in images
are not well suited to it. Perhaps the most evident
reason why such a model is not realistic is the fact that it
is tuned to a particular scale. However, in most images,
ridges appear at multiple and unpredictable scales. This
is not so much of a problem in edge-detection, as we have
discussed in the previous sections, because the edges of
a wide range of images can be assumed to have “a very
similar scale”. Thus, Canny's ridge detector works only
on images where all ridges are of the same scale. This is
true in the text images shown in [Canny 1983] (see also
Figure 8) and in the images used by [Subirana-Vilanova
1990].


Therefore,  an  important  feature  of  a  ridge  detector 
is  its  scale  invariance.  We  now  summarize  a  number  of 
important  features  that  a  ridge  operator  should  have  (see 
Figure  6): 

•  Scale:  See  previous  paragraph. 

•  Non-edgeness:  The  filter  should  give  no  response 
for  a  step  edge.  This  property  is  violated  by  [Canny 
1985]. 

•  Multiple steps: The filter should also detect small
steps. These are frequent in images, for example
when an object is occluding the space between two
other objects. This complicates matters in color
images because the surfaces are defined by vectors,
not just scalar values.

•  Narrow  valleys:  The  operator  should  also  work  in 
the  presence  of  multiple  ridges  even  when  they  are 
separated  by  small  valleys. 

•  Noise:  As  with  any  operator  that  is  to  work  in  real 
images,  tolerance  to  noise  is  necessary. 

•  Localization:  The  output  of  the  ridge-detector 
should  be  higher  in  the  middle  of  the  ridge  than 
on  the  sides. 

•  Strength:  The  strength  of  the  response  should  be 
somehow  correlated  with  the  strength  of  the  per¬ 
ception  of  the  ridge  by  humans. 

•  Large scales: Large scales should receive higher
response. This is a property used by
[Subirana-Vilanova 1990]'s scheme and is important because
it embodies the preference for large objects.

Figure 7: Top Left: Gaussian second derivative, an
approximation to Canny's optimal ridge detector. Top Center
and Right: Individual one-dimensional masks used by
our operator. Bottom: Response for an edge (left) and
a ridge (right) for: the left operator of our ridge detector
(second row), the right operator (third row) and the
combined minimum of the two responses (fourth row). In
all cases, sigma is roughly the size of the ridge. This is
a qualitative demonstration. It illustrates why combining
the responses of two filters yields a positive response for a
ridge but no response for an edge.


7  A  color  ridge  detector 

In the previous section we have outlined a number of
properties that we would like our ridge-detector to have.
As we have mentioned, the Canny ridge-detector fails
because it cannot handle multiple scales. A naive way
of solving this problem would be to apply the Canny
ridge detector at multiple scales and define the output
of the filter at each point as the response at the scale
which yields a maximum value. This filter would work
on a number of occasions but has the problem of giving a
response for step edges (since the ridge-detector at any
single scale responds to edges, so will the combined
filter).
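
The edge response of a single-scale ridge filter can be checked numerically. The sketch below is only illustrative: it uses the negated second derivative of a Gaussian as a stand-in for Canny's ridge operator (the approximation named in Figure 7), works on a one-dimensional brightness signal, and extends the signal by repeating its end values. The helper names and the test signals are our own assumptions.

```python
import math


def neg_gauss_second_derivative(sigma):
    """Samples proportional to -G''(x) on [-4*sigma, 4*sigma].

    The kernel has a positive central lobe of half-width sigma
    flanked by two negative side lobes.
    """
    half = int(4 * sigma)
    kernel = []
    for x in range(-half, half + 1):
        g = math.exp(-x * x / (2.0 * sigma * sigma))
        kernel.append((1.0 - x * x / (sigma * sigma)) * g)
    return kernel


def correlate(signal, kernel):
    """Same-size correlation; out-of-range samples repeat the end values."""
    half = len(kernel) // 2
    n = len(signal)
    out = []
    for i in range(n):
        acc = 0.0
        for j, w in enumerate(kernel):
            idx = min(max(i + j - half, 0), n - 1)
            acc += w * signal[idx]
        out.append(acc)
    return out


sigma = 5
kernel = neg_gauss_second_derivative(sigma)
step = [0.0] * 50 + [1.0] * 50                  # a single step edge
ridge = [0.0] * 45 + [1.0] * 10 + [0.0] * 45    # a ridge of width 2*sigma
step_peak = max(correlate(step, kernel))
ridge_peak = max(correlate(ridge, kernel))
```

With these signals the peak response to the isolated step is a substantial fraction of the peak response to the matched ridge, so taking the maximum over several scales cannot remove it: every scale contributes an edge response of its own.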

One can prevent the response to edges by splitting
Canny's ridge operator into two pieces, one for each edge,
and then combining the two responses by looking at the
minimum of the two responses. Figure 7 outlines the
philosophy of our approach. Figure 8 illustrates how our
filter behaves according to the different criteria outlined
before. The Figure also compares our filter with that
of the second derivative of a Gaussian, which is similar
to the filter used by Canny. There are a number of
potential candidates within this framework, such as
splitting Canny's filter in half, using two edge detectors, and
many others. We tried a number of possibilities on the
Connection Machine on a real and on a synthetic image
with varying degrees of noise. Table 1 presents the filter
which gave a response most similar to the inertia values
and the tolerated length that one would obtain using
the formulas for the corresponding edges as described in
[Subirana-Vilanova 1990].


Var           Description / Expression

P_max         Gradient penalization coeff: (3)
F_s           Filter Side Lobe size coeff: (1/8)
F_c           Local Neighborhood size coeff: (1/8)

g(x)          Color gradient at location x.
g_max         Maximum color gradient in image.

σ             Size of Main Filter Lobe.
σ_s           Size of Side Filter Lobe: F_s σ
σ_c           Reference Color Neighborhood: F_c σ

c(x)          Color vector at x: [R(x) G(x) B(x)]^T
c_n(x)        Normalized Color at x: c(x)/|c(x)|

c_r(x)        Refr. Color:
              \int_{-σ_c}^{σ_c} \frac{1}{\sqrt{2π} σ_c} e^{-r^2/(2σ_c^2)} c_n(x + r) dr

F_L(r)        Left and Right Halves of Filter:
F_R(r)        F_L(r) is defined piecewise on the main lobe,
              -σ < r < σ, and on the side lobe,
              -(σ + σ_s) < r < -σ, and is 0 otherwise;
              F_R(r) = F_L(-r)

I_L(x)        Inertia from Left and Right Halves:
I_R(x)        I_L(x) = \int_{-(σ+σ_s)}^{σ} S(c_r(x), c_n(x + r)) F_L(r) dr
              I_R(x) = \int_{-σ}^{σ+σ_s} S(c_r(x), c_n(x + r)) F_R(r) dr

I_σ(x)        Inertia at location x (Scale σ):
              min(I_L(x), I_R(x))

I(x)          Overall Inertia at location x:
              max over all σ of I_σ(x)

Table 1: Steps for Computing Directional Inertias

Our  approach  uses  two  filters  (see  profile  in  Figure  7), 
each  of  which  looks  at  one  side  of  the  ridge.  The  out¬ 
put  of  the  combined  filter  is  the  minimum  of  the  two 
responses.  Each  of  the  two  parts  of  the  filter  is  asym¬ 
metrical,  reflecting  the  fact  that  we  expect  the  object  to 
be  uniform  (central  part  of  the  filter  is  large),  and  that 
we  do  not  expect  that  a  region  of  equal  size  be  adjacent 
to  the  object  (lateral  part  of  the  filter  is  small).  In  other 
words,  our  ridge  detector  is  not  fooled  by  narrow  valleys. 
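
The effect of taking the minimum of two one-sided responses can be illustrated with a small one-dimensional sketch. It is not the filter of Figure 7: as a simplifying assumption, each half here is just the mean of a box-shaped central lobe minus the mean of one narrow side lobe, and the names `half_responses` and `ridge_response` are hypothetical. Only the side-lobe ratio of 1/8 is borrowed from Table 1.

```python
def mean(values):
    return sum(values) / len(values)

def half_responses(signal, x, sigma, side_coeff=0.125):
    """Return (left, right) one-sided responses at index x for scale sigma.

    Assumed simplification: each half is the mean of the central lobe
    minus the mean of one side lobe of width max(1, sigma * side_coeff).
    """
    side = max(1, int(round(sigma * side_coeff)))
    lo, hi = x - sigma, x + sigma              # central lobe covers [lo, hi]
    if lo - side < 0 or hi + side >= len(signal):
        return None                            # filter does not fit in the signal
    center = mean(signal[lo:hi + 1])
    left = center - mean(signal[lo - side:lo])
    right = center - mean(signal[hi + 1:hi + 1 + side])
    return left, right

def ridge_response(signal, x, sigma):
    """Combined response: the minimum of the two one-sided responses."""
    halves = half_responses(signal, x, sigma)
    return 0.0 if halves is None else min(halves)

ridge = [0.0] * 20 + [1.0] * 9 + [0.0] * 20    # bright bar of width 9, center at index 24
step = [0.0] * 24 + [1.0] * 25                 # a single step edge

ridge_at_center = ridge_response(ridge, 24, 4)
step_max = max(ridge_response(step, x, 4) for x in range(len(step)))
```

With these inputs the bar gives a response of 1 at its center, while the step edge never gives a positive response, because on a step one of the two halves never sees the central lobe brighter than its side lobe.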

The  extension  to  color  is  tricky  because  there  is  no 
clear  notion  of  what  is  positive  and  what  is  negative  in 
vector  quantities.  We  solve  this  problem  by  adaptively 
defining  a  reference  color  at  each  point  as  the  weighted 
average  color  over  a  small  neighborhood  of  the  point 
(about  eight  times  smaller  than  the  scale  of  the  filter). 
Thus,  this  reference  color  will  be  different  for  different 
points in the image. Then, scalar deviations from this
reference  color  are  computed  as  defined  in  section  4. 
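
A minimal sketch of the adaptive reference color, under several assumptions: the image is reduced to a single row of (R, G, B) tuples, the integral is replaced by a discrete sum, the Gaussian weights are renormalized to sum to one, and the function names are hypothetical. The scalar deviation measure of section 4 is not reproduced here.

```python
import math

def normalize(color):
    """Scale an (R, G, B) vector to unit length; leave black unchanged."""
    norm = math.sqrt(sum(ch * ch for ch in color))
    return tuple(ch / norm for ch in color) if norm > 0 else color

def reference_color(row, x, sigma_c):
    """Gaussian-weighted mean of the normalized colors within sigma_c of x."""
    radius = max(1, int(sigma_c))
    total = [0.0, 0.0, 0.0]
    weight_sum = 0.0
    for r in range(-radius, radius + 1):
        if 0 <= x + r < len(row):
            w = math.exp(-(r * r) / (2.0 * sigma_c * sigma_c))
            cn = normalize(row[x + r])
            for k in range(3):
                total[k] += w * cn[k]
            weight_sum += w
    return tuple(t / weight_sum for t in total)

# A bright red stretch followed by a dim green stretch.
row = [(200, 0, 0)] * 10 + [(0, 50, 0)] * 10
ref_red = reference_color(row, 3, 2.0)
ref_green = reference_color(row, 16, 2.0)
```

Because the colors are normalized before averaging, the reference color depends on hue rather than brightness: the bright red and dim green stretches both yield unit-length reference vectors.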

8  Results 

We have tried our filter extensively. Figure 8 shows
that the output of our filter is better than that of a second
derivative of a Gaussian even with the notion of a
reference color. First, our filter localizes all the ridges,
whether the input contains a single ridge, multiple or
step ridges, or noisy ridges. The second derivative of the
Gaussian, instead, fails in the presence of multiple or
step ridges. Second, the scale chosen by our operator
matches the underlying data closely, while the scale chosen
by the second derivative of the Gaussian does not.
This is important because the scale is necessary to
compute the Tolerated Length, which is used in the second
stage of our scheme to find the Curved Inertia Frames
of the image. Third, our filter does not respond to edges
while the second derivative of the Gaussian does.

Figure 8: First column: Different input signals. Second column: Output given by second derivative of the Gaussian.
Third column: Output given by second derivative of the Gaussian using reference color. Fourth column: Output given
by our ridge detector. The First, Second, Fourth and Sixth rows are results of a single scale filter application. The
Third, Fifth and Seventh rows are results of a multiple scale filter application.

Figure 9: Two images. Left: Ribbons image. Right: Blob
image. See inertia surfaces for the ribbon image in Figure
10. Note that our scheme recovers the blob at the right
scale, without the need of specifying the scale.

In  the  previous  paragraph  we  have  discussed  the  one¬ 
dimensional  version  of  our  filter.  The  filter  can  be  ap¬ 
plied  to  two-dimensional  images  at  different  orientations. 
Figure  10  shows  such  output  (aka  inertia  surfaces)  of 
our  filter  for  four  images.  The  two-dimensional  version 
of  the  filter  can  be  used  with  different  degrees  of  elon¬ 
gation.  In  our  experiments  we  used  one  pixel  width  to 
study  the  worst  possible  scenario.  An  elongated  filter 
would  smooth  existing  noise;  however,  large  scales  are 
not good because they smooth the response near discontinuities
and in curved areas of the shape (this can
be overcome by using curved filters [Malik and Gigus
1991]).

The  inertia  surfaces  and  the  tolerated  length  are  the 
output  of  the  first  stage  of  our  scheme.  In  the  second 
stage  we  use  these  to  compute  the  Curved  Inertia  Frames 
(see [Subirana-Vilanova 1990]) as shown in Figures 11
and 12. This skeleton representation is used to grow
the corresponding regions by a simple region growing
process which starts at the skeleton and proceeds outward.
This process is very stable because it can use
global information provided by the frame such as the
average color or the expected size of the enclosing region.
See Figures 11 and 12 for some examples of the
regions  that  are  obtained.  Observe  that  the  shape  of  the 
regions  is  accurate,  even  at  corners  or  junctions. 
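
The region growing step can be sketched as a breadth-first expansion from the skeleton pixels. This is only an illustration under stated assumptions: a grayscale grid stands in for color data, neighbors are 4-connected, a pixel is admitted when it lies within a fixed tolerance of the skeleton's average value, and `grow_region` is a hypothetical name.

```python
from collections import deque

def grow_region(image, skeleton, tolerance):
    """Grow outward from skeleton pixels over pixels close to the skeleton mean."""
    rows, cols = len(image), len(image[0])
    # Global information supplied by the frame: its average value.
    mean = sum(image[r][c] for r, c in skeleton) / len(skeleton)
    region = set(skeleton)
    frontier = deque(skeleton)
    while frontier:
        r, c = frontier.popleft()
        for nr, nc in ((r - 1, c), (r + 1, c), (r, c - 1), (r, c + 1)):
            if (0 <= nr < rows and 0 <= nc < cols
                    and (nr, nc) not in region
                    and abs(image[nr][nc] - mean) <= tolerance):
                region.add((nr, nc))
                frontier.append((nr, nc))
    return region

image = [
    [0, 0, 0, 0, 0],
    [0, 9, 9, 9, 0],
    [0, 9, 9, 9, 0],
    [0, 0, 0, 0, 0],
]
region = grow_region(image, [(1, 2), (2, 2)], tolerance=1)
```

Starting from the two skeleton pixels in the middle column, the region stops exactly at the boundary of the block of 9s, including its corners.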

Thus,  each  stage  of  our  scheme  has  been  tested  exten¬ 
sively.  Note  that  each  region  can  be  seen  as  an  individual 
test  since  the  computations  performed  within  it  are  in¬ 
dependent  of  those  performed  outside  it. 


Figure 10: Inertia surfaces for three images at four orientations
(clockwise 12, 1:30, 3 and 4:30). Note that exactly
the same Lisp code (without changing the parameters) was
used for all the images. Top: Shirt image. Middle: Ribbon
image. Bottom: Person image.

Figure 11: Regions obtained for the person image. The
white curves are the Curved Inertia Frames from which the
regions were recovered.


Figure  12:  Left:  Blob  obtained  using  our  scheme  in  the 
blob  image.  Right:  Most  salient  Curved  Inertia  Frame  ob¬ 
tained  in  the  shirt  image.  Note  that  our  scheme  recovers 
the  structures  at  the  right  scale,  without  the  need  of  chang¬ 
ing  any  parameters. 

9  Brightness  and  Edges 

We  have  implemented  our  scheme  on  the  Connection 
Machine for color. The scheme can be extended naturally
to brightness and texture (using filter-based approaches
applied to the image, see [Turner 1986], [Malik
and Perona 1989] and [Bovik, Clark and Geisler 1990]).
The  more  cues  a  system  uses,  the  more  robust  it  will  be. 
In  fact,  brightness  is  crucial  in  some  situations  because 
brightness  boundaries  do  not  always  come  together  with 
color  boundaries  (e.g.  cast  shadows). 

But should these different schemes be applied independently?
Consider a situation in which a surface is
defined by an isoluminant color edge on one side and by
a brightness edge (which is not a color edge) on the other.
Our scheme would not recover this surface because the
two sides of our filter would fail (on one side for the
brightness module and on the other for the isoluminant
one). We believe that a combined filter should be used
to obtain the inertia values and the tolerated length in
this case. The second stage would then be applied only
to one set of values. Instead of having a filter with two
sides, our new combined filter should have four sides,
with two responses on each side: one for color, $R_{c,i}$,
and one for brightness, $R_{b,i}$. The combined response
would then be
$\min(\max(R_{b,left}, R_{c,left}), \max(R_{b,right}, R_{c,right}))$.
In addition, there are a number of situations in which
edges are clearly necessary, as in line drawings or in the
Kanizsa figures. Surfaces are also perceived but it is hard
to believe that discontinuities do not play a role in this
case.
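
The four-sided combination proposed above can be written out directly. In this sketch the response values are invented for illustration and `combined_response` is a hypothetical name; the point is only that taking the maximum over cues on each side, before the minimum over sides, recovers a surface that each single-cue filter misses.

```python
def combined_response(brightness, color):
    """min over sides of the max over cues; each argument is (left, right)."""
    left = max(brightness[0], color[0])
    right = max(brightness[1], color[1])
    return min(left, right)

# A surface bounded by a brightness edge on the left and an
# isoluminant color edge on the right (made-up response values).
brightness = (0.8, 0.0)
color = (0.0, 0.7)

single_brightness = min(brightness)   # brightness-only filter fails on the right
single_color = min(color)             # color-only filter fails on the left
combined = combined_response(brightness, color)
```

Here both single-cue responses are zero, while the combined response is 0.7.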

10  Discussion 

Our  scheme  solves  the  problem  of  finding  different  re¬ 
gions  by  looking  at  the  large  structures  one  by  one. 
The larger structures are the first to be recovered;
this cuts small structures that are covered by larger
structures into different parts. This embodies the constraint
that larger structures tend to be perceived as
occluding surfaces [Petter 1956]. (See Figure 13.)

As mentioned in [Subirana-Vilanova 1991], the emphasis
of C.I.F. is towards finding large structures. However,
this may be misleading as evidenced by Figure 14.
The interesting structure is not composed of individual
elements that pop out from the background. Instead, in
this case, what seems to capture our attention can be
described as what is not large. That is, looking for the
large structures and finding what is left would recover
the interesting structure as if we were getting rid of
the background. It is unclear, though, whether this
observation would hold in general. Future research is necessary.

Figure 13: Large shapes occlude small ones. From
[Kanizsa 1979].

Figure 14: Small structures, whether edges or regions, are
sometimes more salient. Left: Drawing by Miró. Right:
From [Rock 1984].

11  What’s  New 

In  this  paper  we  have  argued  that  early  visual  process¬ 
ing  should  seek  representations  that  make  regions  ex¬ 
plicit,  not  just  edges.  Furthermore,  we  have  argued  that 
region  representations  should  be  computed  directly  on 
the  image  (i.e.  not  directly  from  discontinuities).  These 
suggestions  can  be  taken  further  to  imply  that  an  atten- 
tional  “coordinate”  frame  (which  corresponds  to  one  of 
the  perceptual  groups  obtained)  is  imposed  in  the  image 
prior  to  constructing  a  description  for  recognition.  See 
also  [Subirana-Vilanova  and  Richards  1991]. 

We  have  provided  evidence  in  favor  of  our  sugges¬ 
tions  by  considering  a  number  of  issues  related  to  edge- 
detection  such  as  stability,  scale,  data  representation,  hu¬ 
man  perception,  perceptual  organization,  junctions  and 
corners. 

Our  model  suggests  that  vision  should  start  by  com¬ 
puting  a  set  of  features  all  over  the  image  (corresponding 
to  the  inertia  values  and  the  tolerated  length).  This  can 
be  thought  of  as  “smart”  convolutions  of  the  image  with 
suitable  filters  plus  some  simple  non-linear  processing. 
In fact, recently, remarkable success has been achieved
by filter-based approaches to texture [Knuttson and
Granlund 1983], [Turner 1986], [Fogel and Sagi 1989],
[Malik and Perona 1989], [Bovik, Clark and Geisler
1990], stereo [Kass 1983], [Jones and Malik 1990], brightness
edge detection [Canny 1986], [Morrone, Owens and
Burr 1987, 1990], [Freeman and Adelson 1990] and motion
[Heeger 1988]. (See also [Abramatic and Faugeras
1982], [Morrone and Owens 1987].) Our proposal differs
from theirs in that we use the output of the filters to
look for regions, not discontinuities.

This  has  been  the  motivation  for  designing  a  new  non¬ 
linear  filter  for  ridge-detection.  Our  ridge  detector  has  a 
number  of  advantages  over  previous  ones  since  it  selects 
the appropriate scale at each point in the image, does
not respond to edges, can be used with brightness as
well as color data, is tolerant to noise, and can handle
narrow  valleys  and  multiple  steps. 

The  resulting  scheme  can  segment  an  image  without 
making explicit use of discontinuities and is computationally
efficient on the Connection Machine (takes time
proportional  to  the  size  of  the  image).  Our  scheme 
works  properly  near  junctions  and  corners  because  it 
uses  global  information.  The  performance  of  the  scheme 
can  in  principle  be  attributed  to  a  number  of  interven¬ 
ing  factors;  but  we  believe  that  the  critical  aspect  of  the 
scheme  (and  one  of  the  contributions  of  this  paper)  is 
our  ridge-detector.  Running  the  scheme  on  the  edges 
or using simple Gabor filters would not yield comparable
results.  The  effective  use  of  color  makes  the  scheme  very 
robust  but  we  believe  that  comparable  results  would  be 
obtained on brightness or texture data.

Abstract

Visual processing tasks often require the
matching of contours in two images. Examples
include determining optical flow, matching
features for alignment-based object recognition,
and finding correspondence for long-range
and apparent motion. We propose a
scheme for matching partially constrained contours
in two images using local affine transformations.
This new scheme is motivated in part
by existing ideas for both recovering optical
flow and matching features for object recognition.
The scheme approximates the structure
of orthographically projected contours with
planar patches. Each patch is determined using
oriented elliptical Gaussian neighborhoods
that smoothly integrate information over proximally
connected contours at several spatial
scales. At the largest scale satisfying available
constraints, a minimal solution mechanism employs
a modified general pseudoinverse to predict
matches that are closest to the simplest
purely translational correspondence. This realization
avoids previously encountered difficulties
that are typically caused by oversimplifying
assumptions regarding contour transformations,
lack of efficient implementation, and
instability due to singular conditions. Additional
advantages include the ease with which
the proposed scheme can incorporate the influence
of specially matched features. These
properties are verified by simulation results
obtained on noisy synthetic and natural imagery.

1  Introduction 

At the heart of many visual processing tasks lies the
determination of correspondence between features in two
images. For example, correspondence is at least implicitly
required in stereopsis, structure from motion, and
object recognition. Often the matching is underconstrained
because the features can be matched in more
than one way. Furthermore, the search space for possible
matches may be prohibitively large. Assumptions,
usually derived from regularities of the physical world,
are therefore necessary to constrain matching solutions
uniquely. For example, in stereopsis one can constrain
the search to small areas along an epipolar line by
exploiting uniqueness, continuity, coarse to fine approximations,
and probabilistic assumptions about disparity
[23, 24, 28].

Figure 1: To find a point-to-point correspondence between
contours in two images we are given constraint
lines for several points in the first image. Each constraint
line narrows down the match of a point in the first
image to a line in the second.

We  address  the  general  problem  of  matching  partially 
constrained  contours  in  a  pair  of  images.  As  illustrated 
by  Figure  1,  we  are  given:  (1)  an  image  containing  a  set 
of  contours;  (2)  a  second  image  obtained  by  arbitrarily 
and  independently  displacing,  stretching,  rotating,  scal¬ 
ing,  and  distorting  these  contours;  and,  (3)  matching 
constraint  lines  for  several  points  along  the  first  set  of 
contours constraining the match for each of these points
to  a  line  in  the  second  image.  These  three  sources  of  in¬ 
formation  are  application  specific.  The  contour  images 
may,  for  instance,  be  obtained  by  applying  an  edge  find¬ 
ing  algorithm  to  either  sampled  time-varying  imagery 
or  two  different  views  of  a  scene.  Furthermore,  the  con¬ 
straint  lines  accompanying  these  two  images  can  often  be 
estimated  for  many  constrained  contour  matching  appli¬ 
cations. 

One such application is the measurement of visual
motion. In determining optical flow along isobrightness
contours, the tangential component of motion must be
constrained by additional assumptions due to the aperture
problem [10, 17, 1, 14]. Because local measurements
of motion capture the component of image velocity only
in the direction normal to isobrightness contours, the
match for a point in one image frame is constrained to
a line in the next image frame.

Another  problem  requiring  the  matching  of  contours 
arises  in  alignment-based  methods  for  object  recognition. 
In  these  methods,  global  transformations  between  model 
and  object  views  are  compensated  for  by  a  normalization 
stage,  which  aligns  the  two  views  and  allows  subsequent 
comparison.  These  views  can  be  represented  by  con¬ 
tours.  Though  a  major  benefit  of  alignment  approaches 
is  the  avoidance  of  exponential  search  by  compensat¬ 
ing  for  possible  transformations  before  verification  [18], 
alignment schemes require that object and model features
be matched at some stage. (For a broader overview
of the alignment approach, see [31, 18, 6].) Assuming
that  model  and  object  views  can  be  roughly  aligned  using 
global  image  properties  or  several  pre-matched  “anchor 
points”  [18],  we  claim  that  heuristic  constraint  lines  can 
be  estimated  for  each  object  contour  point  by  the  tan¬ 
gent  at  the  closest  model  contour  point  [4].  (The  main 
inaccuracies  in  this  method  will  occur  for  constraint  lines 
at  high  curvature  points,  which  are  simply  ignored  in 
the  matching  process.)  A  similar  method  can  be  used 
for  finding  correspondence  for  long-range  and  apparent 
motion. 

This paper describes a new scheme for solving the
constrained contour matching problem using local affine
transformations. First, we briefly review several previous
methods for both determining optical flow and finding
matches for object recognition. Second, we describe
the new scheme, which reworks some of the previous
ideas  into  a  robust,  accurate,  and  efficient  framework. 
Following  the  full  development  of  this  scheme,  we  dis¬ 
cuss  simulation  results  obtained  on  noisy  synthetic  and 
natural  imagery. 

2  Previous  Matching  Ideas 

In  computing  optical  flow,  several  methods  have  as¬ 
sumed  that  the  motion  of  patterns  can  be  described 
at  least  locally  by  pure  translation  [21,  27,  1].  How¬ 
ever,  while  this  assumption  accurately  tracks  objects  in 
the  case  of  observer  motion,  a  more  general  assumption 
is  necessary  to  account  for  objects  undergoing  combi¬ 
nations  of  translation  with  three-dimensional  rotation, 
scaling,  and  deformation.  Other  methods  find  the  veloc¬ 
ity  field  with  the  least  variation  that  is  consistent  with 
local  motion  measurements  [17,  25,  2,  11,  19].  When 
implemented  along  contours  [12,  13,  26]  such  schemes 
can  avoid  smoothing  over  sharp  discontinuities  in  the 
motion  field.  However,  they  usually  require  slowly  con¬ 
verging  iterative  methods,  such  as  the  conjugate  gradi¬ 
ent  method,  to  efficiently  recover  a  global  solution  con¬ 
strained  by  hundreds  of  linear  equations  [13].  A  third 
approach  is  the  assumption  that  objects  in  motion  can 
be  described  by  planar  patches  [33].  In  this  approach 
the  second  order  terms  of  the  Taylor  series  expansion 
of  optical  flow  are  determined  by  satisfying  the  normal 
flow constraints within fixed image neighborhoods. However,
because the size of these neighborhoods remains
fixed, highly non-planar segment boundaries cannot be
accurately described. Recent affine motion models that
include methods for selecting neighborhood sizes using
Gaussian scale pyramids require iteration [8, 7]. Finally,
a  major  drawback  of  all  these  methods  is  the  existence  of 
singular  conditions  for  which  no  unique  solution  exists. 
Solutions  typically  become  highly  unstable  (sensitive  to 
noise)  as  objects  approach  such  configurations.  For  ex¬ 
ample,  the  motion  of  near-linear  contours  is  ambiguously 
interpreted  by  all  of  these  methods. 

In  object  recognition,  several  matching  schemes  em¬ 
ploy  correlation  measures  or  examine  angles  and  inter¬ 
sections  between  contours.  Such  methods  either  assume 
perfect  alignment,  or  conduct  a  complete  search  through 
feature  space.  They  cannot  deal  with  un-parameterized 
distortions,  or  normalization  inaccuracies  prevalent  in 
a  roughly  aligned  contour  image.  They  also  typically 
lack  efficient  implementation.  Recently,  a  number  of 
methods  have  been  devised  for  comparing  contours  using 
affine  invariant  curvature  [9],  arc  length  [3],  or  moments 
[15]. The main drawback of these methods is that they
are global measures, applicable only to complete, unoccluded
curves. Furthermore, as the objects producing
the contour image become less planar, the affine invariance
becomes less valid for describing transformations.
A second disadvantage is that the calculation of these
affine invariant quantities requires a high degree of
differentiation along contours, implying both that contour
tracing, contour thinning, and contour enhancement be
performed in order to ensure a unique path along the
contour, and that contour smoothing be performed in
order to avoid large errors in the differentiation due to
noise. Most important, though, is that all of these
recognition techniques are used primarily as verification
steps  taken  after  a  matching  has  already  been  hypothe¬ 
sized  by  exponential  search  or  inaccurate  heuristics.  This 
is  also  true  of  other  affine  matching  techniques  [20,  32]. 

3  The  Proposed  Solution 

Despite  the  inadequacies  of  the  previous  attempts,  we 
selectively  incorporate  several  of  these  ideas  to  solve  the 
constrained  contour  matching  problem.  In  addition,  new 
thought  is  given  to  many  key  issues,  such  as  finding 
the  spatial  extent  over  which  these  ideas  are  valid.  To 
address  these  issues,  we  develop  a  scheme  for  match¬ 
ing  constrained  contours  that  is  based  on  local  affine 
transformations.  It  constrains  the  matching  by  assuming 
that  contours  are  the  orthographic  projections  of  locally 
coplanar  points,  thereby  reducing  the  recovery  of  corre¬ 
spondence  to  a  local,  linearly  constrained,  non-iterative 
calculation. 

First,  the  transformations  of  contours  within  local  im¬ 
age  neighborhoods  are  assumed  to  be  affine.  We  implic¬ 
itly  find  the  affine  transformation  through  a  least  squares 
fit  to  available  match  constraint  lines  and  pre-matched 
points.  Second,  local  neighborhoods  are  constructed  so 
that  constraint  line  information  is  smoothly  integrated 
primarily  over  proximal  connected  contours.  This  is  ac¬ 
complished  by  using  elliptical  Gaussian  neighborhoods 
oriented along the contour. Third, we consider several
neighborhoods of differing sizes simultaneously for
a given point. From these neighborhoods, we choose the
largest neighborhood within which an affine transformation
can accurately satisfy available constraints. Finally,
a  stable,  unique  solution  is  guaranteed  for  the  chosen 
neighborhood  by  using  a  modified  general  pseudoinverse 
to  find,  subject  to  the  constraints,  the  matches  that  de¬ 
viate  the  least  from  the  smallest  purely  translational  cor¬ 
respondence. 

4  Matching  Globally  Planar  Contours 

Let us first assume that the transformation between two
image contours can be described by a global affine mapping.
That is, each contour point $\mathbf{p}_c = (x_c, y_c)$ in the
first image maps to contour point $\mathbf{p}'_c = (x'_c, y'_c)$ in the
second image by the linear equation

$\mathbf{p}'_c = A\mathbf{p}_c + \mathbf{t}$  (1)

where the 2 x 2 matrix $A$ accounts for the two-dimensional
shearing, scaling, and rotation in the image
plane about the global origin, $\mathbf{p}_o = (0, 0)$, and the vector
$\mathbf{t}$ accounts for two-dimensional translation in the image
plane. Next, consider the contours in Figure 2. Suppose
that the exact location of $\mathbf{p}'_c$ is unknown, but that it lies
along some known constraint line in the image. If $\mathbf{n}$ is
the perpendicular from $\mathbf{p}_c$ to the constraint line, and $\hat{\mathbf{n}}$ is
the unit normal to the constraint line, then the distance
of $\mathbf{p}'_c$ from the constraint line is

$\hat{\mathbf{n}}^T(\mathbf{p}_c + \mathbf{n} - \mathbf{p}'_c) = 0.$  (2)

Substituting (1) into (2) yields

$(A\mathbf{p}_c)^T\hat{\mathbf{n}} + \mathbf{t}^T\hat{\mathbf{n}} = \mathbf{p}_c^T\hat{\mathbf{n}} + |\mathbf{n}|.$  (3)

Therefore, we obtain one linear equation constraining
the six parameters of the affine transformation for each
point. Let us represent the six affine parameters as a
vector

$\mathbf{a} = [\,A_{00}\ A_{10}\ A_{01}\ A_{11}\ t_x\ t_y\,]^T.$  (4)

Then for each contour point $\mathbf{p}_{c_i}$ with constraint normal
$\mathbf{n}_i$, equation (3) can be rewritten as

$\mathbf{c}_i \mathbf{a} = d_i$  (5)

where

$\mathbf{c}_i = [\,\hat{n}_{x_i} x_{c_i}\ \ \hat{n}_{x_i} y_{c_i}\ \ \hat{n}_{y_i} x_{c_i}\ \ \hat{n}_{y_i} y_{c_i}\ \ \hat{n}_{x_i}\ \ \hat{n}_{y_i}\,]$  (6)

$d_i = \mathbf{p}_{c_i}^T\hat{\mathbf{n}}_i + |\mathbf{n}_i|.$  (7)

Thus, for a system of k constrained points and associated
equations,

$C\mathbf{a} = \mathbf{d}$  (8)

where $C$ is a k x 6 matrix with rows $\mathbf{c}_1 \ldots \mathbf{c}_k$, and $\mathbf{d}$ is
a vector with elements $d_1 \ldots d_k$. Note that (8) should
be solved in the least squares sense when there are more
than six independent equations. Specifically, we solve
the system

$C^T C\mathbf{a} = C^T\mathbf{d}.$  (9)

This equation predicts the match for any point, even an
underconstrained point, provided that an affine transformation
is uniquely determined by existing constraint
lines in the image.
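
The least squares fit from constraint lines can be sketched in a few lines of pure Python. Several things here are assumptions made for illustration: the parameter ordering (x' = a0 x + a1 y + a4, y' = a2 x + a3 y + a5), the small Gaussian elimination routine used to solve the normal equations, the synthetic data, and all function names. Each right-hand side is generated as the projection of the true match onto the unit normal, which is what the constraint line equation requires when the line passes through the true match.

```python
def solve(matrix, rhs):
    """Solve a small square linear system by Gaussian elimination with pivoting."""
    n = len(rhs)
    aug = [list(row) + [b] for row, b in zip(matrix, rhs)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(aug[r][col]))
        aug[col], aug[pivot] = aug[pivot], aug[col]
        for r in range(col + 1, n):
            factor = aug[r][col] / aug[col][col]
            for k in range(col, n + 1):
                aug[r][k] -= factor * aug[col][k]
    x = [0.0] * n
    for r in reversed(range(n)):
        tail = sum(aug[r][k] * x[k] for k in range(r + 1, n))
        x[r] = (aug[r][n] - tail) / aug[r][r]
    return x

def constraint_row(point, normal):
    """Row of C for one constraint line with unit normal (nx, ny)."""
    (x, y), (nx, ny) = point, normal
    return [nx * x, nx * y, ny * x, ny * y, nx, ny]

def fit_affine(points, normals, offsets):
    """Least squares affine parameters from the normal equations C^T C a = C^T d."""
    C = [constraint_row(p, n) for p, n in zip(points, normals)]
    CtC = [[sum(row[i] * row[j] for row in C) for j in range(6)] for i in range(6)]
    Ctd = [sum(row[i] * d for row, d in zip(C, offsets)) for i in range(6)]
    return solve(CtC, Ctd)

def apply_affine(a, point):
    x, y = point
    return (a[0] * x + a[1] * y + a[4], a[2] * x + a[3] * y + a[5])

# Synthetic test: one constraint line per point, generated from a known map.
true_a = [1.1, 0.2, -0.1, 0.9, 3.0, -2.0]
points = [(0, 0), (4, 1), (1, 5), (6, 6), (-3, 2), (2, -4), (5, -2), (-2, -5)]
normals = [(1.0, 0.0)] * 3 + [(0.0, 1.0)] * 3 + [(0.6, 0.8)] * 2
offsets = []
for p, (nx, ny) in zip(points, normals):
    qx, qy = apply_affine(true_a, p)
    offsets.append(nx * qx + ny * qy)

estimate = fit_affine(points, normals, offsets)
```

No point in this example has a fully known match; each contributes a single line constraint, yet the eight constraints together determine all six parameters. If all normals were parallel, the normal equations would be singular, which is the situation the later sections address.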


5 Incorporating Specially Matched Points

In some matching problems, the exact point-to-point
correspondence may be known for some special contour
points, such as corners, terminators, high curvature
points, inflection points, and isolated points. To incorporate
the influence of a special point match between $\mathbf{p}_s$
and $\mathbf{p}'_s$, we minimize the distance,

$|A\mathbf{p}_s + \mathbf{t} - \mathbf{p}'_s| = 0.$  (10)

The two constraint equations for each special point $\mathbf{p}_{s_i}$
are therefore given by

$\begin{bmatrix} \mathbf{s}_{i1} \\ \mathbf{s}_{i2} \end{bmatrix} \mathbf{a} = \mathbf{p}'_{s_i}$  (11)

where

$\mathbf{s}_{i1} = [\,x_{s_i}\ y_{s_i}\ 0\ 0\ 1\ 0\,]$
$\mathbf{s}_{i2} = [\,0\ 0\ x_{s_i}\ y_{s_i}\ 0\ 1\,].$  (12)

Note that these two equations are equivalent to two perpendicular
constraint lines that intersect at $\mathbf{p}'_{s_i}$. Again,
for a system of q special points, we have

$S\mathbf{a} = \mathbf{g}$  (13)

where $S$ is a 2q x 6 matrix consisting of rows
$\mathbf{s}_{11}, \mathbf{s}_{12}, \ldots, \mathbf{s}_{q1}, \mathbf{s}_{q2}$, and $\mathbf{g}$ is the length 2q concatenation
of $\mathbf{p}'_{s_1} \ldots \mathbf{p}'_{s_q}$. Finally, finding the best affine transformation
amounts to solving the least squares relation

$S^T S\mathbf{a} = S^T\mathbf{g}.$  (14)

Since both $C^T C$ and $S^T S$ are positive semidefinite,
we can combine the two minimization equations, (9) and
(14), and find $\mathbf{a}$ by solving

$((1 - \alpha)C^T C + \alpha S^T S)\,\mathbf{a} = (1 - \alpha)C^T\mathbf{d} + \alpha S^T\mathbf{g}$  (15)

where $\alpha$, a number between 0 and 1, is the accuracy
of special point matches relative to the accuracy of contour
point constraint lines. Again, solving this system of
equations for $\mathbf{a}$ determines the best global affine transformation
about the global origin, $\mathbf{p}_o$.
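
As a small illustration of how the two sources of constraints are blended, the sketch below assembles the combined system for one constraint line and one pre-matched point and checks that a known transformation satisfies it. The helper names, the parameter ordering (x' = a0 x + a1 y + a4, y' = a2 x + a3 y + a5), and the data are assumptions for illustration; solving the resulting 6 x 6 system is left to any linear solver.

```python
def gram(rows):
    """Return rows^T rows for a list of 6-element rows."""
    return [[sum(r[i] * r[j] for r in rows) for j in range(6)] for i in range(6)]

def project(rows, values):
    """Return rows^T values."""
    return [sum(r[i] * v for r, v in zip(rows, values)) for i in range(6)]

def special_rows(point):
    """The two rows contributed by one pre-matched (special) point."""
    x, y = point
    return [[x, y, 0, 0, 1, 0], [0, 0, x, y, 0, 1]]

def combine(C, d, S, g, alpha):
    """Assemble (1-alpha) C^T C + alpha S^T S and (1-alpha) C^T d + alpha S^T g."""
    CtC, StS = gram(C), gram(S)
    Ctd, Stg = project(C, d), project(S, g)
    M = [[(1 - alpha) * CtC[i][j] + alpha * StS[i][j] for j in range(6)]
         for i in range(6)]
    v = [(1 - alpha) * Ctd[i] + alpha * Stg[i] for i in range(6)]
    return M, v

true_a = [1.0, 0.5, -0.5, 1.0, 2.0, 1.0]

def match(point):
    x, y = point
    return (true_a[0] * x + true_a[1] * y + true_a[4],
            true_a[2] * x + true_a[3] * y + true_a[5])

# One constraint line with unit normal (1, 0) and one special point.
cp = (2.0, 3.0)
C = [[cp[0], cp[1], 0.0, 0.0, 1.0, 0.0]]
d = [match(cp)[0]]
sp = (4.0, -1.0)
S = special_rows(sp)
g = list(match(sp))

M, v = combine(C, d, S, g, alpha=0.75)
residual = [sum(M[i][j] * true_a[j] for j in range(6)) - v[i] for i in range(6)]
```

With exact data the true parameters satisfy the blended system for any value of alpha; alpha only matters once the two kinds of constraints disagree, where it sets how strongly the special point matches are trusted.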

6 Matching Locally Planar Contours

Since contours are generally perspective projections of
non-planar, non-rigid objects, (15) accurately describes
matches only locally. An attractive way to locally enforce
the affine assumption involves constructing a local
coordinate system at each contour point, $\mathbf{p}_j$. First, the
location of each point, $\mathbf{p}_i$, is measured with respect to
the local origin, $\mathbf{p}_j$, instead of the global origin, $\mathbf{p}_o$. To
make the calculations local we also weigh constraints for
each point, $\mathbf{p}_i$, by some locality measure, $w_i$. The closer
$\mathbf{p}_i$ is to $\mathbf{p}_j$, the larger this weight.

Employing the method of weighted least squares to incorporate
this weighting scheme into (15), we find the local
affine transformation, and hence the match, for each
contour point, $\mathbf{p}_j$, by satisfying a system of equations

$R\mathbf{a} = \mathbf{l}$  (16)

where

$R = (1 - \alpha)C^T W_c^T W_c C + \alpha S^T W_s^T W_s S$  (17)

$\mathbf{l} = (1 - \alpha)C^T W_c^T W_c \mathbf{d} + \alpha S^T W_s^T W_s \mathbf{g}.$  (18)

The diagonal matrices $W_c$ and $W_s$ establish the local
neighborhood at $\mathbf{p}_j$ and are given by

$W_c = \mathrm{diag}(w_{c_1} \ldots w_{c_k})$
$W_s = \mathrm{diag}(w_{s_1} \ldots w_{s_q})$  (19)

where $c_1 \ldots c_k$ and $s_1 \ldots s_q$ are the indices for contour
and special points respectively. The 6 x 6 matrix, $R$, and
the six-dimensional vector, $\mathbf{l}$, can both be written explicitly
in terms of point locations, normals, and weights by
expanding the matrix definitions. Therefore, the elements
of these matrices can easily be calculated in parallel
for each local neighborhood using simple adders.

7  Oriented  Elliptical  Neighborhoods 

The  weight  for  each  point  determines  the  relative  influ¬ 
ence  of  its  constraints  upon  the  solution.  The  set  of  all 
point  weights  therefore  determines  the  extent  and  shape 
of  the  local  neighborhood.  We  determine  the  weights 
according  to  several  neighborhood  criteria.  First,  the 
neighborhood  integration  should  monotonically  decrease 
with  distance  from  the  local  origin,  since  distant  points 
are less likely to be coplanar with $\mathbf{p}_j$. Second, the neighborhood
should be smooth so that matching solutions
vary  continuously  along  contours.  (Note,  however,  that 
this  will  not  mandate  the  smoothest  solution  along  the 
contour.)  Finally,  the  neighborhood  should  be  maxi¬ 
mally  elongated  and  oriented  along  the  contour  in  order 
to  integrate  information  primarily  along  connected  prox¬ 
imal  contours  without  serial  contour  tracing. 

With  these  criteria  in  mind,  we  suggest  that  the 
weight  for  p{  actually  be  the  product  of  two  Gaussian 
weights: 

Wj  =  yiPi  (20) 

as  iUustrated  in  Figure  3.  The  first  weight,  ji,  is  given 
by 

7i  =  exp  (-|p<|V2<r*).  (21) 

The  set  of  all  7’s  constructs  a  circularly  symmetric  Gaus¬ 
sian  neighborhood  about  the  local  origin.  The  standard 


Figure  3:  (a)  Integration  along  nearly  linear  contour 
segments  is  primarily  along  the  contour,  while  (b)  in¬ 
tegration  over  symmetric  sections  of  the  contour  is  cir¬ 
cular.  Oriented  local  neighborhoods  are  determined  by 
the  product  o/G(|pj|),  a  Gaussian  of  the  distance  from 
the  local  origin  {pjJ,  and  G(Si),  a  Gaussian  of  the  dis¬ 
tance  from  the  axis  of  local  orientation  passing  through 
Pj .  The  widA  of  the  latter  Gaussian  is  determined  by 
the  strength  of  the  local  orientation. 


deviation  of  this  Gaussian,  ay,  is  the  effective  sise  of  this 
neighborhood.  The  second  weight,  /*<,  is  given  by 

Pi  =  exp  {-Sf/2al)  (22) 


where $\delta_i$ is the perpendicular distance from $p_i$ to the axis of local orientation passing through $p_j$. It modulates the circularly symmetric Gaussian and orients the resulting elliptical neighborhood to integrate information primarily over connected contours. The width, $\sigma_\mu$, of this Gaussian ranges between 0 and $\infty$, corresponding to maximal local orientation and circular symmetry, respectively. Hence, the stronger the local orientational preference of the contour, the higher the aspect ratio of the elliptical neighborhood, and the larger the relative influence of points that lie closer to the local axis of orientation.
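The product weight of (20) through (22) can be sketched in a few lines. This is only an illustration under stated assumptions: the function name `neighborhood_weight`, the argument names, and the handling of the limiting widths 0 and infinity are ours, and points are assumed to be given in local coordinates with the local origin at (0, 0).

```python
import math

def neighborhood_weight(p, theta_perp, sigma_gamma, sigma_mu):
    """Return w = gamma * mu for a point p = (x, y) in local coordinates.

    theta_perp is the unit normal to the axis of local orientation.
    All names here are illustrative, not taken from the paper.
    """
    x, y = p
    # Circularly symmetric Gaussian of the distance from the local origin.
    gamma = math.exp(-(x * x + y * y) / (2.0 * sigma_gamma ** 2))
    # Perpendicular distance from the axis of local orientation.
    delta = theta_perp[0] * x + theta_perp[1] * y
    if sigma_mu == 0.0:
        # Limiting case of maximal orientation: only on-axis points count.
        mu = 1.0 if delta == 0.0 else 0.0
    else:
        # sigma_mu = math.inf gives mu = 1 (circular symmetry).
        mu = math.exp(-(delta * delta) / (2.0 * sigma_mu ** 2))
    return gamma * mu
```

With the axis along x (normal (0, 1)), a point at (3, 0) keeps its full circular weight, while a point at (0, 3) is attenuated further by the second Gaussian.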

We now determine the parameters $\delta_i$ and $\sigma_\mu$, keeping in mind that the oriented neighborhood must be narrow for linear contours and wide for circularly symmetric contours. To capture this notion, we propose that the major and minor axes of the elliptical neighborhood be respectively aligned with the axes of minimum and maximum local inertia and proportional to the maximum and minimum local second moments. These quantities can be derived through principal component analysis of the inertia matrix


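$$J = \begin{bmatrix} J_{xx} & J_{xy} \\ J_{xy} & J_{yy} \end{bmatrix} \qquad (23)$$

where

$$J_{xx} = \sum_i \gamma_i x_i^2, \quad J_{xy} = \sum_i \gamma_i x_i y_i, \quad J_{yy} = \sum_i \gamma_i y_i^2 \qquad (24)$$

are the second local moments about the local $y$, $x$-$y$, and $x$ axes respectively. First, the second moment extremes are simply the eigenvalues of $J$,

$$\lambda_{max} = \frac{J_{xx} + J_{yy} + \rho}{2}, \quad \lambda_{min} = \frac{J_{xx} + J_{yy} - \rho}{2} \qquad (25)$$

where

$$\rho = \sqrt{(J_{xx} - J_{yy})^2 + 4J_{xy}^2}. \qquad (26)$$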

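Since the major axis is simply $\sigma_\gamma = \beta\lambda_{max}$, the minor axis is $\beta\lambda_{min} = \sigma_\gamma\lambda_{min}/\lambda_{max}$ ($\beta$ being some proportionality constant). This desired minor axis is the effective width of the modulated Gaussian in the direction perpendicular to the axis of orientation, or

$$\left(\frac{1}{\sigma_\gamma^2} + \frac{1}{\sigma_\mu^2}\right)^{-1/2} = \sigma_\gamma\frac{\lambda_{min}}{\lambda_{max}}. \qquad (27)$$

Therefore,

$$\sigma_\mu = \frac{\sigma_\gamma\lambda_{min}}{\sqrt{\lambda_{max}^2 - \lambda_{min}^2}}. \qquad (28)$$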


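Next, $\delta_i$ can be found by dotting $p_i$ with the unit normal to the axis of local orientation, which is simply the unit eigenvector of $J$ corresponding to the eigenvalue $\lambda_{min}$:

$$J\Theta_\perp = \lambda_{min}\Theta_\perp, \quad |\Theta_\perp| = 1. \qquad (29)$$

Hence,

$$\delta_i = \Theta_\perp^T p_i \qquad (30)$$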

(see  [16]  for  a  more  complete  analysis  of  using  the  prin¬ 
cipal  component  method  to  find  the  orientational  pref¬ 
erence  of  binary  images). 
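The principal component step can be illustrated with a short sketch. It rests on several assumptions that are ours rather than the paper's: the second moments are taken to be weighted by the circular Gaussian, the normal is obtained from the closed-form orientation angle of a symmetric 2 x 2 matrix, and the width of the modulating Gaussian is derived by assuming that the perpendicular width of the product of the two Gaussians combines as 1/sigma^2 = 1/sigma_gamma^2 + 1/sigma_mu^2. The function name `oriented_neighborhood` is hypothetical.

```python
import math

def oriented_neighborhood(points, sigma_gamma):
    """Return (lam_max, lam_min, theta_perp, sigma_mu) for local points.

    points are (x, y) pairs relative to the local origin. The weighting
    and the sigma_mu formula are assumptions made for this sketch.
    """
    jxx = jxy = jyy = 0.0
    for x, y in points:
        g = math.exp(-(x * x + y * y) / (2.0 * sigma_gamma ** 2))
        jxx += g * x * x
        jxy += g * x * y
        jyy += g * y * y
    # Eigenvalues of the symmetric 2 x 2 second-moment matrix.
    rho = math.hypot(jxx - jyy, 2.0 * jxy)
    lam_max = 0.5 * (jxx + jyy + rho)
    lam_min = 0.5 * (jxx + jyy - rho)
    # Angle of the eigenvector belonging to lam_max (axis of orientation).
    phi = 0.5 * math.atan2(2.0 * jxy, jxx - jyy)
    # Unit normal to that axis, i.e. the eigenvector belonging to lam_min.
    theta_perp = (-math.sin(phi), math.cos(phi))
    # Assumed width rule: minor axis = sigma_gamma * lam_min / lam_max.
    gap = lam_max ** 2 - lam_min ** 2
    sigma_mu = math.inf if gap <= 0.0 else sigma_gamma * lam_min / math.sqrt(gap)
    return lam_max, lam_min, theta_perp, sigma_mu
```

Under these assumptions a straight contour gives a width of zero (maximal orientation), and a neighborhood with equal second moments gives an infinite width (circular symmetry), matching the two limits described in the text.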

Note that, although there may be problems with finding the orientation of a contour that is surrounded by several other contours, surrounding contours should not in general affect the final solution as long as we have chosen a sufficiently small neighborhood in which to carry out the calculations. This issue is addressed next.


8  Choosing  From  Multiple 
Neighborhood  Scales 

If the local neighborhood is too small, the affine transformation may not be determined uniquely and is therefore easily degraded by noise. On the other hand, if the neighborhood is too large, the local planarity assumption becomes less accurate, and we also run the risk of including independently matched contours within the same neighborhood. We determine the optimal scale of each local neighborhood by simultaneously examining the constraints for several spatial scales (several values of $\sigma_\gamma$) computed in parallel. Specifically, we select the solution at the smallest spatial scale for which the condition number of $R$ is lower than some number, $\kappa_{max}$. In serial simulation this criterion allows us to try smaller scales, which require less integration time, before larger ones. Furthermore, it does not require that we actually solve (16) at each scale. Note that $\kappa_{max}$ is directly related to the amount of expected noise in the constraint information. One must set this parameter high when available constraints are known to be accurate, and low when high levels of noise exist. Later in the paper we heuristically determine this parameter for simulation purposes.
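The serial form of this selection rule can be sketched as follows. The function `select_scale` and the callable `condition_number` are hypothetical names; how the condition number of the constraint matrix is obtained at a given scale is left to the caller and is not specified here.

```python
def select_scale(sigmas, condition_number, kappa_max):
    """Return the smallest scale whose constraint matrix is well conditioned.

    sigmas           : candidate neighborhood sizes
    condition_number : callable mapping a scale to the condition number of
                       the constraint matrix at that scale (assumed given)
    kappa_max        : maximum allowed condition number
    Returns None when no scale passes, i.e. the ambiguous case.
    """
    for sigma in sorted(sigmas):
        # Smaller scales are tried first; the system itself is never solved.
        if condition_number(sigma) < kappa_max:
            return sigma
    return None
```

A return value of `None` corresponds to the situation in which the solution remains underdetermined even in the largest neighborhood.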

9  Minimal  Solutions  for  Ambiguous 
Cases 

Despite  selection  of  the  optimal  neighborhood,  the  solu¬ 
tion  to  (16)  can  be  underdetermined  or  unstable,  even 
in  the  largest  neighborhood.  This  situation  occurs  when 
there  exists  more  than  one  affine  mapping  between  two 
contours.  Consider,  for  example,  a  linear  contour  where 
the  constraint  lines  for  all  points  are  identical.  In  this 
case  any  combination  of  scaling  or  shearing  along  the 
constraint  line,  in  addition  to  any  translation  taking  the 
line  to  the  constraint  line,  satisfies  the  constraints.  An¬ 
other  example  is  the  matching  of  concentric  circles.  If  at 
each  point  along  the  first  circle  the  constraint  line  is  par¬ 
allel  to  the  tangent  to  the  circle  at  that  point,  then  any 
amount  of  rotation,  coupled  with  an  appropriate  scaling, 
is  possible  (see  examples  in  Figure  4). 

For these singular cases, we select the minimal solution, which minimizes

$$\Lambda = \sum_{i=1}^{n} \left( w_{fi} \left| A p_{fi} + t - (p_{fi} + t_{min}) \right| \right)^2 \qquad (31)$$

where $n$ is the number of points (not just contour points) in the neighborhood. $\Lambda$ summarizes over the entire neighborhood the squared deviation between predicted point matches and $t_{min}$, the smallest pure translation that satisfies neighborhood constraints in the least squares sense.

There are a few reasons for minimizing $\Lambda$. First, it yields intuitive results because the default is simply a purely translational matching that is closest to the normal component of the match, as illustrated in Figure 4. Any amount of rotation of the circle, or any amount of tangential translation of the line, results in a set of match vectors that deviate more from their normal components. Second, minimizing $\Lambda$ uniquely (stably) determines the match, regardless of the constraint information. Even for collinear points with parallel constraint lines (which can be matched by many purely translational transformations), the average neighborhood normal component is itself the only matching solution that minimizes $\Lambda$. Furthermore, $\Lambda$ is robust to noise because it integrates information over the entire local neighborhood, as opposed to directly minimizing the deviation between the solution and the normal component, which is sensitive to noise in the constraint line. Third, minimizing $\Lambda$ preserves continuous matches along contours. As before, calculation over Gaussian local neighborhoods guarantees continuous variation of $\Lambda$. Finally, a closed form solution for minimizing $\Lambda$ is efficiently calculated. The optimal solution is selected at each point independently of the final solutions at other points, as opposed to global selection methods which select the set of contour matches to optimize an overall measure, such as smoothness along the contour.


Figure  4:  Minimal  solutions  for  ambiguous  transfor¬ 
mations  are  intuitive:  (a)  concentric  circles  could  be 
matched  with  any  amount  of  rotation,  but  pure  expan¬ 
sion  should  be  the  default;  (b)  parallel  lines  could  be 
matched  using  a  translation  in  any  direction  within 
90  degrees  of  the  normal  component,  but  the  default 
should  be  the  normal  component. 


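Given $t_{min}$, minimizing $\Lambda$ is identical to finding the affine transformation that matches special points. Therefore, to minimize $\Lambda$ we solve a system of linear equations, similar to those of (13) with $p'_{fi} = p_{fi} + t_{min}$, given by

$$F^T W_f^T W_f F a = F^T W_f^T W_f v. \qquad (32)$$

Here $F$ is a $2n \times 6$ dimensional matrix with rows $f_{1x}, f_{1y}, \ldots, f_{nx}, f_{ny}$, where

$$f_{ix} = [\, x_{fi} \;\; y_{fi} \;\; 0 \;\; 0 \;\; 1 \;\; 0 \,], \quad f_{iy} = [\, 0 \;\; 0 \;\; x_{fi} \;\; y_{fi} \;\; 0 \;\; 1 \,], \qquad (33)$$

$v$ is a length $2n$ concatenation of $v_1 \ldots v_n$, where

$$v_i = p_{fi} + t_{min}, \qquad (34)$$

and $W_f$ is the diagonal matrix of weights for neighborhood points given by

$$W_f = \mathrm{diag}(w_{f1}, w_{f1}, w_{f2}, w_{f2}, \ldots, w_{fn}, w_{fn}). \qquad (35)$$

To minimize $\Lambda$ subject to the matching constraints we solve

$$\min_a \left(a^T Q a - 2a^T h\right) \quad \text{subject to} \quad Ra = l \qquad (36)$$

where

$$Q = F^T W_f^T W_f F, \quad h = F^T W_f^T W_f v. \qquad (37)$$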

Once  again,  tmin  is  the  smallest  least  squares  neigh¬ 
borhood  pure  translation.  In  the  next  section  we  simplify 
the  existing  constraints  in  order  to  directly  compute  this 
default,  and  eventually  the  match  vector  itself,  later  in 
the  paper. 
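As an illustration of how the quadratic form in (36) and (37) could be accumulated, the sketch below builds the 6 x 6 matrix and the length-6 vector point by point. It assumes the parameter ordering [a11, a12, a21, a22, tx, ty], which is our reading of the row layout rather than something the text states outright, and the function name `build_q_h` is hypothetical.

```python
def build_q_h(points, weights, t_min):
    """Accumulate Q = F^T W^T W F and h = F^T W^T W v.

    points  : (x, y) neighborhood points in local coordinates
    weights : one weight per point
    t_min   : (tx, ty) default translation
    Assumes the parameter vector is ordered [a11, a12, a21, a22, tx, ty].
    """
    q = [[0.0] * 6 for _ in range(6)]
    h = [0.0] * 6
    for (x, y), w in zip(points, weights):
        # Two rows of F for this point, one per image coordinate.
        rows = ([x, y, 0.0, 0.0, 1.0, 0.0], [0.0, 0.0, x, y, 0.0, 1.0])
        # Target v_i = p_i + t_min, i.e. the purely translated point.
        targets = (x + t_min[0], y + t_min[1])
        w2 = w * w
        for f, v in zip(rows, targets):
            for a in range(6):
                h[a] += w2 * f[a] * v
                for b in range(6):
                    q[a][b] += w2 * f[a] * f[b]
    return q, h
```

Because every target is the input point shifted by the default translation, the pure translation [1, 0, 0, 1, tx, ty] satisfies Q a = h exactly, which makes a convenient sanity check.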

10  Simplifications  Arising  from  an 
Implicit  Formulation 

Before deriving closed form solutions for $t_{min}$ and $a$, it is useful to consider the final match given by the local affine transformation applied to the local origin itself,

$$p'_j = A p_j + t = A \begin{bmatrix} 0 \\ 0 \end{bmatrix} + t = t. \qquad (38)$$

Since the match itself depends only on the translational component, we need only implicitly solve for the rotation, scaling, and shearing components and thereby directly compute the match vector. In fact, an implicit formulation should be more efficient and stable. Increased efficiency follows from inverting 2 x 2 and 4 x 4 submatrices of the 6 x 6 $R$ matrix instead of inverting $R$ itself. Increased stability follows from analysis of the elements of $R$: since the coefficients of $t$ do not depend on the size of the local neighborhood, while the coefficients of $A$ grow as the square of the local neighborhood size, uniform random noise affects these elements differently. By separating the translational components, this undesirable property should disappear.

First, let (16) be rewritten as

$$\begin{bmatrix} R_r & R_c \\ R_c^T & R_t \end{bmatrix} \begin{bmatrix} r \\ t \end{bmatrix} = \begin{bmatrix} l_r \\ l_t \end{bmatrix} \qquad (39)$$

where $R_t$ is a 2 x 2 matrix relating the translational affine coefficients, $t$, to the 2-dimensional vector, $l_t$, $R_r$ is a 4 x 4 matrix relating the rotational, shearing, and scaling affine coefficients, $r$, to the 4-dimensional vector, $l_r$, and $R_c$ is a 4 x 2 coupling matrix. Likewise, $Q$ and $h$ can be represented by

$$Q = \begin{bmatrix} Q_r & Q_c \\ Q_c^T & Q_t \end{bmatrix}, \quad h = \begin{bmatrix} h_r \\ h_t \end{bmatrix}.$$

At this point, it is advantageous to expand and simplify several of these submatrices. Expanding, we obtain

$$Q_t = \left(\sum_{i=1}^{n} w_{fi}^2\right) I, \quad Q_c = 0, \quad h_t = \left(\sum_{i=1}^{n} w_{fi}^2\right) t_{min}$$

(taking advantage of the fact that $\sum_{i=1}^{n} w_{fi}^2 x_{fi} = \sum_{i=1}^{n} w_{fi}^2 y_{fi} = 0$ for elliptical neighborhoods due to odd symmetry), where $I$ is the 2 x 2 identity matrix. Note that the elements of $Q_r$ are second moments taken within the elliptical neighborhood. With this in mind, these second moments should be roughly proportional to the second moments, $J_{xx}$, $J_{xy}$, and $J_{yy}$, that are used to derive this neighborhood (see (24)). Therefore, a reasonable estimation is

$$Q_r \approx \begin{bmatrix} J & 0 \\ 0 & J \end{bmatrix}. \qquad (46)$$


11  Determining  The  Default 
Translation 

Using (39), finding $t_{min}$ for the neighborhood is much like finding the best local affine transformation with the rotational, shearing, and scaling components set to zero (i.e. $A = I$, the 2 x 2 identity matrix), or

$$r = r_{min} = [\, 1 \;\; 0 \;\; 0 \;\; 1 \,]^T. \qquad (47)$$

Thus, we simply solve a system of equations

$$R_t t_{min} = (l_t - R_c^T r_{min}). \qquad (48)$$

Since the determination of a purely translational matching merely requires the intersection of two constraint lines, this system of equations is underconstrained only when there are no special point matches and the constraint lines are parallel. In this case, we choose the solution closest to the average normal component by finding, using the general pseudoinverse (denoted by $+$), the smallest translation satisfying the constraints. Hence,

$$t_{min} = R_t^{+}(l_t - R_c^T r_{min}). \qquad (49)$$

The pseudoinverse formulation used for this and all subsequent calculations is presented in Appendix B. It differs from the conventionally used form in several ways. First, to promote stability, the SVD threshold is chosen such that the maximum allowed condition number, $\kappa_{max}$ (introduced previously), mandates the absolute minimum eigenvalue for a given matrix. Second, values below this threshold are not immediately deemed singular, but rather continuously default to zero as they drop below the threshold. Without this latter modification, matching solutions along contours may vary discontinuously, defeating the smoothing properties of the Gaussian neighborhoods and producing rather non-intuitive results.
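A minimal sketch of the default translation follows, assuming that the 2 x 2 translational block is symmetric and positive semidefinite so that its pseudoinverse can be formed from a closed-form eigendecomposition. The specific falloff below the threshold (eigenvalue divided by the squared threshold) and the threshold itself (largest eigenvalue divided by the maximum condition number) are our assumptions, chosen only to be continuous and to default to zero; the names `pinv_sym2` and `default_translation` are hypothetical.

```python
import math

def pinv_sym2(m, kappa_max):
    """Continuous pseudoinverse of a symmetric PSD 2 x 2 matrix (sketch)."""
    a, b, d = m[0][0], m[0][1], m[1][1]
    rho = math.hypot(a - d, 2.0 * b)
    lam_hi, lam_lo = 0.5 * (a + d + rho), 0.5 * (a + d - rho)
    phi = 0.5 * math.atan2(2.0 * b, a - d)
    c, s = math.cos(phi), math.sin(phi)
    out = [[0.0, 0.0], [0.0, 0.0]]
    floor = lam_hi / kappa_max  # assumed eigenvalue threshold
    if floor <= 0.0:
        return out
    for lam, (vx, vy) in ((lam_hi, (c, s)), (lam_lo, (-s, c))):
        # Assumed falloff: reciprocal above the floor, linear ramp below it.
        inv = 1.0 / lam if lam > floor else lam / (floor * floor)
        out[0][0] += inv * vx * vx
        out[0][1] += inv * vx * vy
        out[1][0] += inv * vy * vx
        out[1][1] += inv * vy * vy
    return out

def default_translation(r_t, l_t, r_c_t, r_min, kappa_max):
    """Return t_min = R_t^+ (l_t - R_c^T r_min); r_c_t is the 2 x 4 block."""
    rhs = [l_t[k] - sum(r_c_t[k][j] * r_min[j] for j in range(4)) for k in range(2)]
    p = pinv_sym2(r_t, kappa_max)
    return (p[0][0] * rhs[0] + p[0][1] * rhs[1],
            p[1][0] * rhs[0] + p[1][1] * rhs[1])
```

For three parallel constraint lines with unit normal (0, 1) and offset 2, the translational block is singular, and the sketch returns the purely normal translation (0, 2).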

12  Direct  Match  Determination 

Making use of the separation of affine components and the simplifications made possible by this separation, (36) is equivalent to

$$\min_{t,\,r,\,\Gamma_t,\,\Gamma_r} \left\{ r^T(Q_r r - 2h_r) + \left(\sum_{i=1}^{n} w_{fi}^2\right) t^T(t - 2t_{min}) + \Gamma_t^T(R_c^T r + R_t t - l_t) + \Gamma_r^T(R_c t + R_r r - l_r) \right\} \qquad (50)$$

where $\Gamma_t$ and $\Gamma_r$ are the Lagrange multipliers for the translational and rotational parts of the constraints respectively. Taking the partial derivatives with respect to $\Gamma_t$, $\Gamma_r$, $t$ and $r$ and setting them to zero respectively leaves

$$R_c^T r + R_t t = l_t \qquad (51)$$

$$R_r r + R_c t = l_r \qquad (52)$$

$$t = \lambda_t + t_{min} \qquad (53)$$

$$r = Q_r^{-1}\lambda_r + r_{min} \qquad (54)$$

where

$$\lambda_t = -\frac{1}{2\sum_{i=1}^{n} w_{fi}^2}\left(R_t^T\Gamma_t + R_c^T\Gamma_r\right), \quad \lambda_r = -\frac{1}{2}\left(R_r^T\Gamma_r + R_c\Gamma_t\right) \qquad (55)$$

since $Q_r^{-1}h_r = r_{min}$, the default for rotational, shearing, and scaling components.

Finally, substitution of (53) and (54) into (51) and (52) then allows us to solve for $\lambda_t$ and derive

$$t = \left(R_t - R_c^T R_f R_c\right)^{+} \left[ \left(l_t - R_c^T R_f l_r\right) + \left(R_c^T R_f R_c - R_t\right) t_{min} + \left(R_c^T R_f R_r - R_c^T\right) r_{min} \right] + t_{min} \qquad (56)$$

where

$$R_f = Q_r^{-1}\left(R_r Q_r^{-1}\right)^{+} \approx \begin{bmatrix} J^{-1}|J| & 0 \\ 0 & J^{-1}|J| \end{bmatrix} \left( R_r \begin{bmatrix} J^{-1}|J| & 0 \\ 0 & J^{-1}|J| \end{bmatrix} \right)^{+} \qquad (57)$$

and

$$J^{-1}|J| = \begin{bmatrix} J_{yy} & -J_{xy} \\ -J_{xy} & J_{xx} \end{bmatrix}. \qquad (58)$$

Equation (56) not only minimizes the overall deviation between neighborhood matches and the best pure translation, but it also guarantees a unique, stable match regardless of the constraints or the neighborhood. When $R_r$ and $Q_r$ are non-singular, then $R_f = R_r^{-1}$ and the default, $r_{min}$, factors out of the solution. Otherwise, one must use the pseudoinverse. The latter case occurs when the neighborhood contour points are collinear, since at least three non-collinear point matches are required to uniquely define an affine transformation. When this situation occurs, $Q_r$ affects the solution by altering the eigenvalues of $R_r$. It essentially transforms the Euclidean affine parameter space to one in which finding the closest affine solution vector to the purely translational affine transformation,

$$a_{min} = \begin{bmatrix} r_{min} \\ t_{min} \end{bmatrix}, \qquad (59)$$

yields an affine transformation that minimizes $\Lambda$.

13  Implementation  and  Results 

The matching scheme was implemented in C on a Sun workstation. Five spatial scales were employed. The sizes, in terms of the standard deviation of the local Gaussian neighborhood, $\sigma_\gamma$, were 4, 8, 16, 32, and 64 pixels. These sizes were chosen so that the largest neighborhood was on the order of the size of the examples, thereby ensuring that the largest local neighborhood could roughly include the constraints of the entire example. Input consisted of two 256 x 256 pixel binary contour images. Some were synthetically produced, while others were extracted zero crossings of smoothed and differentiated natural imagery [22]. Though the method should be able to deal with any set of contour images with partially constrained matches, the two images chosen for each simulation were relatively aligned to reflect the fact that constraint lines are most easily derived from mildly differing imagery (i.e. time sampled imagery for short-range motion, or roughly aligned objects for object recognition). The "correct" matches between each pair of images were known for synthetic imagery and hand-picked for natural imagery. The constraint line for each contour point was determined by finding the local orientation of the contour, then taking the line with the same orientation that passed through the "correct" match. In the context of visual motion, these constraint lines roughly mimicked a local measurement of normal velocity and allowed comparisons to be made with the results of previous methods for recovering optical flow.
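The construction of a simulated constraint line can be sketched as below. The representation of a line as a unit normal and an offset, and the function names `constraint_line` and `normal_component`, are our own illustrative choices rather than details given in the text.

```python
import math

def constraint_line(orientation, match):
    """Line with the contour's local orientation passing through the match.

    orientation : local contour orientation in radians
    match       : (x, y) position of the "correct" match
    Returns (n, c) such that the line is the set of q with n . q = c.
    """
    n = (-math.sin(orientation), math.cos(orientation))  # unit normal
    c = n[0] * match[0] + n[1] * match[1]
    return n, c

def normal_component(point, n, c):
    """Displacement from point to the constraint line along the normal."""
    d = c - (n[0] * point[0] + n[1] * point[1])
    return (d * n[0], d * n[1])
```

For a horizontal contour point at (10, 5) whose correct match is (13, 9), the constraint line is y = 9 and the normal component of the match is (0, 4); the tangential part of the displacement is left unconstrained, as with a local measurement of normal velocity.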


To assess the effect of noise upon the matching process, 10% random noise was added to the components of the normals describing the constraint lines and special point matches. Assuming this equal noise distribution, a value of 0.5 was used for $\alpha$, the accuracy of special points relative to the accuracy of constraint lines. Furthermore, the parameter $\kappa_{max}$, used both for selecting the size of the neighborhoods and for setting the SVD threshold in pseudoinverse operations, was experimentally determined by applying the scheme to the worst case (most ambiguous) matching problem, the matching of parallel lines (see example 1b in Appendix A). The average percentage deviation of matches with respect to the actual normal components of the match (the expected minimal solution) was determined for several values of $\kappa_{max}$. Using the results, shown in Figure 5, the tradeoff between stability (low $\kappa_{max}$) and accuracy (high $\kappa_{max}$) was balanced by choosing $\kappa_{max} = 74.0$, the highest value for which the average percentage error did not exceed the percentage noise. This value was used for all subsequent examples, the results of which are shown in Appendix A. Note that this choice seems particularly appropriate considering the large increase in matching error as $\kappa_{max}$ increases beyond this value.

14  Discussion  of  Results 

As  demonstrated  by  the  results  presented  in  Appendix 
A,  the  proposed  contour  matching  scheme  robustly  re¬ 
covers  the  correspondences  of  both  synthetic  and  natu¬ 
ral  contour  imagery.  First,  unlike  most  other  methods 
for  matching  contours,  this  scheme  yields  intuitive  so¬ 
lutions  for  ambiguous  cases,  such  as  parallel  lines.  A 
second  attribute  of  the  scheme  concerns  the  usefulness 


of  specially  matched  points.  While  special  points  ob¬ 
viously  aid  the  matching  process  of  the  ambiguously 
matched  parallel  lines,  the  scheme  did  not  exploit  any 
special  points  for  subsequent  examples  and  nevertheless 
recovered  the  actual  correspondences.  This  observation 
indicates  that,  while  special  points  are  especially  impor¬ 
tant  for  the  matching  of  ambiguous  examples,  they  are 
no more powerful than redundant constraint lines when used in conjunction with an unambiguous example.

A  third  observation  is  that  analysis  at  several  neigh¬ 
borhood  scales  greatly  enhances  the  scheme’s  ability 
to  recover  exact  matches,  particularly  those  collective 
matches  not  modeled  precisely  by  a  global  affine  trans¬ 
formation.  For  example,  the  three-dimensional  distor¬ 
tions  used  to  generate  the  space  curve  did  not  present 
a  problem  to  the  matching  scheme.  The  attribution 
of  this  capability  to  several  neighborhood  sizes  is  not 
obvious  from  the  results,  but  instead  from  simulations 
that  reported  the  sizes  of  the  selected  neighborhoods  at 
every  contour  point.  Matching  the  views  of  the  face, 
for  example,  required  small  neighborhoods  for  match¬ 
ing  contours  with  intricate  detail  (such  as  the  eyes  and 
nose)  and  large  neighborhoods  for  matching  nearly  lin¬ 
ear  contours  (such  as  the  occluding  boundary  between 
the  face  and  the  background).  It  is  not  until  we  examine 
the  matches  predicted  for  the  three-dimensional  rotated 
wire  frame  that  we  see  the  limitations  of  multiple  scales. 
Though  the  overall  matching  solution  is  quite  accurate, 
the  matching  of  overlapping  contours  undergoing  sepa¬ 
rate  motion  is  highly  inaccurate  because  information  has 
in  this  case  been  integrated  over  independent  contours. 

15  Conclusion  and  Future  Work 

A  method  for  determining  matches  between  contours  in 
two  images  has  been  proposed,  developed,  and  tested, 
assuming  that  constraint  lines,  each  narrowing  down  the 
match  for  a  contour  point  in  the  first  image  to  a  line  in 
the  second  image,  are  available  for  several  contour  points 
in  the  first  image.  Suggested  applications  include  the  de¬ 
termination  of  optical  flow  in  short-range  motion  and  the 
matching  of  aligned  contour  views  in  either  alignment- 
based  object  recognition  or  long-range  motion. 

To  determine  the  match  for  a  contour  point  we  find  the 
best  affine  transformation,  in  a  weighted  least  squares 
sense,  that  satisfies  the  match  constraint  lines  and  spe¬ 
cially  matched  points  within  an  oriented  elliptical  neigh¬ 
borhood.  This  neighborhood  is  established  by  weighing 
the  constraint  equations.  The  weight  for  a  given  point 
constraint  is  the  modulation  of  the  Gaussian  distance  to 
the  local  origin,  which  establishes  a  circularly  symmet¬ 
ric  local  neighborhood,  by  the  Gaussian  distance  from 
the  axis  of  local  orientation,  which  attempts  to  limit  the 
neighborhood  to  a  single  contour.  The  width  of  the  mod¬ 
ulating  Gaussian  is  set  such  that  the  axes  of  the  elliptical 
neighborhood  are  proportional  to  the  local  axes  of  iner¬ 
tia.  To  determine  the  width  of  the  circularly  symmetric 
Gaussian,  the  effective  size  of  the  local  neighborhood, 
we  consider  several  sizes  simultaneously  and  choose  the 
smallest one which yields a stable, unique solution. Stability is tested by comparing the condition number of the constraint matrix, $R$, with the maximum condition number, $\kappa_{max}$, set according to the expected noise level. In the event that this condition number exceeds $\kappa_{max}$ even at the largest neighborhood, there is more than one possible solution, from which we choose the smallest affine transformation, namely the one whose predicted neighborhood matches deviate the least from the smallest least squares purely translational matching. Unique determination of both the smallest pure translation and the final match is guaranteed through the use of a modified general pseudoinverse.

Note  that  the  match  for  each  point  may  be  computed 
in  parallel.  Furthermore,  since  the  predicted  match  for 
a  point  is  simply  given  by  the  translational  component 
of  the  local  affine  transformation,  finding  the  match  in¬ 
volves  explicitly  solving  for  only  the  two  translational 
components  of  the  affine  transformation.  A  closed  form 
solution  for  the  match  involves  using  a  continuous  ver¬ 
sion  of  the  pseudoinverse  to  invert  a  2  x  2  matrix  and  a  4 
X  4  matrix,  the  coefficients  of  which  are  weighted  sum¬ 
mations  of  local  point  constraint  parameters  that  may 
be  determined  in  parallel. 

Simulation  results  show  that  the  scheme  performs  well 
for  most  examples,  including  noisy  synthetic  imagery 
and  edges  extracted  from  natural  imagery.  Minimal  so¬ 
lutions  obtained  when  the  recovery  is  ill-posed  are  intu¬ 
itive.  Incorporation  of  terminators  improves  the  corre¬ 
spondence  significantly  for  near-ambiguous  examples. 

Despite  the  scheme’s  performance,  further  research  re¬ 
mains.  First,  the  precise  conditions  under  which  the  lo¬ 
cal  affine  transformation  is  uniquely  determined  by  the 
constraints  should  be  better  understood.  Second,  simu¬ 
lations  should  more  rigorously  test  matching  capabilities 
and  limitations.  Third,  the  application  of  the  scheme 
to  recovering  optical  flow,  matching  features  for  object 
recognition,  and  finding  correspondence  for  long-range 
motion  can  be  addressed.  Finally,  extensions  can  be  con¬ 
sidered,  including  the  use  of  depth  information  in  the  for¬ 
mation  of  ellipsoidal  neighborhoods  that  avoid  difficul¬ 
ties  caused  by  overlapping,  independently  matched  con¬ 
tours,  and  the  incorporation  of  prediction  mechanisms 
(such  as  Kalman  filtering  techniques)  for  applications 
that  are  aided  by  the  temporal  improvement  of  match¬ 
ing  solutions  over  multiple  image  frames. 

16 Acknowledgements

We thank Eric Grimson and Amnon Shashua for looking
over  drafts  of  this  paper.  This  report  describes  research 
done  at  the  Artificial  Intelligence  Laboratory  of  the  Mas¬ 
sachusetts  Institute  of  Technology.  Support  for  the  labo¬ 
ratory’s  artificial  intelligence  research  is  provided  in  part 
by  National  Science  Foundation,  contract  number  IRI 
8900267,  and  in  part  by  the  Advanced  Research  Projects 
Agency  of  the  Department  of  Defense  under  Office  of 
Naval  Research  contract  N00014-85-K-0124. 

A  Simulation  Results 

In each example presented the first contour image is shown in black, and the second contour image is superimposed as dotted contours. In column I, vectors describing the constraint lines are shown for sampled points


along  the  first  contour,  and  special  match  vectors  are 
indicated  by  small  squares  at  the  endpoints.  10%  noise 
has  been  added  to  these  vectors.  The  computed  and  ac¬ 
tual  match  vectors  for  sampled  points  are  then  shown 
in  columns  II  and  III,  allowing  qualitative  comparison. 
Quantitative  results  are  also  presented.  At  each  sampled 
point,  the  relative  error  is  computed  by  normalizing  the 
distance  between  predicted  and  actual  matches  by  the 
length  of  the  actual  match  vector.  The  average  relative 
error,  computed  over  all  sampled  points,  is  reported  for 
each  example.  Note  that  predicted  matches  do  not  nec¬ 
essarily  lie  exactly  upon  the  second  contour,  especially 
for  the  matching  of  ambiguous  examples.  In  practice, 
final  matches  may  be  found  by  taking  the  point  on  the 
second  contour  that  is  closest  to  the  predicted  match. 
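The quantitative measure described above can be written out as a short sketch. Match vectors are assumed to be given as (dx, dy) displacements, and skipping points whose actual match vector has zero length is our own convention; the function name `average_relative_error` is hypothetical.

```python
import math

def average_relative_error(predicted, actual):
    """Mean over sampled points of |predicted - actual| / |actual|.

    predicted, actual : sequences of (dx, dy) match vectors
    Points with a zero-length actual match vector are skipped (assumption).
    """
    errors = []
    for (px, py), (ax, ay) in zip(predicted, actual):
        length = math.hypot(ax, ay)
        if length == 0.0:
            continue
        errors.append(math.hypot(px - ax, py - ay) / length)
    return sum(errors) / len(errors) if errors else 0.0
```

For instance, predicting (1, 0) where the actual match vector is (2, 0) gives a relative error of 0.5 at that point.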

Rows 1a and 1b respectively present the matching of synthetic parallel lines with and without terminator matches. In row 2a an orthographic projection of a synthetic wire-frame is matched for purely translational correspondence, and in row 2b the same wire-frame is rotated by 20 degrees about each axis and matched. Row 3 presents the matching of an orthographic projection of an arbitrary synthetic 3D space curve rotated by 10 degrees about each axis, translated, scaled by a factor of 1.2, and stretched by a factor of 1.1 along each axis. Finally, the matching of the edges obtained from a pair of roughly aligned natural views of a pair of scissors (borrowed from [6]), a tank, and a doll face (borrowed from [29]) are respectively presented in rows 4 through 6.

B  Modified  Pseudoinverse 

This appendix describes a modified general pseudoinverse. First, an n x m matrix $A$ is uniquely diagonalized using singular value decomposition techniques (SVD):

$$A = Q_1 \Lambda Q_2^T \qquad (60)$$

where $Q_1$ and $Q_2$ are n x m and m x n orthonormal matrices respectively, $\Lambda$ is the m x n matrix of the singular values of $A$,

$$\Lambda = \begin{bmatrix} \sqrt{\lambda_1} & & 0 & 0 \\ & \ddots & & \vdots \\ 0 & & \sqrt{\lambda_n} & 0 \end{bmatrix} \qquad (61)$$

(in this particular case for n < m) and $\lambda_1 \ldots \lambda_n$ are the eigenvalues of $A^T A$. Given this diagonalization, the modified pseudoinverse is

$$A^{+} = Q_2 \Lambda^{+} Q_1^T \qquad (62)$$


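where

$$\Lambda^{+}_{ii} = \begin{cases} 1/\Lambda_{ii} & \text{if } \Lambda_{ii} > \ldots \\ \Lambda_{ii}/\ldots & \text{otherwise} \end{cases} \qquad (63)$$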


As a first modification, the elements of $\Lambda^{+}$ are continuous functions of the elements of $\Lambda$ and gradually default to zero. As a second modification, the singular value threshold is chosen for stability by guaranteeing that the effective condition number of $A$, the ratio of the maximum to minimum eigenvalues, is at most $\kappa_{max}$. (See [30] for a more detailed presentation of SVD and the general pseudoinverse.)
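The two modifications can be illustrated on a single singular value. The sketch below is not the formula of (63), whose threshold and falloff are not fully legible here; it assumes a threshold equal to the largest singular value divided by the maximum condition number and a linear ramp below it, which is one simple choice that is continuous at the threshold and defaults to zero. The function name `modified_reciprocal` is hypothetical.

```python
def modified_reciprocal(s, s_max, kappa_max):
    """Continuous replacement for 1/s in a pseudoinverse (illustrative).

    s         : singular value to invert
    s_max     : largest singular value of the matrix
    kappa_max : maximum allowed condition number
    """
    if s_max <= 0.0:
        return 0.0
    eps = s_max / kappa_max  # assumed threshold
    if s > eps:
        return 1.0 / s
    # Assumed falloff: equals 1/eps at s = eps and decays linearly to 0.
    return s / (eps * eps)
```

With a largest singular value of 74 and a maximum condition number of 74, the threshold is 1: a singular value of 2 inverts to 0.5 as usual, while a singular value of 0.5 is mapped to 0.5 instead of 2, and 0 is mapped to 0.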

[Appendix A figure panels; recoverable panel title: (III) Actual Matches]


References 

[1]  E.  H.  Adelson  and  J.  A.  Movshon.  Phenomenal  co¬ 
herence  of  moving  visual  patterns.  Nature,  300:523- 
525,  1982. 

[2]  P.  Anandan  and  R.  Weiss.  Introducing  a  smooth¬ 
ness  constraint  in  a  matching  approach  for  the  com¬ 
putation  of  optical  flow  fields.  In  IEEE  Workshop 
on  Computer  Vision:  Representation  and  Control, 
pages  186-194,  Bellaire,  MI,  October  1985. 

[3] K. Arbter. Affine-invariant Fourier descriptors. In
COST 13 Workshop, pages 22-27, Bonas, France,
August 1988.

[4] I. A. Bachelder. Contour matching using local affine
transformations. Master's thesis, Massachusetts Institute
of Technology, Cambridge, MA, June 1991.

[5] I. A. Bachelder and S. Ullman. Contour matching
using local affine transformations. A.I. Memo 1326
(in press), The Artificial Intelligence Lab., M.I.T.,
1991.

[6] R. Basri. Recognition of 3-D solid objects from 2-D
images. PhD thesis, Weizmann Institute, Rehovot,
Israel, October 1990.

[7] P. J. Burt, J. Bergen, R. Hingorani, S. Peleg, and
P. Anandan. Dynamic analysis of image motion
for vehicle guidance. In IEEE International Workshop
on Intelligent Motion Control, pages 75-82,
Bogazici University, Istanbul, August 20-22 1990.

[8] P. J. Burt, J. R. Bergen, R. Hingorani, R. Kolczynski,
W. A. Lee, A. Leung, J. Lubin, and
H. Shvaytser. Object tracking with a moving camera.
In Proceedings of the Workshop on Visual Motion,
pages 2-12, Irvine, CA, March 20-22 1989.

[9] D. Cyganski and J. A. Orr. Applications of tensor
theory to object recognition and orientation determination.
IEEE Trans Patt Anal Mach Intell,
PAMI-7(6):662-673, 1985.

[10]  C.  L.  Fennema  and  W.  B.  Thompson.  Velocity 
determination  in  scenes  containing  several  moving 
objects.  Comput.  Graph.  Image.  Proc.,  9:301-315, 
1979. 

[11]  J.  J.  Gibson  and  E.  J.  Gibson.  Continuous  perspec¬ 
tive  transformations  and  the  perception  of  rigid  mo¬ 
tion.  Journal  of  Experimental  Psychology,  54:129- 
138,  1957. 

[12]  E.  C.  Hildreth.  The  Measurement  of  Visual  Motion. 
MIT  Press,  Cambridge,  1984a. 

[13]  E.  C.  Hildreth.  The  computation  of  the  velocity 
field.  Proc.  R.  Soc.  London  B,  221:189-220,  1984b. 

[14]  E.  C.  Hildreth  and  S.  Ullman.  The  computational 
study  of  vision.  A.I.  Memo  1038,  The  Artificial  In¬ 
telligence  Lab.,  M.I.T.,  April  1988. 

[15]  J.  Hong  and  X.  Tan.  The  similarity  between  shapes 
under affine transformation. Robotics Res. Rep 133,
New  York  University,  December,  1987. 

[16]  B.  K.  P.  Horn.  Robot  Vision.  The  MIT  Press  and 
McGraw-Hill,  Cambridge  and  NY,  1986. 


[17] B. K. P. Horn and B. G. Schunck. Determining optical
flow. Artif. Intell., 17:185-203, 1981.

[18] D. P. Huttenlocher and S. Ullman. Recognizing
solid objects by alignment with an image. International
Journal of Computer Vision, 5(2):195-212,
1990.

[19] J. J. Koenderink and A. J. van Doorn. Local structure
of movement parallax of the plane. Journal of
the Optical Society of America, 66:717-723, 1976.

[20] Y. Lamdan, J. T. Schwartz, and H. J. Wolfson.
Affine invariant model-based object recognition.
IEEE Transactions on Robotics and Automation,
6(5):578-589, 1990.

[21]  J.  S.  Lappin  and  H.  H.  Bell.  The  detection  of  co¬ 
herence  in  moving  random  dot  patterns.  Vision  Re¬ 
search,  16:161-168,  1976. 

[22]  D.  Marr  and  E.  C.  Hildreth.  Theory  of  edge  detec¬ 
tion.  Proc  R  Soc.  London  B,  207:187-217,  1980. 

[23]  D.  Marr  and  T.  Poggio.  Cooperative  computation 
of  stereo  disparity.  Science,  194:283-287,  1976. 

[24] J. E. W. Mayhew and J. P. Frisby. Psychophysical
and computational studies towards a theory of human
stereopsis. Artificial Intelligence, 17:349-385,
1981.

[25] H. H. Nagel. Recent advances in image sequence
analysis. In Proc. Premier Colloque Image - Traitement,
Synthèse, Technologie et Applications, pages
545-558, Biarritz, France, May 1984.

[26]  H.  H.  Nagel  and  W.  Enkelmann.  Towards  the 
estimation  of  displacement  vector  fields  by  “ori¬ 
ented  smoothness”  constraints.  In  7th  Int.  Conf.  on 
Pattern  Recognition,  pages  6-8,  Montreal,  Canada, 
July  1984. 

[27]  A.  J.  Pantle  and  L.  Picciano.  A  multistable  display: 
Evidence  for  two  separate  motion  systems  in  human 
vision.  Science,  193:500-502,  1976. 

[28]  G.  F.  Poggio  and  T.  Poggio.  The  analysis  of  stere- 
opsis.  Annual  Reviews  of  Neuroscience,  7:379-412, 
1984. 

[29] A. Shashua. Illumination and 3D object recognition.
In John E. Moody, Steve Hanson, and Richard
Lippmann, editors, Advances in neural information
processing systems 4. Morgan Kaufmann, 1992 (in
press). Proc. NIPS '91, Denver CO.

[30] G. Strang. Introduction to Applied Mathematics.
Wellesley-Cambridge Press, Wellesley, MA, 1986.

[31]  S.  Ullman.  Aligning  pictorial  descriptions:  An  ap¬ 
proach  to  object  recognition.  Cognition,  32(3):193- 
254,  1989. 

[32]  T.  Wakahara.  On-line  cursive  script  recognition  us¬ 
ing  local  affine  transformation.  Syst.  Comput.  Jpn. 
(USA),  20(7):10-19,  1988. 

[33]  A.  M.  Waxman  and  K.  Wohn.  Contour  evolution, 
neighborhood  deformation,  and  global  image  flow: 
Planar  surfaces  in  motion.  International  Journal  of 
Robotics  Research,  4:95-108,  1985. 




HyperBF  Networks  for  Gender  Classification 

R. Brunelli¹, T. Poggio¹,²

¹ Istituto per la Ricerca Scientifica e Tecnologica
I-38050 Povo, Trento, ITALY
² Artificial Intelligence Laboratory
Massachusetts  Institute  of  Technology 
Cambridge,  Massachusetts  02139,  USA 


Abstract 

A  set  of  geometrical  features  is  extracted  au¬ 
tomatically  from  digitised  pictures  of  frontal 
views  of  people  without  facial  hair.  This  com¬ 
pact  description  is  then  used  to  train  two  com¬ 
peting  HyperBF  networks  to  classify  accord¬ 
ing  to  gender.  The  results  using  a  database  of 
twenty  males  and  twenty  females  show  an  av¬ 
erage  performance  of  79%  correct  gender  clas¬ 
sification  on  images  of  new  faces.  Correct  clas¬ 
sification  on  vectors  corresponding  to  new  face 
images  present  in  the  training  set  but  not  used 
in  the  training  phase  rises  to  86%.  Preliminary 
experiments to assess human performance on
the same set of grey level images give an average
result of 90% which, while higher than network
performance, suggests that people's performance
is comparable. Interestingly, the HyperBF
technique finds the relative weights of
the different features and converges to prototypes
of the male and female face that seem to
exaggerate their difference, somewhat like
caricatures do.

1  Introduction 

Faces  allow  people  to  establish,  among  other  things,  the 
gender  of  a  person,  his  (her)  age  and,  to  a  certain  ex¬ 
tent,  emotions.  In  the  current  paper  we  address  gender 
classification  and  will  show  how  limited  geometrical  in¬ 
formation  accounts  for  correct  sex  attribution. 

There are two main strategies for face recognition (and
for object recognition in general): feature comparison
and template matching. The former relies on a set of
selected features which must be computed from an available
image, while the latter directly compares the appearance
of a given instance with a reference image by
means of a suitable metric. The first strategy, when
feasible, works with a compact representation of the objects
to be matched, which are usually represented by
low-dimensional vectors (low as compared to the number of
pixels of a template). The set of features used for recognition
or classification is critical, as it must capture the
discriminating ones and give each of them the correct weight.

In some recent work [4] the problem of face recognition
and gender classification has been tackled using the
internal representation of a compression network as an
unsupervised feature extractor and a (smaller) classification
network taking as input the extracted features. Recent
theoretical results [2] show that the internal representation
of such a network is closely related to a Karhunen-Loewe
expansion (see also [8, 13]) so that the work of
Cottrell et al. should probably be classified in the
template matching category. In our paper we
want to show how limited geometrical information (see
Fig. 1 for the set of features) can give reasonable
performance and possibly provide some insight into human
mechanisms.

2  Gender  Classification 

The  inspection  of  a  face  allows  us  to  establish,  usually 
without  much  effort,  the  gender  of  the  person  we  are 
looking  at.  It  seems  natural  to  mimic  this  ability  with  a 
computer  program.  The  experiment  we  did  is  based  on 
the  use  of  a  geometrical  feature  vector.  In  fact,  the  same 
vector  extracted  for  recognition  purposes  in  a  previous 
paper  [3]  was  used.  The  only  difference  is  that  the  face 
description  has  been  symmetrised  (left  and  right  eye¬ 
brow  and  chin  information  has  been  averaged)  thereby 
reducing  the  dimensionality  of  the  vector. 

All  of  the  features  have  been  extracted  automatically, 
from  images  whose  rotation  and  scale  was  previously  nor¬ 
malised  (by  automatically  locating  eyes).  The  paradigm 
we  used  is  that  of  learning  from  examples,  where  a  sys¬ 
tem  learns  to  discriminate  between  males  and  females 
given  a  sufficient  number  of  examples.  The  system  we 
used  is  based  on  a  classifier  called  Hyper  Basis  Function 
Network  (see  [10]). 

Learning  from  examples  can  be  regarded,  whenever 
the  inputs  and  output  are  expressible  as  numerical  vec¬ 
tors,  as  the  reconstruction  of  an  unknown  function  from 
sparse  data.  From  this  point  of  view  learning  is  equiva¬ 
lent  to  functional  approximation.  Hyper  Basis  Function 
Networks  are  a  tool  for  multivariate  function  approxi¬ 
mation  and  rest  on  a  solid  background  of  results  in  this 
field. 

Before describing the networks used for gender classification
let us briefly recall the fundamentals of the Hyper
Basis Function Network.

Radial Basis Functions can be regarded as a special
case of Regularisation Networks introduced in [10] as
a general approximation technique that can be used in
problems of learning from examples.

Figure 1: Geometrical features (white) used in the gender
classification experiments

A scalar function can be approximated, given its value
on a sparse set of points $\{\mathbf{x}_i\}$, by an expansion in radial
functions:

$$F(\mathbf{x}) = \sum_{i=1}^{N} c_i\, G(\|\mathbf{x} - \mathbf{x}_i\|) \qquad (1)$$

where $\|\cdot\|$ represents the usual Euclidean norm. The
computation of the coefficients $c_i$ rests on the invertibility
of the matrix $G_{ij} = G(\|\mathbf{x}_i - \mathbf{x}_j\|)$, which has been proved
(see Micchelli [9]) for functions such as:

$$G(r) = e^{-(r/c)^2} \qquad (2)$$

$$G(r) = (c^2 + r^2)^{\alpha}, \quad \alpha < 1 \qquad (3)$$

It is possible to use fewer radial functions than examples,
i.e. data points. The resulting overconstrained
system can be solved using a least squares approach under
the conditions of Micchelli's theorem and proves to
be useful when many examples are available [10].
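
The expansion of equation (1) and its least squares variant can be sketched numerically. The following is a minimal illustration, not the authors' implementation: the Gaussian width `WIDTH`, the test function and the grid of sample points are all assumptions chosen only to keep the linear system well conditioned.

```python
import numpy as np

WIDTH = 0.5  # assumed Gaussian width c; not a value taken from the paper

def gaussian(r, c=WIDTH):
    """Gaussian radial function G(r) = exp(-(r/c)^2), as in equation (2)."""
    return np.exp(-(r / c) ** 2)

def design_matrix(points, centers, c=WIDTH):
    """Entries G(||x_i - t_j||): data points x_i in rows, centers t_j in columns."""
    diff = points[:, None, :] - centers[None, :, :]
    return gaussian(np.linalg.norm(diff, axis=2), c)

def fit_rbf(points, values, centers, c=WIDTH):
    """Least squares coefficients; this is interpolation when centers == points."""
    coeffs, *_ = np.linalg.lstsq(design_matrix(points, centers, c), values, rcond=None)
    return coeffs

def evaluate_rbf(x, centers, coeffs, c=WIDTH):
    """Evaluate the expansion at one point or at an array of points."""
    return design_matrix(np.atleast_2d(x), centers, c) @ coeffs

# Illustrative data: a smooth scalar function sampled on a 5 x 5 grid.
axis = np.linspace(-1.0, 1.0, 5)
X = np.array([(u, v) for u in axis for v in axis])
y = np.sin(X[:, 0]) + X[:, 1] ** 2

coeffs_full = fit_rbf(X, y, X)        # one radial function per example
centers = X[::3]                      # 9 of the 25 points used as centers
coeffs_few = fit_rbf(X, y, centers)   # overconstrained system, least squares

residual_full = np.max(np.abs(evaluate_rbf(X, X, coeffs_full) - y))
residual_few = np.max(np.abs(evaluate_rbf(X, centers, coeffs_few) - y))
```

With one radial function per example the data are reproduced up to rounding error, while the reduced expansion trades exactness on the examples for a smaller model.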


Poggio and Girosi [11] have shown that the RBF technique
is a special case of the regularization approach
to the approximation of multivariate functions. From a
more general formulation of the variational problem of
regularization they derive the following approximation
scheme, instead of equation (1):

$$f^*(\mathbf{x}) = \sum_{\alpha=1}^{n} c_\alpha\, G(\|\mathbf{x} - \mathbf{t}_\alpha\|_W^2) + p(\mathbf{x}) \qquad (4)$$

where the parameters $\mathbf{t}_\alpha$, which we call "centers," and
the coefficients $c_\alpha$ are unknown, and are in general fewer
than the data points ($n < N$). The term $p(\mathbf{x})$ is a polynomial
that often can be neglected, though it may be
useful to keep the constant and linear terms. The norm
is a weighted norm

$$\|\mathbf{x} - \mathbf{t}_\alpha\|_W^2 = (\mathbf{x} - \mathbf{t}_\alpha)^T W^T W (\mathbf{x} - \mathbf{t}_\alpha) \qquad (5)$$

where $W$ is an unknown square matrix and the superscript
$T$ indicates the transpose operator. In the simple
case of diagonal $W$ the diagonal elements $w_i$ assign a
specific weight to each input coordinate, determining in
fact the units of measure and the importance of each
sensory input. In this formulation the learning stage is
used to estimate not only the coefficients of the RBF
expansion, but also the metric (problem dependent
dimensionality reduction) and the position of the centers
(optimal examples selection).
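
As a small illustration of the weighted norm of equation (5) and the expansion of equation (4), the sketch below evaluates a HyperBF expansion with a Gaussian basis function. It assumes that the Gaussian is applied to the squared weighted norm as exp(-||x - t||_W^2 / width^2) and omits the polynomial term p(x); the function names and the `width` parameter are hypothetical.

```python
import numpy as np

def weighted_sq_norm(x, t, W):
    """||x - t||_W^2 = (x - t)^T W^T W (x - t), as in equation (5)."""
    d = W @ (x - t)
    return float(d @ d)

def hyperbf(x, centers, coeffs, W, width=1.0):
    """Sum of c_a * G(||x - t_a||_W^2) with a Gaussian G; p(x) is omitted here."""
    return sum(c * np.exp(-weighted_sq_norm(x, t, W) / width ** 2)
               for c, t in zip(coeffs, centers))

# A diagonal W weights each input coordinate separately; a zero entry
# removes the corresponding coordinate from the distance altogether.
W = np.diag([2.0, 0.0])
x = np.array([1.0, 5.0])
t = np.zeros(2)
distance = weighted_sq_norm(x, t, W)   # only the first coordinate counts
response = hyperbf(t, [t], [1.0], W)   # response at the center itself
```

The zero diagonal entry shows how learning the metric can act as a dimensionality reduction: the second coordinate no longer influences the response.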

In a classification task, in which the function range
is represented by the closed interval [0, 1], the value of
the function can be interpreted as a fuzzy predicate. If a
Gaussian function is used, the center of expansion is the
only point at which the predicate assumes value 1: it
can be effectively interpreted as a prototype (note that
the use of HyperBF Networks for classification is directly
related to Bayes estimation as pointed out in [10]).

Using a geometrical vector as input, gender classification
has been attempted by using two competing networks:
one for male recognition and one for female recognition
(see Fig. 2 for the network structure). The gender
to be associated to a given vector is taken to be that
corresponding to the network with the greatest response. It
is interesting to note how each of the networks is able
to create a meaningful prototype of the class it represents.
As can be seen in Figure 2 the expansion center,
which is a vector with components free to move during
the "learning" process, has converged at the end of the
training phase to what could be considered a caricature
of a (fe)male face. It does not correspond to the average
value on the separate subsets: it emphasises the
discriminating features. It is intriguing to speculate that
putative cells in IT cortex involved in gender classification
may be similar to the units of our model, representing
fuzzy (because of the Gaussian) templates such
as the ones of Fig. 2. The learning stage is also able
to change the metric to account for the different weight
and significance of the different features. Of the sixteen
features only three are given a noticeable weight: distance
of eyebrow from eyes, eyebrow thickness and nose
width. These are followed by the vertical position of nose
and mouth and the two radii describing the lower chin
shape; the remaining features are considered ineffective.

Figure 2: The competing HyperBF Networks used for
gender classification
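
The decision rule of the two competing networks can be sketched as follows. This is only an illustration under simplifying assumptions: each network is reduced to a single Gaussian unit with a diagonal metric, and the two-component prototypes and weights are invented placeholder values, not the sixteen-feature prototypes learned in the experiments.

```python
import numpy as np

def unit_response(x, prototype, weights):
    """Gaussian unit with diagonal metric: exp(-sum_i (w_i * (x_i - t_i))^2)."""
    d = weights * (np.asarray(x, dtype=float) - prototype)
    return float(np.exp(-d @ d))

def classify(x, male_net, female_net):
    """Assign the gender of the network giving the greatest response."""
    male = unit_response(x, *male_net)
    female = unit_response(x, *female_net)
    return "male" if male > female else "female"

# Hypothetical (prototype, weights) pairs for a two-feature description.
male_net = (np.array([0.2, 0.8]), np.array([1.0, 0.5]))
female_net = (np.array([0.7, 0.3]), np.array([1.0, 0.5]))

label_at_male_prototype = classify([0.2, 0.8], male_net, female_net)
label_at_female_prototype = classify([0.7, 0.3], male_net, female_net)
```

Each unit responds with value 1 only at its own prototype, so a vector is labelled according to the prototype it is closest to in the learned metric.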

The  database  used  for  the  classification  experiments 
comprised  168  vectors  equally  distributed  over  21  males 
and  21  females.  Three  different  performances  were  mea¬ 
sured: 

•  on  the  vectors  of  the  training  set  (90%  correct); 

•  on  novel  faces  of  people  in  the  training  set  (86% 
correct); 

•  on  faces  of  people  not  represented  in  the  training 
set  (79%  correct) 

The performance on a testing set, having null intersection
with the training set, has been estimated with
a leave-one-out strategy. Having n available examples,
training was done on the first n − 1 data, leaving the
last one for testing. The data set was then rotated, so
that each of the available examples was used in turn as
a testing example. The performance was estimated by
taking the percentage of correct gender assessments on
the resulting tests. The performance obtained is 79%
correct classification.
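
The leave-one-out strategy described above can be written generically. The sketch below is an assumption-laden illustration: it uses a toy nearest-mean classifier on one-dimensional data in place of the HyperBF networks, and the `train` and `predict` callables are hypothetical interfaces introduced only for this example.

```python
def leave_one_out_accuracy(examples, labels, train, predict):
    """Train on all examples but one, test on the held-out one, rotate."""
    correct = 0
    n = len(examples)
    for i in range(n):
        train_x = examples[:i] + examples[i + 1:]
        train_y = labels[:i] + labels[i + 1:]
        model = train(train_x, train_y)
        if predict(model, examples[i]) == labels[i]:
            correct += 1
    return correct / n

def train_nearest_mean(xs, ys):
    """Toy stand-in classifier: store the mean of each class."""
    means = {}
    for label in set(ys):
        members = [x for x, y in zip(xs, ys) if y == label]
        means[label] = sum(members) / len(members)
    return means

def predict_nearest_mean(means, x):
    """Return the label whose class mean is closest to x."""
    return min(means, key=lambda label: abs(x - means[label]))

# Illustrative one-dimensional data with two well separated classes.
data = [0.0, 0.1, 0.2, 1.0, 1.1, 1.2]
classes = ["a", "a", "a", "b", "b", "b"]
accuracy = leave_one_out_accuracy(data, classes,
                                  train_nearest_mean, predict_nearest_mean)
```

Because every example is held out exactly once, the estimate uses all available data for testing without ever testing on an example seen during training.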

Human performance in such classification tasks (as
well as recognition) is widely believed to be nearly perfect.
To assess the effective ability of people in gender
classification we have performed some informal psychophysical
experiments using as stimulation pattern a grey
level image of the face from which the local average was
subtracted (to make the different images as similar as
possible). As Figure 3 shows, no hair information was
available (residual facial hair was masked out).

The database of stimuli was then presented one image
after another on a computer screen and the subject was
asked to press M for male and F for female without any
time constraint. The results were surprising: an average
score of 90% correct classification (on 17 subjects,
some of whom were familiar with a large subset of the
people represented in the database). Classification
performance was not impaired by the lack of familiarity with
the database people. Informal chat with some of the
subjects revealed that, at least consciously, eyebrow
information was considered to be the most discriminating.

TOP. Feature weights for gender classification as computed by
the HyperBF Networks. MIDDLE. The male prototype (left)
and the female prototype (right) as synthesised by the
HyperBF Networks with movable coefficients, centers and metric.
The darker the feature, the more important it is according
to the corresponding entries in the diagonal metric W.
BOTTOM. Average male face (left) and female face (right).

Figure 3: Typical stimuli used in the experiments of human
gender classification

Note that no hair information has been used in either
our human or our computer gender classification
experiments. This must be considered if these results are
to be compared with other experiments reported in the
literature (see for example [6], where images included limited
hair information).

3  Conclusion 

Gender classification has been attempted using two competing
HyperBF networks trained on a geometrical description
of (fe)male faces. The resulting performance
was 79% correct classification (averaged on males and
females) and should be compared with a human performance
of 90%. Analysis of the internal representation
of the HyperBF networks shows that the networks have
been able to effectively prototype (fe)male faces and that
classification was achieved using a subset of the available
features, similar to human strategies of gender classification.

Abstract

Navigation using maps requires establishing a
match between locations in the environment
and locations on a map (localization). Viewpoint
can be determined by using the visual
angles between three features. However, error
in estimate of visual angles produces error in
localization. Depending on feature configuration,
the same error in visual angle estimate can result
in very different errors in localization. We
show how "areas of uncertainty" change as
configuration shape and relation to observer are
modified and describe properties of configurations
which can be used to predict error in localization.
Two basic conditions have been identified
which affect area size and shape.

1  Introduction 

Visual  localization  often  must  be  based  on  the  apparent 
position  of  features  in  the  environment  (landmarks).  It  is 
frequently  the  case  that  actual  distance  to  landmarks  is 
unknown  and  viewpoint  must  be  determined  from  visual 
bearings  alone.  A  number  of  relatively  simple  methods 
for  solving  this  problem  are  available.  However,  visual 
bearings  will  never  be  completely  accurate.  As  a  result, 
it  is  important  to  have  an  understanding  of  the  errors 
which  can  develop  when  using  such  methods.  A  poste¬ 
riori  analysis  can  predict  the  precision  of  an  estimated 
viewpoint  given  expectations  about  the  errors  associated 
with  the  determination  of  visual  bearings.  Perhaps  even 
more  importantly,  a  priori  analysis  can  be  used  to  choose 
feature  configurations  which  are  least  sensitive  to  error, 
thus  yielding  the  most  reliable  localizations. 

The best known method for doing localization is
triangulation, in which relatively simple trigonometric
operations are used to determine viewpoint given the absolute
bearings to two or more landmarks in known positions.
However, absolute bearings are not always available
and can be unreliable in many environments. For
example, magnetic conditions in an area will affect compass
readings. In such situations, alternative methods
must be employed. It has been shown [Levitt et al.,
1987] that the visual angle between two landmarks constrains
viewpoint to a closed torus-like surface. If a
two-dimensional approximation of the environment is
assumed, viewpoint is constrained to the boundary of
a double circle (see Figure 1). When the landmarks
can be ordered, viewpoint is restricted to one loop of
that double circle. It follows that visual angles between
three landmarks will constrain viewpoint to the intersection
of three circles. Except in one special circumstance
[Sutherland and Thompson, 1992], this uniquely determines
the viewpoint. In this paper, we analyze the errors
associated with this form of localization when the three
landmarks are included within a visual angle of < 180°.
Subsequent work will deal with errors which occur when
the observation point is in the middle of a configuration
of landmarks.

*This work was supported by National Science Foundation
grant IRI-9196146, with partial funding from the Defense
Advanced Research Projects Agency.


Figure  1:  Angle  from  observer  at  Ob  to  landmarks  A  and 
B  is  the  same  from  any  location  on  the  double  circle. 
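
The double circle of Figure 1 follows from the inscribed angle theorem: a chord of length d subtends the same angle α from every point of a circle of radius d / (2 sin α) passing through its endpoints. The sketch below computes the two candidate circles for a pair of landmarks; the function names and the example coordinates are illustrative assumptions, not taken from the paper.

```python
import math

def viewpoint_circles(A, B, alpha):
    """Radius and the two candidate centers of the circles on which
    chord AB subtends the visual angle alpha (in radians)."""
    ax, ay = A
    bx, by = B
    chord = math.hypot(bx - ax, by - ay)
    radius = chord / (2.0 * math.sin(alpha))
    offset = chord / (2.0 * math.tan(alpha))   # midpoint-to-center distance
    mx, my = (ax + bx) / 2.0, (ay + by) / 2.0
    nx, ny = -(by - ay) / chord, (bx - ax) / chord  # unit normal to AB
    centers = [(mx + offset * nx, my + offset * ny),
               (mx - offset * nx, my - offset * ny)]
    return radius, centers

def visual_angle(P, A, B):
    """Unsigned angle at P between the directions to A and B."""
    a1 = math.atan2(A[1] - P[1], A[0] - P[0])
    a2 = math.atan2(B[1] - P[1], B[0] - P[0])
    d = abs(a1 - a2)
    return min(d, 2.0 * math.pi - d)

# Landmarks one unit apart, observer one unit from B: the angle is 45 degrees.
A, B, observer = (-1.0, 0.0), (0.0, 0.0), (0.0, -1.0)
alpha = visual_angle(observer, A, B)
radius, centers = viewpoint_circles(A, B, alpha)
distances = [math.hypot(c[0] - observer[0], c[1] - observer[1]) for c in centers]
```

One of the two centers lies at distance `radius` from the observer, which is the loop of the double circle selected once the landmarks can be ordered.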

We demonstrate a methodology for determining areas
of  uncertainty  in  localization  given  particular  configura¬ 
tions  of  landmarks  and  errors  in  visual  angle  estimates  to 
those  landmarks.  We  then  show  the  effects  on  localiza¬ 
tion  of  different  configurations  of  landmarks  when  iden¬ 
tical  errors  in  estimate  of  visual  angles  are  made.  This 
sensitivity  to  error  in  visual  angle  estimate  will  vary  con¬ 
siderably  among  configurations.  The  significant  result 
is  that  feature  configuration  can  dramatically  affect  the 
precision  with  which  localization  is  accomplished.  We 
are  now  able  to  specify  criteria  that  aid  in  the  selection 
of the best sets of features on which to base localization.

2 The Area of Uncertainty

Levitt et al. [Levitt et al., 1987] point out that exact
knowledge of visual angle between two landmarks constrains
viewpoint to a partial circle which passes through
the landmarks and that an error in visual angle estimate
will constrain viewpoint to a thickened ring as shown in
Figure 2. The thickness of the ring is determined by the
amount of error in angle estimate. See [Sutherland and
Thompson, 1992] for details of the computation.

Figure 2: Error in visual angle constrains viewpoint to a
thickened ring. A and B are the landmarks. The actual
viewpoint, Ob, lies on the light circle.

Figure 3: The dark lines surround the areas of uncertainty
for errors of 10% and 30% in visual angle estimate.

When three landmarks are used, any given error in
estimate constrains the viewpoint to the intersection of
two rings.¹ In Figure 3, let α be the visual angle from
the observer subtended by chord AB and β be the visual
angle subtended by chord BC. The dark lines show the
areas of uncertainty when the landmarks are in a straight
line, the distance between landmarks equals the distance
from the observer to the center landmark and both α and
β measure 45°. The inner area represents an error less
than or equal to ±4.5° or ±10% in both α and β. The
outer area results when there is an error less than or
equal to ±13.5° or ±30%.

Not only were angles α and β in Figure 3 identical in
measure, but the error amount in each was the same. An
observer will not necessarily make the same error in each
angle measure. Let γ be the angle from the observer
subtended by chord AC. Figure 4a shows the area of
uncertainty for the 30% error shown in Figure 3. The
area of uncertainty for angle γ is surrounded by dashed
lines.

¹A third ring passing through the two landmarks lying at
greatest distance from each other can be computed, but it
does not affect area size.


Figure 4: a) Dark lines surround the area of uncertainty
for a 30% error with the same configuration as in Figure 3.
Dashed lines surround the error area for angle γ.
b) Dark lines surround the area of uncertainty for 40%
error in α and 20% error in β. c) Error in estimate of
angle α is 60%. Angle β is estimated perfectly. The result
is a 30% error in estimate of γ. Dashed lines surround
the error area for α.
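
The percentages quoted for Figure 4 can be checked with a few lines of arithmetic. The sketch assumes, as in the configuration of Figure 3, that γ = α + β and that a percentage error scales the angle it applies to, so that the error in γ is the angle-weighted average of the errors in α and β; the function names are hypothetical.

```python
def gamma_error(alpha, beta, alpha_e, beta_e):
    """Relative error in gamma = alpha + beta, given the relative errors
    alpha_e and beta_e in the two component angles."""
    gamma = alpha + beta
    return (alpha / gamma) * alpha_e + (beta / gamma) * beta_e

def alpha_error(alpha, beta, gamma_e, beta_e):
    """Relative error in alpha implied by given errors in gamma and beta."""
    gamma = alpha + beta
    return (gamma / alpha) * gamma_e - (beta / alpha) * beta_e

# Configuration of Figure 3: both visual angles measure 45 degrees.
case_b = gamma_error(45.0, 45.0, 0.40, 0.20)   # Figure 4b
case_c = gamma_error(45.0, 45.0, 0.60, 0.00)   # Figure 4c
recovered = alpha_error(45.0, 45.0, 0.30, 0.20)
```

Both cases give an error of 30% in γ, so the areas of Figures 4b and 4c lie inside the same dashed ring as the area of Figure 4a.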

Figure 4b shows the area of uncertainty for an error
in $\alpha$ of 40% and error in $\beta$ of 20%. Figure 4c shows the
area of uncertainty for a 60% error in $\alpha$ and a perfect
estimate of $\beta$. Note that the area is reduced to an arc
of the circle passing through B and C. In general, if the
error is additive with $\gamma_e$ the error in $\gamma$, for any given $\gamma_e$,
$\gamma_e = \alpha_e + \beta_e$ implies that $\alpha_e = \gamma_e - \beta_e$ for all $\beta_e$ such
that $0 \le \beta_e \le \gamma_e$. If the error is multiplicative,
$\gamma_e = \frac{\alpha}{\gamma}\alpha_e + \frac{\beta}{\gamma}\beta_e$ implies that
$\alpha_e = \frac{\gamma}{\alpha}\gamma_e - \frac{\beta}{\alpha}\beta_e$
for all $\beta_e$ such that $0 \le \beta_e \le \frac{\gamma}{\beta}\gamma_e$. In all cases, the
resulting area of uncertainty equals the intersection of
the two (possibly thickened) rings corresponding to the
error in estimate of angles $\alpha$ and $\beta$. This intersection
will always lie within the thickened ring corresponding
to the error in $\gamma$, with the relationship of $\gamma_e$ to $\alpha_e$ and
$\beta_e$ as given in the above equations.²

3  Size  of  Area 

Three  factors  affect  the  size  of  the  area  of  uncertainty: 
amount  of  error  in  angle  estimate,  relative  distance  of 
observer  to  configuration  and  configuration  shape. 

Area size will increase with increased error. As
previously described [Levitt et al., 1987] each pair of
landmarks  produces  a  landmark-pair-boundary  (LPB). 
These  LPB’s  divide  the  plane  into  orientation  regions. 
Crossing  into  a  different  orientation  region  changes  the 
ordering  of  the  landmarks.  Since  landmark  order  has 
been  determined,  it  can  be  assumed  that  an  LPB  will 
not  be  crossed.  Thus,  the  LPB  puts  a  bound  on  the  size 
of  the  area  in  the  direction  of  overestimation  of  visual 
angle.  However,  there  is  no  bound  on  how  far  back  the 
observer  can  be  located,  causing  total  possible  area  of 
uncertainty  due  to  amount  of  error  in  angle  estimate  to 
be  unbounded. 
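
The growth of the area with the error can be explored by brute force. The sketch below is not the computation used by the authors (they refer to [Sutherland and Thompson, 1992] for that); it simply samples a grid around the observer and counts the cells whose visual angles to the landmark pairs AB and BC fall within the given relative error. The grid extent `half_width`, the cell size `step` and the restriction to the half plane y < 0 are assumptions suited to a straight line configuration placed on the x axis.

```python
import math

def visual_angle(P, A, B):
    """Unsigned angle at P between the directions to A and B."""
    a1 = math.atan2(A[1] - P[1], A[0] - P[0])
    a2 = math.atan2(B[1] - P[1], B[0] - P[0])
    d = abs(a1 - a2)
    return min(d, 2.0 * math.pi - d)

def area_of_uncertainty(landmarks, observer, rel_error,
                        half_width=3.0, step=0.02):
    """Grid estimate of the area in which both visual angles stay within
    rel_error (e.g. 0.3 for 30%) of the values seen by the observer."""
    A, B, C = landmarks
    alpha = visual_angle(observer, A, B)
    beta = visual_angle(observer, B, C)
    n = int(round(2.0 * half_width / step))
    count = 0
    for i in range(n):
        x = observer[0] - half_width + (i + 0.5) * step
        for j in range(n):
            y = observer[1] - half_width + (j + 0.5) * step
            if y >= 0.0:
                continue  # stay on the observer's side of the landmark line
            P = (x, y)
            if (abs(visual_angle(P, A, B) - alpha) <= rel_error * alpha and
                    abs(visual_angle(P, B, C) - beta) <= rel_error * beta):
                count += 1
    return count * step * step

# Straight line configuration of Figure 3: landmarks one unit apart,
# observer one unit from the center landmark.
landmarks = ((-1.0, 0.0), (0.0, 0.0), (1.0, 0.0))
observer = (0.0, -1.0)
area_10 = area_of_uncertainty(landmarks, observer, 0.10)
area_30 = area_of_uncertainty(landmarks, observer, 0.30)
```

The estimate for a 30% error is larger than that for a 10% error, in line with the nested areas of Figure 3. The grid has to be enlarged as the allowed error grows, since the region extends further away from the landmarks.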


Figure 5: Size of area of uncertainty as observer moves
away from the configuration. Curves are plotted for 70%,
50% and 30% error.

Relative distance of observer to configuration will also
affect area size.³ Figure 5 shows the change in size of
the area of uncertainty for a straight line configuration
with landmarks one unit apart and observation point
lying on the perpendicular bisector of the line joining
the landmarks.

The third parameter to consider is the shape of the
configuration. Thus far the configuration has been held
constant with all three landmarks in a line and equally
spaced. Comparative distance between straight line
landmarks and angular relationship between non-linear
landmarks will both affect area size.

²Although all graphs in this section show a straight line
configuration of landmarks, the described conditions also
hold for nonlinear configurations.

³Because the visual angles alone are used as a measure,
distance is relative (e.g. a distance of 1000 feet from the
observer to a straight line configuration with landmarks
located 1000 feet apart is considered to be the same as a
distance of 4000 feet with the landmarks 4000 feet apart).

In  Figure  6,  dark  lines  surround  the  area  of  uncertainty 
resulting  with  a  30%  error  in  visual  angle  for  the  straight 
line  configuration  with  landmarks  A,  B  and  C  one  unit 
apart  and  the  observer  five  units  from  the  configuration. 
The  resulting  areas  with  landmark  C'  moved  away  from 
C  are  surrounded  by  dashed  lines.  C'  is  2  units  from  B 
on  the  left  and  4  units  away  on  the  right.  In  both  figures, 
the  dark  and  dashed  lines  coincide  where  the  boundary 
is  determined  by  the  error  ring  for  the  circle  through  A 
and  B.  A  skewness  develops  on  the  boundary  determined 
by  the  error  ring  for  the  circle  through  B  and  C.  The  area 
of  uncertainty  becomes  smaller  as  C'  moves  away  from 
C. 


Figure  6:  Both  graphs  show  an  error  of  30%  in  visual  an¬ 
gle  with  observer  5  units  from  configuration.  The  dark 
lines  surround  the  area  of  uncertainty  for  the  ABC  con¬ 
figuration.  The  dashed  lines  surround  the  area  for  the 
ABC'  configuration. 


The  first  step  in  analyzing  how  change  in  angular  re¬ 
lationship  between  landmarks  affects  the  size  of  the  area 
of  uncertainty  is  to  consider  the  LPB’s  and  resulting 
orientation  regions.  When  all  three  landmarks  lie  on  a 
straight  line,  the  observer  is  constrained  to  a  half  plane. 
When  the  landmarks  are  not  in  a  straight  line,  the  three 
LPB’s  create  seven  orientation  regions,  as  shown  in  Fig¬ 
ure  7.  The  observer  is  constrained  to  one  of  those  re¬ 
gions.  The  assumption  will  be  made  that  the  observer 
is  not  in  the  center  region.  Note  that  regardless  of  error 
size,  the  observer  cannot  move  out  of  the  orientation  re¬ 
gion.  The  area  of  uncertainty  will  always  be  within  that 
region. 


Figure 7: The LPB's in a non-linear configuration restrict
the observer to one of 7 orientation regions.

If the locations of landmarks A and C are fixed and
landmark B is moved along a line equidistant from A
and C, the largest area of uncertainty for any given error
amount will occur when B lies on the same circle as
A,  C  and  the  observer.  Although  the  navigator  cannot 
tell  by  landmark  map  positions  alone  if  the  single  circle 
condition  exists,  it  can  be  ruled  out  if  no  part  of  the 
circle  through  the  three  landmarks  lies  in  the  observer’s 
orientation  region.  This  holds  if,  for  example,  the  center 
landmark  is  closer  to  the  observer  than  the  other  two. 
See  [Sutherland  and  Thompson,  1992]  for  further  analy¬ 
sis  of  the  single  circle  configuration. 


Figure  8:  The  heavy  dark  lines  surround  the  area  of 
uncertainty  with  30%  error  in  estimate  of  visual  angle, 
landmarks  A,  B  and  C  one  unit  apart  and  the  observer 
5  units  away.  The  dashed  lines  in  a)  surround  the  area 
of  uncertainty  resulting  when  B'  is  one  unit  closer  to 
the  observer  than  B.  The  dashed  lines  in  b)  surround 
the  area  of  uncertainty  resulting  when  B'  is  one  unit 
further  away.  The  error  circles  are  black  for  the  linear 
configuration  and  grey  for  the  non-linear  configuration. 

Figure  8  shows  the  area  of  uncertainty  for  a  30%  er¬ 
ror  and  the  same  basic  configuration  as  in  Figure  6 
(o  =  /?  =  11.3®).  Landmark  B'  is  one  unit  closer  to 
the  observer  in  Figure  8a  and  one  unit  further  away  in 
Figure  8b.  Although  both  show  a  decrease  in  area,  the 
resulting  area  in  Figure  8a  is  significantly  smaller.  Thus, 
the  nonlinear  configuration  in  Figure  8a  is  the  least  sen¬ 
sitive  to  error  and  would  produce  the  most  precise  lo¬ 
calization.  Note  that  the  length  of  the  chord  lying  on 
the axis of symmetry does not change as landmark B is
moved.  For  an  observer  facing  the  configuration,  change 
in  area  is  lateral  only. 
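The comparison in Figure 8 can be approximated numerically. The sketch below is illustrative only: it assumes the area of uncertainty is the set of positions whose two visual angles each fall within ±30% of the true angles, places the landmarks at hypothetical coordinates A = (-1, 0), B = (0, 0), C = (1, 0) with the observer at (0, -5), and estimates the area by counting grid cells inside a fixed search box. The function names, the search box, and the grid step are assumptions, not part of the original analysis.

```python
import math

def signed_angle(observer, p, q):
    """Signed visual angle (radians) from landmark p to landmark q, seen from observer."""
    ux, uy = p[0] - observer[0], p[1] - observer[1]
    vx, vy = q[0] - observer[0], q[1] - observer[1]
    return math.atan2(ux * vy - uy * vx, ux * vx + uy * vy)

def uncertainty_area(a, b, c, observer, error=0.30, step=0.05):
    """Grid estimate of the area where both visual angles stay within +/-error.

    The search box x in [-8, 8], y in [-10, 0] is an assumption that is large
    enough for the configurations used in the example.
    """
    alpha0 = signed_angle(observer, a, b)
    beta0 = signed_angle(observer, b, c)
    lo, hi = 1.0 - error, 1.0 + error
    count = 0
    for i in range(round(16 / step)):
        x = -8.0 + (i + 0.5) * step
        for j in range(round(10 / step)):
            y = -10.0 + (j + 0.5) * step
            # A sign change makes the ratio negative, which excludes positions
            # where the landmarks would appear in a different left-to-right order.
            ra = signed_angle((x, y), a, b) / alpha0
            rb = signed_angle((x, y), b, c) / beta0
            if lo <= ra <= hi and lo <= rb <= hi:
                count += 1
    return count * step * step

if __name__ == "__main__":
    A, B, C, OBS = (-1.0, 0.0), (0.0, 0.0), (1.0, 0.0), (0.0, -5.0)
    print("alpha (deg):", math.degrees(abs(signed_angle(OBS, A, B))))
    print("linear area:", uncertainty_area(A, B, C, OBS))
    print("B' closer  :", uncertainty_area(A, (0.0, -1.0), C, OBS))
```

Under these assumptions the estimate for the configuration with B' one unit closer should come out smaller than for the linear configuration, in line with Figure 8a.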

4  Shape  of  Area 

Two shape properties must be considered. The first is
symmetry. The second is eccentricity. If the configuration
is symmetric, the observer is located on a line of
symmetry of that configuration and the error in visual
angle estimate is the same for both angle α and angle β,
then the area of uncertainty will also be symmetric. If
any of those properties does not hold, the area will not be
symmetric.


Figure  9:  Shape  of  area  of  uncertainty  is  skewed  when 
observer  is  moved  off  the  line  of  symmetry  of  the  config¬ 
uration. 

Figure  4  showed  how  unequal  error  in  angle  estimate 
affected  shape.  In  Figure  6,  landmark  C  was  pulled  away 
from  the  configuration.  The  area  decrease  resulted  in  an 
asymmetric  shape.  Figure  9  shows  how  the  area  shape  is 
skewed  when  the  observer  is  moved  off  the  line  of  sym¬ 
metry  of  the  configuration. 

The  same  three  factors  which  affect  the  size  of  the  area 
of  uncertainty  (amount  of  error  in  visual  angle  estimate, 
relative  distance  of  observer  to  configuration  and  config¬ 
uration  shape)  also  affect  the  eccentricity.  Future  work 
will  include  analyzing  how  the  conditions  described  in 
the  next  section  relate  to  an  area’s  eccentricity. 

5  Conditions  Affecting  Sensitivity 

We  have  identified  two  basic  conditions  which  affect  the 
sensitivity  of  a  configuration.  The  first  is  the  rate  of 
change  of  visual  angle  measure  as  the  observer  moves  in 
the  environment.  The  second  is  the  rate  of  change  of  the 
ratio of angles α and β as the observer moves.

Sections  3  and  4  contained  several  examples  of  the 
first  condition.  The  navigator  in  Figure  8a  needn’t  move 
far  from  the  actual  observation  point  to  exit  the  area 
of  uncertainty  in  the  non-linear  configuration  because 
visual  angles  change  at  a  much  greater  rate  than  they 
do with the linear configuration. The size of an area of
uncertainty  is  based  on  this  first  condition.  If  the  rate  of 
change  of  visual  angle  with  respect  to  viewpoint  is  high 
at  a  given  location,  a  moving  observer  will  leave  the  area 
quickly.  If  the  rate  of  change  is  low,  a  significant  distance 
could  be  traveled  before  the  area  boundary  is  reached. 

In Figure 8, angles α and β have increased rates of
change in the modified configurations. Figure 10 shows
two three-dimensional graphs. Visual angle α is represented
by the height of the surface. The rate of change of
α is greatest where surface slope is steepest. An observer
should make less error in localization when situated at
a point of steep slope than at a point where the slope is
shallow.
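The rate of change of a visual angle with respect to viewpoint can be estimated by finite differences. The following is a minimal sketch, assuming hypothetical coordinates for the Figure 8 layout (A at (-1, 0), observer at (0, -5), and the center landmark at (0, 0), (0, -1) or (0, 1)); the function names and the step size h are illustrative assumptions.

```python
import math

def visual_angle(observer, p, q):
    """Unsigned visual angle (radians) between landmarks p and q seen from observer."""
    ux, uy = p[0] - observer[0], p[1] - observer[1]
    vx, vy = q[0] - observer[0], q[1] - observer[1]
    return abs(math.atan2(ux * vy - uy * vx, ux * vx + uy * vy))

def angle_gradient_magnitude(observer, p, q, h=1e-5):
    """Central-difference estimate of the slope of the visual-angle surface."""
    x, y = observer
    d_dx = (visual_angle((x + h, y), p, q) - visual_angle((x - h, y), p, q)) / (2 * h)
    d_dy = (visual_angle((x, y + h), p, q) - visual_angle((x, y - h), p, q)) / (2 * h)
    return math.hypot(d_dx, d_dy)

if __name__ == "__main__":
    A, OBS = (-1.0, 0.0), (0.0, -5.0)
    for label, b in (("linear", (0.0, 0.0)), ("closer", (0.0, -1.0)), ("further", (0.0, 1.0))):
        print(label, angle_gradient_magnitude(OBS, A, b))
```

With these coordinates the slope is steepest when the center landmark is one unit closer to the observer, which matches the smaller area of uncertainty reported for that configuration.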

Figure 10: Visual angle α is represented by the height of
the surface. Surface height is 0 at the LPB and outside of
the orientation region. Distance is units to the furthest
landmark. Landmarks are 4 units apart.

The second condition, the comparison of angle measure,
also affects sensitivity. If, for example, an observer
is located at a position such that angle α appears to be
twice as large as angle β, it is unlikely that position will
be estimated to be at a location where the visual angles
appear to be equal. In the graphs of Figure 11, the ratio
is represented by the height of the surface. The
maximum rate of change occurs at the points of steepest
slope. An observer at those points should make less
error in localization than if at a point of shallow slope.

To  summarize,  the  rates  of  change  of  the  visual  angles 
and  their  ratio  in  an  area  around  actual  observer  loca¬ 
tion  depend  on  configuration  shape  and  location  of  the 
observer with respect to the configuration. Large rates
of  change  result  in  less  sensitivity  to  error. 

6  Conclusions 

At  this  point  in  our  analysis: 

•  We  can  determine  size  and  shape  of  the  area  of  un¬ 
certainty  for  any  configuration  of  three  landmarks, 
observer  position  and  error  in  visual  angle. 

•  We  are  able  to  predict  how  that  area  will  change  in 
size  and  shape  as  configuration  shape  changes. 

•  We  have  isolated  properties  which  can  be  used  to 
predict  magnitude  and  type  of  errors  in  localization. 

•  We  have  identified  two  basic  conditions  which  are  af¬ 
fected  by  changes  in  the  configuration  and,  in  turn, 
affect  size  and  shape  of  the  area  of  uncertainty. 


Figure 11: Visual angle ratio α/β is represented by the
height  of  the  surface.  Height  is  0  outside  of  the  orienta¬ 
tion  region.  Distance  is  units  to  the  furthest  landmark. 

Future  work  will  include: 

•  Determining  what  properties  can  be  used  for  local¬ 
ization  if  the  observer  is  in  the  center  of  the  config¬ 
uration. 

•  Analyzing  how  the  rate  of  change  of  angle  ratio  re¬ 
lates  to  the  eccentricity  of  the  area  of  uncertainty. 

•  Defining  a  measure  of  sensitivity  for  any  given  con¬ 
figuration,  including  a  weighting  of  the  properties 
which  affect  area  size  and  shape. 
Abstract

Landmark-based  navigation  is  a  problem  solv¬ 
ing  activity  in  which  correspondences  must  be 
established  between  map  and  view  features, 
and  then  current  location  determined  from 
these  correspondences.  A  variety  of  high-level 
strategies  are  important  in  order  to  control 
combinatorial  complexity  and  minimize  diffi¬ 
culties  due  to  problems  in  accurately  recogniz¬ 
ing  different  features.  These  strategies  organize 
features  into  distinctive  configurations,  control 
the  order  in  which  correspondences  are  estab¬ 
lished,  and  regulate  the  manner  in  which  view¬ 
point  hypotheses  are  generated  and  validated. 

1  Introduction 

Localization  is  an  essential  aspect  of  almost  any  map- 
based  navigation  activity.  Localization  involves  the  de¬ 
termination  of  the  current  viewpoint  given  one  or  more 
views,  information  from  a  map,  and  if  available,  a  his¬ 
tory  of  past  movement  and  observations.  While  the  im¬ 
age  understanding  community  has  studied  the  problem 
of  matching  maps  and  aerial  imagery  in  some  depth  (e.g., 
[McKeown  and  Denlinger,  1984]),  less  attention  has  been 
paid  to  localization  problems  involving  outdoor,  ground- 
level  viewpoints.  Much  of  the  ground-level  work  that  has 
been  done  has  been  directed  at  lower-level  problems  such 
as road following (e.g., [Thorpe et al., 1988]).

Landmark-based  localization  establishes  correspon¬ 
dences  between  visually  distinct  landmarks  and  topo¬ 
graphically  or  culturally  distinct  map  features  and  then 
infers  a  viewpoint  based  on  the  geometric  constraints 
imposed  by  these  correspondences.  Work  done  on  in¬ 
door  mobile  robotics  has  tended  to  assume  that  land¬ 
marks  could  be  uniquely  identified  (e.g.,  [Chatila  and 
Laumond,  1985]).  Outdoors,  topographic  features  are 
not  so  easily  distinguished.  Difficulties  arise  due  to  am¬ 
biguity  in  establishing  correspondences  and  combinato¬ 
rial  complexity  due  to  the  large  number  of  potentially 
relevant  features  that  are  often  present. 

We  have  previously  argued  that  localization  and 
feature-based object recognition have a common computational
form [Thompson et al., 1990]. In object recognition,
various forms of heuristic search have proven to
be quite powerful in controlling combinatorics [Grimson,
1990, Grimson, 1991]. These same optimizations are relevant
to localization. In addition, unique aspects of the
outdoor  localization  problem  allow  the  use  of  powerful 
high-level  problem  solving  strategies  to  guide  the  estab¬ 
lishment  of  correspondences  and  the  estimation  and  ver¬ 
ification  of  viewpoints. 

2  Strategies 

Computational  analysis,  computer  simulations,  and  ex¬ 
periments  done  with  expert  map  users  all  point  to¬ 
wards  a  small  set  of  strategies  being  critical  to  the  so¬ 
lution  of  difficult  localization  problems.  These  strate¬ 
gies  are  relevant  to  localization  approaches  which  op¬ 
erate  by  establishing  correspondences  between  features 
in  the  view  and  on  the  map,  using  these  feature  cor¬ 
respondences  to  hypothesize  one  or  more  possible  view¬ 
points,  and  then  using  some  sort  of  verification  process  to 
evaluate viewpoint hypotheses [Thompson et al., 1990,
Smith et al., 1991].

Concentrate  on  the  view  first. 

Localization  problem  solving  should  be  initiated  by  a 
period  of  general  visual  reconnaissance,  focused  on  the 
terrain  view  in  preference  to  the  map.  In  general,  avail¬ 
able  maps  cover  an  area  much  larger  than  can  be  seen. 
As  a  result,  the  majority  of  map  features  will  not  be  rele¬ 
vant  to  any  particular  viewpoint  determination,  whereas 
most  of  the  distinctive  visual  features  will  have  corre¬ 
spondences  on  the  map.  One  of  the  major  sources  of 
difficulty  in  object  recognition  is  “clutter”  consisting  of 
image features not associated with the object of interest.
Matching is therefore usually driven by object models
which contain only features relevant to a particular object
(e.g., [Grimson, 1990]). In localization, the clutter
is  in  the  “model”  (i.e.,  map).  As  a  result,  the  search  for 
feature  correspondences  should  start  from  the  view. 

Landmark features should be organized into configurations.

Most  outdoor  environments  have  a  large  number  of 
terrain  features  that  are  potentially  relevant  as  land¬ 
marks.  As  a  result,  the  combinatorial  complexity  associ¬ 
ated  with  establishing  possible  correspondences  between 
map  and  view  features  is  large.  This  complexity  can 
be reduced by first assembling small groups of nearby
features into configurations before correspondence is attempted.
Not only is the number of landmarks reduced,
configurations are more likely to be distinctive than individual
features, reducing the ambiguity in matching.
A  particularly  useful  approach  is  to  assemble  configura¬ 
tions  formed  along  the  line  of  sight.  Such  configurations 
have  the  viewpoint  independent  property  that  they  are 
necessarily  organized  in  a  linear  fashion  on  the  map  as 
well  as  in  the  view.  In  addition,  they  have  the  viewpoint 
dependent  property  that  once  found  on  the  map,  they 
constrain the viewpoint to a line. When possible, several
such  configurations  surrounding  the  viewpoint  should  be 
found.  If  this  is  not  possible,  then  features  which  are  at 
least  connected  to  one  another  should  be  used  in  devel¬ 
oping  configurations,  since  these  are  far  easier  to  find  on 
a  map. 
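The line constraint from line-of-sight configurations can be illustrated with a small geometric sketch. It assumes two such configurations have already been located on the map, each reduced to two map points; the viewpoint is then estimated as the intersection of the two lines. The function name and the coordinates in the usage note are hypothetical.

```python
def line_intersection(p1, p2, q1, q2):
    """Intersection of the line through p1, p2 with the line through q1, q2.

    Returns None when the lines are parallel, in which case the two
    configurations do not pin down a single viewpoint.
    """
    rx, ry = p2[0] - p1[0], p2[1] - p1[1]
    sx, sy = q2[0] - q1[0], q2[1] - q1[1]
    denom = rx * sy - ry * sx
    if abs(denom) < 1e-12:
        return None
    # Solve p1 + t * r = q1 + u * s for t using 2D cross products.
    t = ((q1[0] - p1[0]) * sy - (q1[1] - p1[1]) * sx) / denom
    return (p1[0] + t * rx, p1[1] + t * ry)
```

For example, a configuration through map points (4, 5) and (6, 7) and another through (2, 6) and (2, 9) intersect at (2, 3), which would be the hypothesized viewpoint.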

Information  about  terrain  at  the  viewpoint  is  important. 

This  may  seem  obvious,  but  in  fact  most  approaches 
to  landmark-based  robot  navigation  do  not  pay  special 
attention  to  local  features  of  the  immediate  environment. 
If  it  is  possible  to  determine  that  an  agent  is  at  or  near  a 
particular  terrain  feature  type,  then  the  viewpoint  is  con¬ 
strained  to  be  at  one  of  the  corresponding  types  on  the 
map.  Determination  of  local  feature  type  is  often  easier 
than  evaluation  of  more  distant  features.  For  example, 
active  range  sensors  can  sense  whether  the  viewpoint  is 
on  a  ridge  but  are  inoperative  over  the  larger  distances 
to  major  landmarks  such  as  distant  peaks. 

Often,  localization  is  possible  by  assembling  a  config¬ 
uration  of  features  that  includes  the  immediate  area  and 
several  nearby  landmarks  with  distinctive  relationships 
to  each  other  and  to  the  viewpoint.  In  such  cases,  a  sim¬ 
ple  search  for  comparable  configurations  in  the  map  can 
generate  a  viewpoint  hypothesis  without  any  need  for 
triangulation  or  more  sophisticated  geometric  reasoning. 

Multiple  hypotheses  need  to  be  generated  and  examined. 

Heuristic  search  aims  at  focusing  analysis  to  quickly 
select  a  viewpoint  hypothesis  that  can  be  evaluated 
against  the  current  view  of  the  environment.  Terrain  fea¬ 
tures  are  highly  ambiguous,  however,  and  so  it  is  difficult 
to  identify  landmarks  with  certainty.  Any  single  view¬ 
point  hypothesis  based  on  a  small  number  of  features 
has  a  high  probability  of  being  incorrect.  In  complex 
terrain,  it  appears  to  be  necessary  to  develop  a  number 
of  different  plausible  hypotheses  for  the  viewpoint  before 
verification  takes  place. 

Hypotheses should be compared using a disconfirmation
strategy.

Validation  of  a  hypothesis  involves  a  comparison  of  the 
actual  view  with  expectations  generated  from  the  map 
based  on  the  presumed  viewpoint.  In  performing  this 
comparison,  it  is  most  important  to  note  expectations 
that  are  not  met.  If  one  clear  mismatch  is  found,  then 
the  associated  hypothesis  should  be  eliminated.  Since 
terrain  tends  to  have  lots  of  features  that  more  or  less 
look  the  same,  validation  based  on  finding  expected  fea¬ 
tures  is  far  less  effective  than  rejecting  hypotheses  when 
expected  features  are  not  found. 
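A disconfirmation filter of this kind can be sketched in a few lines. The example assumes each viewpoint hypothesis comes with a list of expected features generated from the map and that the view has been reduced to a set of observed feature labels; the function name, hypothesis names, and labels are hypothetical.

```python
def surviving_hypotheses(expectations, observed):
    """Eliminate every hypothesis that has at least one unmet expectation.

    expectations: dict mapping hypothesis name -> list of expected feature labels
    observed: set of feature labels actually found in the view
    """
    survivors = []
    for name, expected in expectations.items():
        # all() stops at the first unmet expectation, so a single clear
        # mismatch is enough to reject the hypothesis.
        if all(feature in observed for feature in expected):
            survivors.append(name)
    return survivors

if __name__ == "__main__":
    observed = {"high point", "flat area to north", "lake to north"}
    expectations = {
        "hill-1": ["high point", "flat area to north", "lake to north"],
        "hill-2": ["high point", "pond nearby", "steep long gradient"],
    }
    print(surviving_hypotheses(expectations, observed))
```

Note that a surviving hypothesis is only "not yet rejected"; under this strategy the presence of expected features is not treated as strong confirmation.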

The  ability  to  move  to  alternate  viewpoints  is  important. 


Movement  to  bring  obscured  features  into  view  or  to 
generate  parallax  sufficient  to  gain  distance  estimates 
has  clear  advantages  when  solving  localization  problems. 
Often overlooked, however, is the importance of movement
in verifying hypotheses. As previously mentioned,
recognizing terrain features when you are standing on
them is usually easier than recognizing them from a
distance. When possible, viewpoint hypotheses should
be  used  to  generate  expectations  about  nearby  features 
which  can  then  be  confirmed  or  disconfirmed  by  moving 
to  the  predicted  location  of  the  feature. 

3  Expert  Map  Users 

Experiments  with  expert  map  users  show  the  importance 
of  the  above  strategies.  Experienced  topographic  map 
readers  were  given  difficult  field  problems  and  asked  to 
describe  their  procedures  as  they  attempted  to  work  out 
the  solutions.  Their  verbal  protocols  were  recorded  and 
analyzed  to  provide  a  detailed  description  of  the  local¬ 
ization  process.  Since  the  ability  to  judge  slope  and  dis¬ 
tance  is  commonly  thought  to  be  central  to  using  topo¬ 
graphic  maps,  followup  investigations  of  people’s  ability 
to  judge  these  properties  were  also  conducted. 

The  map  readers  who  participated  in  this  study  had 
extensive  professional  or  recreational  experience  using 
topographic  maps.  Included  were  geologists,  wilderness 
guides,  nationally  ranked  orienteers,  and  members  of  the 
military.  The  study  was  done  in  two  areas  of  east- 
central  Minnesota,  in  terrain  characterized  by  rolling 
hills.  There  were  no  distinctive  landmarks  that  would  al¬ 
low  for  a  quick  solution  through  simple  recognition.  The 
problem  was  a  difficult  one,  and  only  12  of  41  subjects  ar¬ 
rived  at  a  correct  solution.  All  of  those  who  successfully 
solved  the  problem  employed  the  first  four  strategies  de¬ 
scribed  in  section  2,  and  many  also  used  movement  to 
assist  in  verification.  One  or  more  of  these  four  strate¬ 
gies  was  absent  in  the  problem  solving  approaches  used 
by  those  who  were  not  able  to  correctly  determine  the 
viewpoint. 

Figure 1 provides a high-level schematic of a portion of
the protocol taken from one of the successful experts. A
detailed analysis of this and other protocols can be found
in [Pick et al., 1992]. Fragments of text from the actual
transcript  illustrate  many  of  the  strategies  described  in 
the  previous  section: 

Concentrating  on  the  view  first  and  then  searching  for 
correspondences  on  the  map. 

“[early  in  the  protocol,  starting  with  the  imme¬ 
diate  area]  All  right,  well  I  noticed  I’m  at  one 
of  the  higher  points  within  this  area,  so  that’s 
important.  [Now  to  the  map]  So  I’m  first  look¬ 
ing  on  the  map,  for  some  higher  points  on  the 
map.  [Next  refine  the  description  of  the  imme¬ 
diate  area]  Now  I’m  just  taking  a  look  around. 

It  seems  like  I’m  at  the  top  of  a  hill  or  nearly 
at  the  peak  of  the  hill.” 

Organizing  features  into  configurations. 

“I’m  at  a  high  point.  Directly  north  is  a  fairly 
flat  area  and  north  of  that  it  gets  steep  and 
then  there’s  a  lake. . .” 


Figure 1: Partial process protocol trace (localization at O'Brien, Explore Condition; each entry is a semantic unit).


Multiple  hypotheses  need  to  be  generated. 

“Um,  for  example  say,  somewhere  here  [points 
on  the  map  to  what  is  in  fact  the  correct  hy¬ 
pothesis]  or  on  a  hill  here  [points  to  an  incorrect 
viewpoint  hypothesis].” 

Comparing hypotheses using a disconfirmation strategy.

“OK, I still kind of like this area [points to correct
hypothesis]. But then I was looking up
on  the  map.  I  also  have  a  high  here  [points  to 
an  incorrect  hypothesis]  with  a  pond  that’s  not 
very  far  at  all.  It  just  doesn’t  seem  to  work  well 
because  there’s  a  fairly  steep  and  long  gradient 
here  before  you  get  to  a  flat  part.  And  I  don’t 
see  that  where  we’re  standing.” 

The  last  quotation  illustrates  the  tendency  of  experts 
to  use  qualitative  rather  than  quantitative  judgments  of 
magnitude  both  when  referring  to  the  map  and  when 
considering  the  terrain.  This  tendency  is  so  striking  it 
led  us  to  conduct  an  investigation  to  determine  whether 
or not quantitative judgements about slope and distance
could accurately be made by people in terrain similar in
topography and extent to that of the localization problem.
Observers were very poor at estimating both distance
and slope. In particular, distance judgments along
a  line  of  sight  were  underestimated  in  comparison  with 
judgments  perpendicular  to  the  line  of  sight,  while  slopes 
were  dramatically  overestimated.  The  errors  are  reduced 
but  not  eliminated  by  restricting  judgments  to  more  level 
and  homogeneous  terrain.  These  results  are  of  engineer¬ 
ing  relevance,  since  image  understanding  systems  also 
are  poor  at  determining  distance  and  slope  unless  active 
sensing  is  used.  It  is  likely  that  the  strategies  observed 
with  expert  map  users  are  adapted  to  solving  localization 
problems  using  primarily  qualitative  information  about 
the  layout  of  the  visible  terrain. 

4  Implementation 

We  are  constructing  a  computer  program  capable  of  rea¬ 
soning  about  localization  problems  in  a  manner  similar 
to  that  described  above.  This  program  allows  us  to  val¬ 
idate  the  utility  of  different  problem  solving  strategies 
by experimenting to determine the efficiency and accuracy
of particular approaches. In its current form, the
program uses many of the strategies described above together
with a number of techniques which reduce complexity
and ambiguity, while quantifying the uncertainty
that remains.

Figure 2: Taxonomy of image and map features.

Map-features       Gaps  Image-  Image-   Image-   Inclines  Peaks
                         Ridges  Saddles  Valleys

Banks (?)           0     3       0        0        1         1
Depressions         5     0       5        5        0         0
Reentrants          5     0       1        3        0         0
Valleys             5     0       5        5        0         0
Basins              5     0       5        5        0         0
Bowls               3     0       3        5        0         0
Cirques             3     0       3        5        0         0
Draws               5     0       1        5        0         0
Gullies             5     0       1        5        0         0
Hanging-valleys     3     0       3        5        0         0
Map-Saddles         5     0       5        3        0         0
Cols                5     0       5        5        0         0
Passes              5     0       5        5        0         0
Prominences (?)     0     3       0        0        3         3
Buttes (?)          0     3       0        0        3         3
Peak-primitive      0     3       0        0        3         5
Ridges              0     5       0        0        3         ?
Buttresses          0     5       0        0        3         3
Shoulders           0     5       0        0        3         3
Spurs               0     5       0        0        3         3
Spires              0     3       0        0        3         5
Walls               0     3       0        0        ?         0
Headwalls           0     3       0        0        3         0


Figure  3:  Compatibility  of  map  and  image  features. 

Hierarchical  matching  is  used  to  establish  correspon¬ 
dences  between  map  and  image  features.  The  hierarchy 
is based on a taxonomy of geographic terms and represents
the classes, subclasses, and instances of image and
map features (see Figure 2). Image features, for example,
are divided into the classes of gaps, ridges, saddles,
valleys,  inclines  and  peaks.  Map  features  are  organized 
into  a  richer  structure,  since  more  information  is  avail¬ 
able  about  them  from  lower-level,  map-understanding 
processes.  This  taxonomy  provides  the  conceptual  cata¬ 
log  on  which  the  representations  for  the  problem-specific 
data are built. The class, subclass, and instance relations
form the basis for the proximity of concepts in the hierarchy.
Proximity is used as a criterion for the value of
matches between compatible, but not identical features.
Figure 3 shows the a priori likelihood currently assigned
to matches between image and map features, with 5 indicating
that the features are highly compatible and 0
indicating  that  the  features  are  very  unlikely  to  match. 
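A compatibility table of this kind lends itself to a simple lookup. The sketch below is an illustration, not the program's actual representation: it encodes a handful of rows as read from Figure 3, and the function names and the threshold of 3 are assumptions introduced for the example.

```python
# Column order follows the image feature classes named in the text.
IMAGE_CLASSES = ("gaps", "image-ridges", "image-saddles",
                 "image-valleys", "inclines", "peaks")

# A subset of rows as read from Figure 3 (0 = very unlikely, 5 = highly compatible).
COMPATIBILITY = {
    "bowls":      (3, 0, 3, 5, 0, 0),
    "buttresses": (0, 5, 0, 0, 3, 3),
    "spurs":      (0, 5, 0, 0, 3, 3),
    "spires":     (0, 3, 0, 0, 3, 5),
    "headwalls":  (0, 3, 0, 0, 3, 0),
}

def compatibility(image_class, map_class):
    """A priori likelihood score for matching an image class to a map class."""
    return COMPATIBILITY[map_class][IMAGE_CLASSES.index(image_class)]

def candidate_matches(image_class, threshold=3):
    """Map classes worth hypothesizing for an image feature, best score first."""
    scored = [(compatibility(image_class, m), m) for m in COMPATIBILITY]
    return sorted((pair for pair in scored if pair[0] >= threshold), reverse=True)

if __name__ == "__main__":
    print(candidate_matches("peaks"))
    print(candidate_matches("image-valleys"))
```

Pruning on the a priori score before any geometric reasoning is one plausible way to keep the number of feature match hypotheses small.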

Three types of hypotheses are used: feature match
hypotheses,  configuration  match  hypotheses,  and  view¬ 
point  hypotheses.  These  represent  the  possibility  that 
a  feature  or  a  configuration  from  the  image  corresponds 
to  a  feature  or  a  configuration  from  the  map,  or  that 
the  agent  is  at  a  given  location  on  the  map,  viewing 
the  world  from  a  certain  direction.  Hypotheses  are  also 
implemented  in  a  hierarchy,  reflecting  the  concrete  and 
isolated  nature  of  the  low-level  data  and  the  abstract  and 
conglomerate  nature  of  the  high-level  data.  Viewpoint 
hypotheses are made up from and depend on configuration
match hypotheses. These in turn are made up
from  and  depend  on  configurations  and  feature  match 
hypotheses  which  themselves  rely  on  the  incoming  im¬ 
age  and  map  data. 
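The dependency structure among the three hypothesis types can be pictured with a few record types. This is a hypothetical sketch of the hierarchy only; the class and field names are assumptions, and the feature names in the usage example are borrowed from the program output shown later in this section.

```python
from dataclasses import dataclass
from typing import List

@dataclass
class FeatureMatch:
    """Lowest level: one image feature paired with one map feature."""
    image_feature: str
    map_feature: str

@dataclass
class ConfigMatch:
    """Middle level: a configuration match built from feature matches."""
    matches: List[FeatureMatch]

@dataclass
class ViewpointHypothesis:
    """Top level: a viewpoint that depends on one or more configuration matches."""
    config_matches: List[ConfigMatch]
    viewing_from: str

    def supporting_feature_matches(self) -> List[FeatureMatch]:
        """Flatten the hierarchy down to the feature matches it relies on."""
        return [m for c in self.config_matches for m in c.matches]

if __name__ == "__main__":
    config = ConfigMatch([
        FeatureMatch("ridge-a", "the-jaw"),
        FeatureMatch("valley-a", "paintbrush-c"),
        FeatureMatch("peak-a", "m-moran"),
    ])
    hypothesis = ViewpointHypothesis([config], viewing_from="east")
    print(hypothesis.supporting_feature_matches())
```

Keeping the links explicit means that rejecting a low-level feature match can be propagated to every viewpoint hypothesis that depends on it.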

The  hierarchical  organization  of  features  and  hypothe¬ 
ses  facilitates  advanced  control  of  the  reasoning  pro¬ 
cesses.  Procedures  which  combine  low-level  data  into 
higher-level  components  are  data-driven.  Other  proce¬ 
dures  which  use  indications  from  high-level  data  to  drive 
their  analysis  of  the  low-level  data  are  hypothesis-driven. 
An advanced, problem-solving control structure integrates
the data- and hypothesis-driven reasoning components
to balance their work.

The  following  example  illustrates  several  of  the  fea¬ 
tures  currently  implemented.  The  problem  corresponds 
approximately  to  Figures  1  and  2  in  [Savitt  et  al.,  1992]. 
The  output  contains  fragments  from  an  actual  run  of  the 
program,  simplified  to  improve  readability.  A  much  more 
detailed  description  of  this  same  example  is  presented  in 
[Bennett,  1992]. 

Starting  with  the  view,  initial  reconnaissance  generates 
prominent  features: 

(((feature peak-a) (type image) (isa image-peak))
 ((feature valley-a) (type image) (isa image-valley)
  (left-of peak-a))
 ...)

These  features  are  evaluated  to  locate  configurations: 

((image-configuration t1)
 (components (ridge-a valley-a peak-a))
 (relation left-of))

Additional  reconnaissance  generates  prominent  map  fea¬ 
tures: 

(((feature Mount-Moran) (type map) ... )
 ((feature Mount-St-John) (type map) ... ) ...)

Map  configurations  are  identified: 

(((map-configuration t10)
  (components (m-wister g-teton m-st-john))
  (relation north-neighbor)) ... )

Eight  feature  matches  are  hypothesized: 

(((feature-match  t2)  (image-feature  peak-a) 
(map-feature  m-moran)  . . . )  . . . ) 

Effective  configurations  are  not  found,  forcing  a  search 
for  configurations  among  features  at  finer  levels  of  detail: 

((config-match t128) (matches ((ridge-a the-jaw)
 (valley-a paintbrush-c) (peak-a m-moran))) ... )

Twelve  viewpoint  hypotheses  are  proposed: 

((viewpoint-hypothesis t129)
 (matches ((ridge-a the-jaw)
  (valley-a paintbrush-c) (peak-a m-moran)))
 (viewing-from east) (viewing-location (...))
 ...)

Hypothesis-driven reasoning explores each of the proposed
viewpoints, refining the feature matching and location
hypotheses. Eventually, the correct answer surfaces
as the best hypothesis. Other hypotheses, less consistent
with the data, are ranked accordingly.

The program is currently operating with simulated input
corresponding to a portion of Grand Teton National
Park (see the example in [Thompson et al., 1990]). We
are in the process of using methods described in [Savitt
et al., 1992] to allow operation on real data.

5  Conclusions 

Over  the  course  of  our  research,  we  have  obtained  a 
number  of  important  insights  into  effective  and  efficient 
methods for solving localization problems. These results
have been made possible by an interdisciplinary
approach which integrates computational analysis, investigations
of expert map users, and computer simulations.
Problem solving strategies have been discovered
which appear to give substantial leverage in managing
the complexity and ambiguity associated with navigation
in outdoor terrain. The work has applications in
both  automated  navigation  aids  and  in  training.  In  ad¬ 
dition,  there  is  likely  to  be  relevance  to  other  combinato¬ 
rial  geometric  matching  problems  such  as  feature-based 
object  recognition. 
Abstract

Visual  navigation  with  a  map  requires  effec¬ 
tive  perception  of  both  the  information  in  the 
map  as  well  as  the  terrain  imagery  that  is  vis¬ 
ible  from  the  current  position.  In  open,  out¬ 
door  terrain  where  cultural  features  are  lack¬ 
ing,  ridgeline  and  valley  features  are  particu¬ 
larly  important.  This  paper  describes  a  novel 
method  for  extracting  such  features  from  dig¬ 
ital  elevation  maps.  In  addition,  preliminary 
results  are  presented  on  extracting  correspond¬ 
ing  features  from  views  of  the  terrain. 

1  Introduction 

Determining  the  location  of  a  viewpoint  with  respect  to 
a  map  requires  two  perception  tasks:  a  map  understand¬ 
ing  process  which  extracts  corresponding  features  from 
the  map  and  an  image  understanding  process  which  ex¬ 
tracts  salient  landmarks  from  one  or  more  views.  Peaks, 
ridgelines,  saddles,  and  valleys  are  important  landmarks 
in outdoor terrain when cultural features in known locations
are absent [Pick et al., 1992]. These features have
precise  definitions,  specified  in  terms  of  local  slope  prop¬ 
erties  of  the  underlying  surface  [Peucker  and  Douglas, 
1975, Haralick et al., 1983]. Using these definitions to
extract  navigational  features  from  elevation  data  can  be 
difficult,  however,  due  to  their  intrinsically  local  nature. 
The  saliency  of  map  features  depends  on  properties  over 
a variety of scales [Savitt, 1991]. In subsequent sections,
we  outline  an  alternate  approach  for  finding  ridgelines 
and  valleys  given  digital  elevation  data.  We  then  con¬ 
clude  with  some  preliminary  results  on  the  extraction  of 
the  same  features  from  ground  level  imagery  of  outdoor 
terrain. 

2  Extraction  of  Map  Features 

Valley  features  can  be  efficiently  found  using  a  “reverse 
engineering”  strategy  which  simulates  aspects  of  the  val¬ 
ley  formation  process.  Valleys  are  erosional  features. 

*This work was supported by National Science Foundation
grant IRI-9196146, with partial funding from the Defense Advanced
Research Projects Agency, and by a grant from Texas

We model this erosion using a hydrologic simulation,
though the method will work for lands shaped by
other  processes.  Since  the  eroding  flows  moved  from 
higher  to  lower  elevations,  the  floor  of  the  resulting  val¬ 
ley  varies  monotonically  in  elevation  along  its  length. 
Also  the  width,  depth,  and  hence  visual  significance  of 
a  valley  tends  to  relate  to  the  volume  of  the  hydrologic 
flow  that  formed  the  valley.  Fluid  flows  over  the  re¬ 
gion  of  interest  are  simulated  to  reveal  the  paths  through 
which  the  eroding  flows  were  channeled.  Then  the  loca¬ 
tion  and  magnitude  of  the  simulated  flows  are  used  to 
determine  the  location  and  saliency  of  the  resulting  val¬ 
ley  features.  The  approach  was  motivated  by  the  early 
work  of  [Speight,  1968]  who  developed  a  manual  tech¬ 
nique  for  extracting  watercourse  features  from  printed 
contour  maps.  A  computer  implementation  of  Speight’s 
algorithm  for  hydrologic  flow  analysis  was  compared  by 
[Mark,  1983]  to  the  algorithm  of  [Peucker  and  Douglas, 
1975]  which  is  based  on  an  analysis  of  local  surface  prop¬ 
erties. 

The  implemented  algorithm  consists  of  a  two  step  pro¬ 
cess  in  which  a  hydrologic  flow  simulation  is  first  used  to 
determine  how  much  fluid  would  pass  each  point  in  the 
elevation  grid  if  the  terrain  surface  were  subjected  to  a 
uniform rainfall. Then in a second step the desired features
are extracted from the resulting flow analysis, and
the  relative  flow  strengths  associated  with  each  feature 
are  used  to  predict  the  visual  saliency  of  the  feature. 

The  hydrologic  flow  simulation  operates  on  a  data 
structure  called  a  flow  image  which  consists  of  flow 
counter  cells  that  have  a  one  to  one  correspondence  with 
the  elevation  samples  in  the  terrain  map.  These  cells  are 
used  to  keep  track  of  the  total  amount  of  fluid  that  flows 
past  the  corresponding  surface  point  of  the  terrain. 

The  simulation  is  initiated  by  introducing  one  unit  of 
fluid  into  each  cell  of  the  flow  image.  Then  beginning 
with  the  surface  element  having  the  highest  elevation, 
the  fluid  in  the  associated  element  is  passed  down  to 
the  nearest  of  the  eight  neighboring  elements  with  the 
steepest downward slope from the initial element. (Since
elevation values are quantized and of limited range, an
efficient bin sort can be used.) The
passing is accomplished by adding the amount of fluid
being passed to the quantity of fluid already registered in
the flow counter of the receiving element. The element
from the cell with the next lower elevation value is processed
next, and the process continues until the lowest
elevation  point  is  reached.  The  resulting  flow  image  re¬ 
veals  how  much  of  the  initial  fluid  flows  past  each  point 
of  the  terrain.  The  values  can  range  from  a  minimum  of 
one  unit  which  is  always  the  case  for  the  highest  point 
in the terrain, to a maximum of n × m units in the rare
case  that  all  of  the  initial  fluid  distribution  is  channeled 
downward past a single point of the terrain. The magnitudes
of flow values in the flow image are used to predict
the  visual  salience  of  the  associated  valley  features.  Our 
experience  indicates  that  the  flow  magnitude  is  a  good 
predictor  of  visual  salience  as  it  reveals  the  intensity  of 
the  valley  formation  process  which  caused  certain  valleys 
to  become  more  visually  salient  than  others  in  terms  of 
their  extent,  depth  and  breadth. 
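
The simulation described above can be sketched in a few lines of Python. This is a minimal illustration, not the authors' implementation: the function name `flow_image`, the use of NumPy, the general-purpose sort (in place of the bin sort the text mentions), and the distance-normalized slope for diagonal neighbors are all assumptions made for the example.

```python
import numpy as np

# Offsets to the eight neighbouring cells.
NEIGHBOURS = [(-1, -1), (-1, 0), (-1, 1), (0, -1),
              (0, 1), (1, -1), (1, 0), (1, 1)]


def flow_image(elevation):
    """Accumulate unit rainfall downhill over an elevation grid."""
    elev = np.asarray(elevation, dtype=float)
    rows, cols = elev.shape
    flow = np.ones((rows, cols))  # one unit of fluid per cell

    # Visit cells from highest to lowest elevation. A general sort is
    # used here; a bin sort would do for quantized elevations.
    order = np.argsort(-elev, axis=None, kind="stable")
    for idx in order:
        r, c = divmod(int(idx), cols)
        best_slope, target = 0.0, None
        for dr, dc in NEIGHBOURS:
            rr, cc = r + dr, c + dc
            if 0 <= rr < rows and 0 <= cc < cols:
                # Drop per unit distance (assumed); diagonals are longer.
                slope = (elev[r, c] - elev[rr, cc]) / np.hypot(dr, dc)
                if slope > best_slope:
                    best_slope, target = slope, (rr, cc)
        if target is not None:  # pits and flat cells keep their fluid
            flow[target] += flow[r, c]
    return flow
```

On the 3 x 3 V-shaped grid `[[5, 4, 5], [4, 3, 4], [3, 2, 3]]` the outlet cell at the bottom of the center column collects all nine units, the n × m maximum mentioned above. Because every cell passes its fluid only to a strictly lower neighbor, each cell has received all of its inflow by the time it is visited.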

Ridge  features  are  found  in  a  similar  manner.  In  effect, 
the  surface  of  the  earth  is  considered  to  be  an  infinitely 
thin  shell  and  when  inverted,  the  valleys  become  ridges 
and  the  ridges  become  the  valleys.  The  hydrologic  flow 
based  feature  extractor  is  then  applied  to  the  inverted 
terrain  model  to  find  ridgelines.  Ridgelines,  of  course, 
were  not  formed  in  this  manner  and  in  fact  have  certain 
key  properties  that  are  quite  distinct  from  valleys.  In 
particular,  ridgelines  are  seldom  monotonic  in  elevation 
along  their  length.  Nevertheless,  simple  modifications 
to  the  hydrologic  flow  algorithm  provide  a  useful  mech¬ 
anism  for  localizing  ridges. 

The  apparent  success  of  the  flow  simulation  technique 
for  extracting  ridgelines  can  be  explained  as  follows. 
Ridges  that  extend  radially  outward  from  the  peak  of 
a  hill  tend  to  have  extended  sections  which  do  decrease 
monotonically  in  elevation.  Once  inverted,  these  ridge¬ 
line  sections  are  then  directly  detectable  by  the  val¬ 
ley  detection  algorithm  since  they  present  a  continuous 
channel  through  which  a  simulated  fluid  flow  can  pass. 
The terminations of the monotonic sections of ridgelines
occur  at  either  local  peaks  or  saddle  points.  The  lo¬ 
cal  peaks  will  cause  the  simulated  flow  from  the  two 
branches  on  opposite  sides  of  the  peak  to  join  at  the 
site  of  the  peak  resulting  in  the  ridgeline  being  properly 
identified  as  a  continuous  feature  through  the  peak.  The 
problem  occurs  at  the  locations  of  the  saddle  points  that 
exist  along  the  ridgeline  between  pairs  of  local  peaks. 
When  encountered  during  the  simulation,  a  peak  blocks 
the simulated flow, and the flow must then be purposefully
extended across an adjacent saddle point along the
ridge  to  some  point  at  a  higher  elevation  than  that  of 
the  peak. 

The  magnitudes  of  simulated  flows  along  a  ridgeline 
are  used  as  an  indication  of  the  visual  salience  of  the 
ridgeline,  although  the  rationale  for  doing  so  differs  from 
the argument forwarded for the case of valley features.
In the case of ridges, the magnitude of the simulated flow
is not directly related to the strength of the process that
formed the ridges, as is the case for valleys. However,
the magnitude of the fluid that impinges on a ridgeline
is directly related to the surface area of the terrain that
extends from the ridgeline to the interposing valleys on
either side of the ridge, and this surface area is an important
metric for estimating the visual saliency of a ridge.
The  surface  area  can  be  thought  of  as  the  product  of  the 


length  of  the  ridge  and  the  slant  distance  between  the 
ridgeline  and  each  of  the  adjacent  valleys,  both  factors 
which  contribute  to  the  visual  salience  of  a  ridgeline. 

After  the  flow  image  has  been  generated,  an  adap¬ 
tive  threshold  is  applied  to  extract  the  desired  terrain 
features.  In  the  case  of  valleys  and  ridges,  the  features 
emerge  in  a  dendrite-like  tree  pattern  that  depicts  the 
connected  valley  and  ridge  networks.  In  the  final  phase 
of  the  algorithm,  the  resulting  network  trees  are  parsed 
to  extract  the  individual  branches  of  the  network,  and 
the  strength  of  the  fluid  flow  associated  with  each  branch 
is  used  to  determine  the  visual  saliency  of  the  terrain 
feature.  The  individual  valleys  and  ridges  are  then  or¬ 
ganized  into  a  hierarchical  database  of  terrain  features. 
The  terrain  features  with  the  highest  salience  are  placed 
in  the  highest  position  within  the  hierarchy. 
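
The thresholding and ranking stage can be illustrated with a short sketch. The text does not specify the adaptive threshold or the branch-parsing procedure, so this example makes two loudly stated substitutions: a percentile threshold stands in for the adaptive threshold, and 8-connected components stand in for parsed network branches. The function name `salient_features` and its `percentile` parameter are hypothetical.

```python
from collections import deque

import numpy as np


def salient_features(flow, percentile=90.0):
    """Threshold a flow image and rank connected features by peak flow."""
    flow = np.asarray(flow, dtype=float)
    # Assumed threshold rule: keep cells at or above a flow percentile.
    mask = flow >= np.percentile(flow, percentile)
    labels = np.zeros(flow.shape, dtype=int)
    features = []
    for seed in zip(*np.nonzero(mask)):
        if labels[seed]:
            continue
        label = len(features) + 1
        labels[seed] = label
        queue, cells = deque([seed]), []
        while queue:  # breadth-first flood fill over 8-connected cells
            r, c = queue.popleft()
            cells.append((int(r), int(c)))
            for rr in range(r - 1, r + 2):
                for cc in range(c - 1, c + 2):
                    inside = (0 <= rr < flow.shape[0]
                              and 0 <= cc < flow.shape[1])
                    if inside and mask[rr, cc] and not labels[rr, cc]:
                        labels[rr, cc] = label
                        queue.append((rr, cc))
        peak = max(flow[cell] for cell in cells)
        features.append({"cells": cells, "salience": float(peak)})
    # Most salient features first, mirroring the top of the hierarchy.
    return sorted(features, key=lambda f: f["salience"], reverse=True)
```

Raising `percentile` keeps only the strongest flows, and lowering it admits weaker ones, so the parameter plays the role of the threshold level in this sketch.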


Figure  2:  Wireframe  view. 


Figure  1  shows  a  conventional  topographic  map  of  a 
region  in  Grand  Teton  National  Park.  Figure  2  shows  a 
simulated  view,  derived  from  digital  terrain  data,  of  the 
region  as  seen  from  the  position  marked  with  the  “x” 
in  Figure  1.  Figure  3  illustrates  the  terrain  features  ex¬ 
tracted  using  local  slope  behavior  of  the  terrain  surface. 
The white lines are the valley features and the black lines
are the ridge features. The features are overlaid on an
intensity coded elevation image in which darker shades
of grey correspond to pixels at lower elevations.

Figure 3: Ridgelines and valleys using local slope properties.

Figure 4: Ridgelines and valleys using hydrologic simulation with high threshold.

The  linear  terrain  features  that  are  extracted  by  the 
local  slope  method  consist  of  several  partial  feature  seg¬ 
ments  along  the  extent  of  the  full  feature.  Consider,  for 
example,  the  valley  feature  along  Moran  Canyon.  The 
valley  is  readily  apparent  in  the  wireframe  representation 
due  to  the  presence  of  the  two  enclosing  hillsides.  The 
valley  floor  tends  to  be  rather  flat,  irregular  and  winding, 
with  the  result  that  a  process  to  extract  the  linear  valley 
feature  based  on  local  slope  information  results  in  the 
detection  of  only  partial  spans  of  the  full  valley.  As  seen 
in  Figure  3,  the  extracted  valley  feature  is  broken  into 
approximately  ten  segments  in  the  span  before  the  fork, 
and  approximately  40%  of  the  valley  is  missed.  A  second 
problem  with  local  slope  based  feature  detection  is  that 
visual  saliency  and  navigational  significance  of  terrain 
features do not appear to correlate well with measures
of  local  slope.  For  instance,  one  of  the  more  dominant 
valley  features  that  is  detected  by  the  local  slope  based 
feature  detector  is  a  side  valley  that  branches  south  off 
of Moran Canyon between Thor Peak and Cleaver Peak.
The narrower, more distinct valley bottom of this side
valley results in a stronger detection result than that
obtained for the primary valley of Moran Canyon even
though Moran Canyon is clearly the more visually salient
terrain  feature  in  the  region. 

The  result  of  the  flow  based  algorithm  is  shown  in 
Figures  4  and  5.  As  before,  the  white  lines  are  the  val¬ 
ley  features  and  the  black  lines  are  the  ridge  features.  In 
Figure  4  a  rather  high  threshold  is  used,  leaving  only  the 
more  prominent  features.  Figure  5  uses  a  lower  threshold 
and  thus  includes  greater  detail.  Notice  that  the  valley 
features  are  now  fully  connected  as  a  result  of  the  flow 
continuity  feature  of  the  underlying  simulation.  In  addi¬ 
tion,  the  intensity  of  the  flow  provides  a  metric  that  can 
be  used  to  predict  the  visual  and  navigational  saliency 
of  the  respective  feature.  For  example,  Figure  4  includes 
only  the  main  valleys  while  in  Figure  5  the  less  salient 
side  valleys  emerge. 

3  Extraction  of  Image  Features 

Ridgelines  in  terrain  appear  as  distinct  contours  with 
a  predominately  horizontal  orientation  in  ground-level 
views  of  the  terrain.  Valley  features  are  harder  to  ex¬ 
tract  from  terrain  images,  since  much  of  the  valley  itself 
is  typically  occluded  by  the  ridges  to  either  side.  When 
enough  of  the  walls  are  visible,  texture  measures  can 
sometimes be used to locate valleys [Thompson et al.,
1990].  Otherwise,  the  existence  of  a  valley  must  be  in¬ 
ferred  from  the  ridgeline  pattern.  While  the  possible 
associations  between  ridge  shape  and  valleys  are  both 
complex and ambiguous, it is generally the case that
smooth  dips  and  T-junctions  in  ridgelines  provide  evi¬ 
dence  for  the  existence  of  valleys  [Bennett,  1992]. 

We  are  currently  investigating  how  conventional  image 
segmentation techniques can be adapted to the unique
requirements  associated  with  the  extraction  of  ridgeline 
features.  The  problem  is  difficult  because  the  irregular 
nature  of  terrain  precludes  the  effective  use  of  simple 


line-finder  type  algorithms  and  because  of  the  presence 
of  extraneous  features  such  as  sky,  clouds,  trees  or  other 
vegetative  cover.  Results  presented  here  only  involve  the 
sky-terrain  boundary.  This  is  the  easiest  set  of  ridgelines 
to  find  and  can  aid  in  locating  other  ridgelines  closer  to 
the  viewpoint. 

Figure  6  is  a  digitized  image  taken  from  a  location  near 
the  Alta  ski  area,  just  outside  of  Salt  Lake  City.  Figure  7 
shows  the  output  of  a  standard  zero-crossing  edge  detec¬ 
tor  applied  to  the  image  shown  in  Figure  6.  Note  that 
while  sky-terrain  boundary  elements  have  been  found, 
many  long  edge  segments  exist  above  and  below. 


Figure  6:  Outdoor  terrain  near  Salt  Lake  City. 

Figure  8  shows  the  output  of  an  edge  filtering  algo¬ 
rithm  which  produces  a  connected  contour  across  the  im¬ 
age.  The  contour  is  determined  using  a  heuristic  search 
algorithm  that  attempts  to  simultaneously  minimize  the 
contour  length  and  the  number  of  gap  pixels  that  have 
to  be  filled.  (Similar  methods  date  back  at  least  as  far 
as [Martelli, 1972].) In this example and other similar
tests, the algorithm was effective in finding the actual
sky-terrain boundary. For comparison purposes,
Figure 9 shows the predicted occlusion
boundaries  at  approximately  the  viewpoint  from  which 
Figure  6  was  taken,  generated  from  digital  elevation  data 
of  the  area.  Figure  10  shows  the  results  of  applying  a 
peak  and  saddle  detector  to  Figure  8.  Currently,  this  is 
done  using  a  simple  extrema  detector.  Work  is  underway 
to  characterize  the  shape  of  extracted  features  and  to 
differentiate  between  true  saddle  points  and  T-junction 
gaps  indicating  that  one  peak  is  in  front  of  another. 
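
The connected-contour step can be illustrated with a small sketch. The text describes a heuristic search that trades contour length against the number of gap pixels to be filled; the example below swaps in a column-by-column dynamic program with the same two cost terms, so it is a stand-in and not the authors' algorithm. The function name `trace_contour` and the parameters `gap_cost` and `max_jump` are hypothetical.

```python
import numpy as np


def trace_contour(edges, gap_cost=2.0, max_jump=2):
    """Pick one row per column, trading contour length against gap pixels."""
    edges = np.asarray(edges, dtype=bool)
    rows, cols = edges.shape
    # A non-edge pixel on the contour is a gap that has to be filled.
    pixel_cost = np.where(edges, 0.0, gap_cost)
    total = np.full((rows, cols), np.inf)
    back = np.zeros((rows, cols), dtype=int)
    total[:, 0] = pixel_cost[:, 0]
    for c in range(1, cols):
        for r in range(rows):
            lo, hi = max(0, r - max_jump), min(rows, r + max_jump + 1)
            prev = np.arange(lo, hi)
            # Step length grows with the vertical jump between columns.
            step = np.hypot(1.0, prev - r)
            cand = total[prev, c - 1] + step
            k = int(np.argmin(cand))
            total[r, c] = cand[k] + pixel_cost[r, c]
            back[r, c] = prev[k]
    # Backtrack from the cheapest end point in the last column.
    path = [int(np.argmin(total[:, -1]))]
    for c in range(cols - 1, 0, -1):
        path.append(int(back[path[-1], c]))
    return path[::-1]
```

The function returns the contour row for each image column. A larger `gap_cost` makes the contour follow detected edge pixels more closely, while a smaller one favors a shorter, straighter contour that bridges gaps.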
Abstract

Interreflections  cause  recovery  algorithms  to  extract  er¬ 
roneous  estimates  of  surface  shape  and  color.  Due  to  the 
complex  nature  of  the  interreflection  phenomenon,  its  ef¬ 
fects  are  difficult  to  analyze  in  a  multi-dimensional  color 
space.  Noting  that  the  interreflections  of  one  wavelength 
are  in  general  unaffected  by  those  of  any  other  wavelength, 
we  decompose  the  interreflection  process  into  three  in¬ 
dependent  processes  corresponding  to  the  three  spectral 
bands of color images. Photometric stereo is applied independently
to each spectral band and the recovered
(pseudo)  shapes  and  color  are  related  to  the  actual  shape 
and  color  of  the  surface.  An  algorithm  is  proposed  that 
recovers  the  actual  shape  and  color  of  the  surface  from  the 
pseudo estimates. The accuracy and robustness of the algorithm
are demonstrated on a variety of colored and multi-colored
surfaces.


1  Interreflections  and  Shape  Recovery 

Points  in  the  scene,  when  illuminated,  reflect  light  not  only 
in  the  direction  of  the  sensor  but  also  between  themselves. 
This  is  always  true  with  the  exception  of  scenes  that  con¬ 
sist  of  only  a  single  convex  surface,  in  which  case,  no  two 
points  on  the  surface  are  visible  to  one  another.  In  general, 
however, scenes contain concavities and points in the scene reflect
light  between  themselves.  These  interreflections  can  ap¬ 
preciably  alter  the  appearance  of  the  scene.  Existing  vision 
algorithms  do  not  account  for  the  effects  of  interreflections 
and  hence  often  produce  erroneous  results. 

Two  separate  problems  associated  with  interreflections 
can  be  identified;  the  forward  (graphics)  problem  and  the 
inverse  (vision)  problem.  Most  of  the  previous  work  done 
in  this  area  is  related  to  the  forward  problem.  The  for¬ 
ward  problem  involves  the  prediction  of  image  brightness 
values  given  the  shape  and  reflectance  of  a  scene.  Horn 
[Horn  70]  discussed  the  changes  in  image  intensities  due 
to  interreflections  caused  by  polyhedral  surfaces  that  are 
Lambertian in reflectance. Koenderink and van Doorn
[Koenderink  83]  formalized  the  interreflection  process  for 
Lambertian  surfaces  of  arbitrary  shape  and  varying  re¬ 
flectance  (albedo).  They  proposed  a  solution  to  the  for¬ 
ward  problem  in  terms  of  the  eigenfunctions  of  the  inter¬ 
reflection  kernel.  Cohen  and  Greenberg  [Cohen  85]  mod¬ 


eled  the  scene  as  a  finite  collection  of  Lambertian  planar 
facets  and  proposed  a  radiosity  solution  to  the  forward 
problem  and  used  it  to  render  images  for  graphics.  Later, 
Forsyth  and  Zisserman  [Forsyth  89]  used  a  similar  numer¬ 
ical  solution  to  the  forward  problem  to  compare  predicted 
and  measured  image  intensities. 

More recently, Nayar et al. [Nayar 91] demonstrated the
effects of interreflections on shape-from-intensity algorithms
such as shape-from-shading [Horn 70] and photometric
stereo [Woodham 78]. These are algorithms that recover
three-dimensional shape information from image intensities.
All shape-from-intensity methods are based on
the assumption that points in the scene are illuminated
only by the sources of light and not other points in the
scene; interreflections are assumed not to exist. As a result,
these methods produce erroneous results when applied to
concave surfaces. Nayar et al. analyzed the incorrect shape,
the  pseudo  shape,  recovered  by  shape-from-intensity  meth¬ 
ods  when  applied  to  concave  Lambertian  surfaces.  They 
established  a  relation  between  the  actual  shape  and  pseudo 
shape  and  developed  an  algorithm  that  recovers  the  actual 
surface  from  the  pseudo  shape  and  reflectance  estimates. 

In their analysis, Nayar et al. [Nayar 91] focused on gray
surfaces;  each  point  on  the  Lambertian  surface  was  as¬ 
sumed  to  have  a  constant  albedo  value  that  is  independent 
of  the  wavelength  of  incident  light.  In  the  case  of  colored 
surfaces,  however,  the  reflectance  of  a  surface  point  is  de¬ 
pendent  on  the  spectral  distribution  of  the  incident  light. 
In  the  case  of  concave  surfaces,  the  spectral  distribution 
of light rays incident on a given surface point depends on
the  spectral  characteristics  of  both  the  source  as  well  as 
other  points  on  the  surface.  Further,  the  light  reflected 
by  the  surface  point  also  depends  on  its  own  spectral  re¬ 
flectance  properties.  Interreflections  in  the  case  of  colored 
surfaces,  therefore,  can  cause  the  color  of  one  surface  re¬ 
gion  to  bleed  onto  another.  These  effects  have  been  noted 
by  other  researchers  [Bajcsy  89],  [Brill  89],  [Novak  90],  and 
[Drew 90]. The common approach is to analyze these interreflection
effects in a multi-dimensional color space. Since
interreflections are influenced by several factors including
the  unknown  shape  and  color  of  the  surface,  the  analysis 
in  color  space  is  difficult. 

Figure  1  shows  the  effects  of  colored  interreflections  on 
photometric stereo. The surface has three regions of different
color. The actual colors of the surface points are seen to
cluster  at  three  points  in  the  color  space.  The  photometric 
stereo method is applied independently to the three bands
of  the  color  image.  As  seen  from  the  figure,  three  different 
shapes  (all  incorrect)  are  recovered  from  the  three  bands. 
The  recovered  color  is  also  erroneous;  it  is  not  constant 
within  each  constant-color  region  of  the  surface. 

[Figure 1 panels: Actual Shape and Color; Photometric Stereo Results]


Figure  1:  The  actual  shape  and  color  of  a  surface  and  its 
shape  and  color  computed  by  photometric  stereo. 


In  this  paper,  we  extend  the  results  in  [Nayar  91]  to 
colored  and  multi-colored  surfaces.  Noting  that  the  three 
bands of a color image correspond to different wavelengths
of  light,  we  decompose  the  interreflection  process  into 
three  independent  processes.  We  then  analyze  the  pseudo 
shape  and  pseudo  albedo  function  produced  by  photomet¬ 
ric  stereo  in  each  band.  Then,  we  establish  a  relation  be¬ 
tween  the  pseudo  shape  and  color  and  the  actual  shape 
and  color  of  a  surface.  Using  this  relation,  an  algorithm  is 
given  that  recovers  the  actual  shape  and  color  of  the  sur¬ 
face  from  the  pseudo  estimates.  The  algorithm  is  tested 
on  a  variety  of  multi-colored  surfaces.  The  results  demon¬ 
strate  the  robustness  and  accuracy  of  the  algorithm. 

2 A Diffuse Interreflection Model

Our  solution  to  the  inverse  interreflection  problem  is  based 
on the solution to the forward problem, i.e. modeling interreflections
for diffuse surfaces of known shape and reflectance.
The interreflection model described here is based
on the formulation proposed by Koenderink and van Doorn


[Koenderink 83]. All surfaces in the scene are assumed to
be Lambertian. We will shortly see that this assumption
is necessary to obtain a closed form solution to the forward
interreflection problem. The Lambertian surface can have
any arbitrary shape and varying reflectance, i.e. albedo
($\rho$) may vary from one surface point to the next. Here, we
will assume that the incident light is monochromatic,
i.e. the surface reflects and interreflects light rays of a single
wavelength. Hence each surface point can be assumed
to have a constant reflectance coefficient (albedo). Later
we show that the interreflection model developed here for
monochromatic light can be extended and used for colored
surfaces.

Consider the concave surface shown in Figure 2. The
surface is divided into $m$ infinitesimal facets. Let $\mathbf{x}_i$ and
$dx_i$ represent the three-dimensional coordinates and the
surface area of the $i$th facet, respectively. The radiance
(brightness) and albedo values of each facet are assumed to
be constant over the entire facet and equal to the radiance
and albedo values at the center point $\mathbf{x}_i$ of the facet, i.e.
$L_i = L(\mathbf{x}_i)$ and $\rho_i = \rho(\mathbf{x}_i)$. Consider the two facets $i$ and
$j$. The radiance of the facet $i$ due to the radiance of the
facet $j$ is determined using basic radiometric definitions
[Nicodemus 77] as:


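$$L_i = \frac{\rho_i}{\pi} K_{ij} L_j \qquad (1)$$

where the factor $K_{ij}$ is given by:

$$K_{ij} = \frac{[\mathbf{n}_i \cdot \mathbf{r}_{ij}]\,[\mathbf{n}_j \cdot \mathbf{r}_{ji}]}{[\mathbf{r}_{ij} \cdot \mathbf{r}_{ij}]^2}\, dx_j \qquad (2)$$

$K_{ij}$ is a function of the relative positions and the orientations
of the two facets; it determines the interreflections
between $i$ and $j$ from a purely geometrical perspective. It
is referred to as the interreflection kernel.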


Figure  2:  Modeling  the  surface  as  a  collection  of  facets, 
each  with  its  own  radiance  and  albedo  values. 

Now,  let  us  consider  the  entire  surface  shown  in  Figure 
2.  Assume  the  surface  to  be  illuminated  by  a  distant  point 
source  of  light.  Rays  of  light  that  impinge  upon  the  sur¬ 
face  are  reflected  between  the  facets.  Since  the  albedo  of 
each facet is less than unity, the rays of light lose a fraction
of their energy with each bounce. Eventually, after possibly
an infinite number of bounces, the radiance at each
point  on  the  surface  converges  to  a  final  steady-state  value. 
Hence,  the  radiance  of  the  facet  i  may  be  expressed  as  a 
sum  of  the  radiance  due  to  direct  illumination  from  the 
source  and  the  radiance  due  to  the  final  radiance  values  of 
other  facets  on  the  surface: 

$$L_i = L_{si} + \frac{\rho_i}{\pi} \sum_{j \neq i} L_j K_{ij} \qquad (3)$$

where $L_{si}$ is the radiance due to direct illumination from
the source and the summation term corresponds to the
radiance due to mutual illumination. This is the interreflection
equation. Note that the radiance values $L_i$ and
$L_j$ are assumed to be constants in the above equation. It
is important to note that this assumption is valid only for
Lambertian surfaces; the radiance of a Lambertian surface
element is independent of the viewing direction.

The interreflection equation for the complete surface can
be written using vector notation. We define the facet radiance
vector as $\mathbf{L} = [L_1, L_2, \ldots, L_m]^T$ and the source
contribution vector as $\mathbf{L}_s = [L_{s1}, L_{s2}, \ldots, L_{sm}]^T$. We
also define the albedo matrix $\mathbf{P}$ and the kernel matrix $\mathbf{K}$
the term shape-from-intensity, we mean local methods that
extract  both  shape  (orientation)  and  reflectance  (albedo) 
information.  Photometric  stereo  [Woodham  78]  is  an  ex¬ 
ample  of  such  a  shape-from-intensity  method.  In  the  pres¬ 
ence  of  interreflections,  photometric  stereo  extracts  erro¬ 
neous  shape  as  well  as  erroneous  reflectance  estimates.  We 
refer  to  the  extracted  shape  as  the  pseudo  shape  and  the 
extracted  reflectance  as  the  pseudo  reflectance  of  the  sur¬ 
face.  In  this  section,  we  again  assume  that  the  light  sources 
used  to  illuminate  the  surface  are  monochromatic.  Under 
this  assumption,  we  investigate  how  the  pseudo  shape  and 
reflectance  are  related  to  the  actual  shape  and  reflectance 
of  the  surface.  In  the  next  section,  we  extend  these  results 
to colored interreflections.

Once again, consider the surface comprised of $m$ facets
(Figure 2). The $i$th facet may be mathematically represented
as:

$$\mathbf{N}_i = \frac{\rho_i}{\pi}\, \mathbf{n}_i \qquad (7)$$

where $\mathbf{n}_i = [n_{xi}, n_{yi}, n_{zi}]^T$ is the unit surface normal
and $\rho_i$ is the albedo value for the facet. Therefore, the
term "facet" represents both local orientation as well as
local reflectance information. The complete surface is then
defined by the facet matrix $\mathbf{F} = [\mathbf{N}_1, \mathbf{N}_2, \ldots, \mathbf{N}_m]^T$.

Consider, once again, the interreflection equation given by
equation 6. Since the surface is Lambertian, the source
contribution vector $\mathbf{L}_s$ may be determined from the facet
matrix $\mathbf{F}$ and the source direction vector $\mathbf{s} = [s_x, s_y, s_z]^T$
as:

$$\mathbf{L}_s = \mathbf{F}\,\mathbf{s} \qquad (8)$$
Then, equation 3 may be written as:

$$\mathbf{L} = \mathbf{L}_s + \mathbf{P}\mathbf{K}\mathbf{L} \qquad (5)$$

or:

$$\mathbf{L} = (\mathbf{I} - \mathbf{P}\mathbf{K})^{-1}\, \mathbf{L}_s \qquad (6)$$

where $\mathbf{I}$ is the identity matrix. Thus, we have obtained
a non-iterative, closed-form solution to the forward interreflection
problem. The kernel and albedo matrices are
determined by the shape and reflectance of the surface, respectively.
The source direction and intensity may be used
to compute the source contribution vector $\mathbf{L}_s$. Then the
radiance of the surface facets, $\mathbf{L}$, can be determined using
the above equation.
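
The closed-form forward solution can be sketched numerically as follows. This is an illustrative example under stated assumptions, not the authors' code: it assumes the kernel takes the standard point-to-point form $[\mathbf{n}_i \cdot \mathbf{r}_{ij}][\mathbf{n}_j \cdot \mathbf{r}_{ji}]\, dx_j / [\mathbf{r}_{ij} \cdot \mathbf{r}_{ij}]^2$ with $\mathbf{r}_{ij}$ the vector from facet $i$ to facet $j$, takes the albedo matrix to be diagonal with entries $\rho_i/\pi$, omits any visibility (occlusion) test between facets, and clamps negative cosines to zero. The function names `kernel_matrix` and `forward_radiance` are hypothetical.

```python
import numpy as np


def kernel_matrix(points, normals, areas):
    """Geometric kernel K_ij between facets (no visibility test)."""
    m = len(points)
    K = np.zeros((m, m))
    for i in range(m):
        for j in range(m):
            if i == j:
                continue  # a facet does not illuminate itself
            r_ij = points[j] - points[i]          # from facet i to facet j
            cos_i = max(normals[i] @ r_ij, 0.0)   # clamp back-facing pairs
            cos_j = max(normals[j] @ -r_ij, 0.0)
            K[i, j] = cos_i * cos_j * areas[j] / (r_ij @ r_ij) ** 2
    return K


def forward_radiance(points, normals, albedos, areas, source):
    """Solve L = (I - P K)^-1 L_s for a set of Lambertian facets."""
    K = kernel_matrix(points, normals, areas)
    P = np.diag(albedos / np.pi)                  # assumed albedo matrix
    F = (albedos / np.pi)[:, None] * normals      # rows are (rho/pi) n
    L_s = np.clip(F @ source, 0.0, None)          # direct illumination
    # Solving the linear system avoids forming the inverse explicitly.
    L = np.linalg.solve(np.eye(len(points)) - P @ K, L_s)
    return L, L_s
```

For two facets facing each other across a right-angle groove, the returned radiance `L` exceeds the direct term `L_s` at both facets, which is the brightening that misleads shape-from-intensity methods on concave surfaces.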

3  The  Pseudo  Shape 

From  equation  6  it  is  clear  that  surface  radiance  values  are 
affected  by  interreflections.  This  indicates  that  if  a  shape- 
from-intensity  method  is  applied  to  a  concave  surface  it  is 
expected  to  produce  erroneous  estimates  of  shape.  In  or¬ 
der  to  generalize  the  inverse  interreflection  problem,  we  as¬ 
sume  that  the  reflectance  of  the  Lambertian  surface  is  also 
unknown  and  may  vary  from  point  to  point.  Therefore,  by 


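Hence, we obtain:

$$\mathbf{L} = (\mathbf{I} - \mathbf{P}\mathbf{K})^{-1}\, \mathbf{F}\,\mathbf{s} \qquad (9)$$

We define the matrix $\mathbf{F}_p$ as:

$$\mathbf{F}_p = (\mathbf{I} - \mathbf{P}\mathbf{K})^{-1}\, \mathbf{F} \qquad (10)$$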

Note  that  Fp  has  the  same  dimensions  as  the  facet  matrix 
F.  In  fact,  in  the  absence  of  interreflections,  K  is  a  null 
matrix  and  Fp  =  F.  In  the  presence  of  interreflections,  Fp 
may  be  viewed  as  representing  another  Lambertian  surface 
whose  shape  and  reflectance  differ  from  those  of  F.  There¬ 
fore,  if  photometric  stereo  is  applied  to  the  concave  surface, 
the  extracted  shape  and  reflectance  is  Fp  and  not  the  ac¬ 
tual  shape  and  reflectance  given  by  F.  We  refer  to  Fp 
as  the  pseudo  facet  matrix;  it  represents  the  pseudo  shape 
and  pseudo  reflectance  that  are  extracted  in  the  presence 
of  interreflections. 

In the case of photometric stereo, three different source
directions, $\mathbf{s}_1$, $\mathbf{s}_2$, and $\mathbf{s}_3$, are used sequentially to illuminate
the surface. The three resulting surface radiance
vectors $\mathbf{L}_1$, $\mathbf{L}_2$, and $\mathbf{L}_3$ may be expressed as:

$$[\mathbf{L}_1, \mathbf{L}_2, \mathbf{L}_3] = \mathbf{F}_p\, [\mathbf{s}_1, \mathbf{s}_2, \mathbf{s}_3] \qquad (11)$$

The pseudo facet matrix is computed as:

$$\mathbf{F}_p = [\mathbf{L}_1, \mathbf{L}_2, \mathbf{L}_3]\, [\mathbf{s}_1, \mathbf{s}_2, \mathbf{s}_3]^{-1} \qquad (12)$$

The pseudo facet $i$ in $\mathbf{F}_p$ may be written as:

$$\mathbf{N}_{pi} = \frac{\rho_{pi}}{\pi}\, \mathbf{n}_{pi} \qquad (13)$$

where $\mathbf{n}_{pi}$ and $\rho_{pi}$ are the pseudo surface normal and the
pseudo albedo for the facet $i$ and, in the presence of interreflections,
differ from the actual surface normal and
actual albedo of the facet.
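
The relation between the pseudo facet matrix and photometric stereo can be checked with a small numerical sketch. It is an illustration only: the function names are hypothetical, the source matrix is assumed to hold the three source vectors as its columns, shadows are ignored, and the albedo matrix is assumed to be diagonal with entries $\rho_i/\pi$.

```python
import numpy as np


def pseudo_facets(F, P, K):
    """Pseudo facet matrix F_p = (I - P K)^-1 F (equation 10)."""
    return np.linalg.solve(np.eye(F.shape[0]) - P @ K, F)


def photometric_stereo(radiances, sources):
    """Recover the facet matrix from radiances (equation 12).

    `radiances` is m x 3 (one column per source) and `sources` is
    3 x 3 with the three source vectors as columns.
    """
    return radiances @ np.linalg.inv(sources)


def split_facets(facets):
    """Split rows N = (rho / pi) n into albedos and unit normals."""
    norms = np.linalg.norm(facets, axis=1)
    return np.pi * norms, facets / norms[:, None]
```

Simulating radiances as `pseudo_facets(F, P, K) @ sources` and passing them to `photometric_stereo` returns the pseudo facet matrix rather than `F`, for any invertible choice of `sources`. Setting `K` to zero makes the two matrices coincide, matching the no-interreflection case.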

We  conclude  this  section  by  highlighting  three  impor¬ 
tant  properties  of  the  pseudo  shape  and  reflectance: 

• The pseudo shape and reflectance are illumination invariant.
In equation 10, note that the albedo matrix
P,  the  kernel  matrix  K,  and  the  actual  facet  matrix 
F  are  all  invariant  to  the  direction  and  intensity  of 
the  illumination.  As  a  result,  the  matrix  Fp  is  also 
illumination invariant. It is independent of the source
directions  used  by  the  shape-from-intensity  method 
to  illuminate  the  surface. 

•  The  pseudo  shape  and  reflectance  are  unique.  From 
equation  10  we  see  that  the  pseudo  facet  matrix  Fp 
is  dependent  on  the  actual  facet  matrix  F,  the  albedo 
matrix  P,  and  the  kernel  matrix  K.  Note  that  P  and 
K  are  in  turn  determined  by  F.  Hence,  Fp  is  de¬ 
pendent  only  on  F.  In  other  words,  there  exists  only 
a  single  pseudo  shape  and  pseudo  reflectance  corre¬ 
sponding  to  a  given  actual  shape  and  reflectance. 


Figure 3: A few actual shapes (with $\rho = 0.95$) and their
pseudo  shapes. 


•  The  pseudo  shape  tends  to  be  less  concave  than  the 
actual  shape  of  the  surface.  A  proof  of  this  property 
is provided in [Nayar 90]. Figure 3 illustrates this
property through a few examples of actual shapes and
pseudo shapes. All the surfaces are assumed to have a
constant albedo value, $\rho = 0.95$. The pseudo shapes
are  computed  using  equation  10  and  are  seen  to  be 
less  concave  than  the  actual  shapes.  Figure  4  shows 
that  the  pseudo  shape  gets  less  concave  as  albedo  in¬ 
creases. 

4  Colored  and  Multi-Colored  Surfaces 

While  developing  the  interreflection  model  (Section  2),  we 
assumed  a  given  surface  facet  has  a  constant  albedo  value. 
This assumption is only valid under either one of the following
two conditions. (a) The surfaces are not colored
but rather of different shades of gray and they reflect all
wavelengths of incident light equally without attenuating
some wavelengths more than others. (b) The incident light
is  monochromatic  and  therefore  only  light  rays  of  a  single 
wavelength  are  reflected  and  interreflected  by  the  surface 
points.  In  the  second  case,  since  we  are  only  concerned 
with  a  single  wavelength,  each  point  on  the  surface  may 
be  assumed  to  have  a  constant  albedo  value,  namely,  the 
albedo  value  for  the  given  wavelength  of  incident  light. 


[Figure 4 panels: Actual Shape; Pseudo Shape]


Figure  4:  The  difference  between  the  actual  and  pseudo 
shapes  increases  with  surface  albedo. 
In the case of colored surfaces, the albedo of a facet
would  also  depend  on  the  spectral  distribution  of  the  inci¬ 
dent  light.  In  fact,  with  each  bounce  the  spectral  distribu¬ 
tion  of  the  reflected  light  would  be  altered  depending  on 
the  spectral  characteristics  of  the  reflecting  surfaces.  Con¬ 
sider,  for  example,  a  concave  surface  that  has  two  regions 
of  different  colors,  say,  red  and  blue.  When  the  surface  is 
illuminated,  the  red  and  blue  regions  interreflect  incident 
light.  If  the  surface  is  illuminated  using  white  light,  the 
blue  region  would  reflect  only  blue  light  onto  the  red  re¬ 
gion.  Therefore,  a  red  surface  point  receives  light  rays  of 
different  spectral  content;  it  receives  white  light  from  the 
source,  blue  light  from  the  blue  region,  and  red  light  from 
other  red  surface  points.  Further,  the  spectral  distribution 
of  light  reflected  by  the  red  surface  point  depends  not  only 
on  the  spectral  distribution  of  light  received  by  it  but  also 
its  own  spectral  characteristics  (reflectance  properties).  As 
a  result  of  the  above  effect,  the  color  of  a  given  surface  re¬ 
gion  may  be  altered  due  to  reflections  from  neighboring 
regions  that  have  different  colors.  This  phenomenon  is  of¬ 
ten  referred  to  as  color  bleeding.  The  reader  is  directed 
to  [Bajcsy  89],  [Brill  89],  [Novak  90],  and  [Drew  90]  for  a 
more  detailed  discussion  on  this  effect. 

In the case of colored surfaces, therefore, our assumption
that each surface point has a constant albedo value is
no longer valid; the albedo of a point would depend on
its color and the spectral distribution of the incident light,
where incident light includes light received from other surface
points. However, from condition (b) we know that,
for a given wavelength, a surface point (irrespective of
its color) can be assumed to have a unique albedo value.
Therefore, pseudo shape and reflectance estimates for a
multi-colored surface can be computed by using a narrow-band
filter at the sensor end so that only light waves of
a particular wavelength are detected by the sensor. Note
that the pseudo shape and reflectance will in general be
different for different wavelengths.

Typically,  color  images  are  obtained  using  three  narrow- 
band  filters  at  the  sensor  end.  These  three  filters  have 
narrow-band  spectral  responses  in  the  red,  green,  and 
blue  regions  of  the  visible  light  spectrum.  We  will  re¬ 
fer  to  the  three  images  produced  as  the  red,  green,  and 
blue  bands  of  the  color  image.  Since  the  three  filters  are 
narrow-band  filters,  the  image  produced  by  each  filter  rep¬ 
resents  reflections  and  interreflections  of  almost  a  single 
wavelength  of  light.  Hence,  a  pseudo  shape  and  pseudo 
albedo  function  for  the  surface  can  be  computed  using  each 
band of the color image. The pseudo shapes and albedo
functions for the three bands are expected to be different
since each point may have a different actual albedo for
the three different wavelengths of light. Hence, the pseudo
facet matrices for the three bands may be written as:

$$F_p^R = (I - P^R K)^{-1} F^R \qquad (14)$$

$$F_p^G = (I - P^G K)^{-1} F^G$$

$$F_p^B = (I - P^B K)^{-1} F^B$$

where $P^R$, $P^G$, and $P^B$ are the albedo matrices of the
surface for the three wavelengths determined by the three filters.
Note that the interreflection kernel K is a purely
geometrical quantity and hence remains the same for all
wavelengths. $F^R$, $F^G$, and $F^B$ represent the actual shape
and color of the surface.

The three pseudo shapes and the pseudo color of the surface
may be estimated using photometric stereo. Let the
three sources used by photometric stereo have directions
$s_1$, $s_2$, and $s_3$. We assume that all three sources emit
white light and hence have the same radiant intensity for
the three wavelengths. Then, the pseudo facet matrices
for the different color bands are determined as:

$$F_p^R = [L_1^R, L_2^R, L_3^R]\,[s_1, s_2, s_3]^{-1} \qquad (15)$$

$$F_p^G = [L_1^G, L_2^G, L_3^G]\,[s_1, s_2, s_3]^{-1}$$

$$F_p^B = [L_1^B, L_2^B, L_3^B]\,[s_1, s_2, s_3]^{-1}$$

The  above  equations  indicate  that  the  same  surface  pro¬ 
duces  three  different  pseudo  shapes  and  albedo  functions. 
By noting that the three bands of a color image correspond
to  different  wavelengths  of  light,  we  have  been  able  to  de¬ 
compose  the  interreflection  process  into  three  independent 
processes.  Having  done  this  we  are  in  a  position  to  analyze 
the  interreflections  in  each  band  independent  of  the  other 
bands. 
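
As an illustration of equation 15, the per-band computation can be sketched in a few lines of Python with NumPy. This is only a sketch under stated assumptions, not the implementation used for the results reported here: the function name `pseudo_facets`, the array layout (one row of L per facet, one column per source), and the example source directions are hypothetical.

```python
import numpy as np

def pseudo_facets(L, S):
    """Pseudo facet matrix F_p = L S^{-1} for one color band (equation 15).

    L : (n, 3) array, radiance of n facets under the three sources (assumed layout).
    S : (3, 3) array whose columns are the unit source directions s_1, s_2, s_3.
    """
    # Solve F_p S = L without forming the inverse explicitly.
    return np.linalg.solve(S.T, L.T).T

# Hypothetical source directions (columns are unit vectors).
S = np.array([[0.0, 0.5, 0.0],
              [0.0, 0.0, 0.5],
              [1.0, np.sqrt(0.75), np.sqrt(0.75)]])

# Synthetic facets with no interreflection, so the pseudo facets equal the true ones.
F_true = np.array([[0.0, 0.0, 0.30],
                   [0.1, 0.0, 0.25]])
# The three bands are processed independently; here they differ only in brightness.
L_bands = {"R": F_true @ S, "G": 0.5 * (F_true @ S), "B": 0.2 * (F_true @ S)}
F_p = {band: pseudo_facets(L, S) for band, L in L_bands.items()}
```

The same source matrix S is shared by the three bands, which is what the white-light assumption in the text amounts to in this sketch.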

5  Recovering  Actual  Shape  and  Color 

In [Nayar 91] an algorithm is developed that uses equation
10 to iteratively recover the actual shape and reflectance
of  a  gray  surface  from  its  pseudo  shape  and  reflectance. 
Here,  we  briefly  describe  the  algorithm  for  gray  surfaces 
and  then  extend  it  to  colored  surfaces.  At  first,  photomet¬ 
ric  stereo  is  applied  to  the  scene.  If  the  scene  consists  of  a 
single  convex  surface,  the  extracted  pseudo  shape  and  re¬ 
flectance  are  simply  the  actual  ones.  However,  if  the  scene 
consists  of  concavities,  the  pseudo  shape  and  reflectance 
differ  from  the  actual  ones.  As  we  showed  in  Section  3, 
the  pseudo  shape  is  a  shallower  (less  concave)  version  of 
the  actual  shape.  Hence,  the  algorithm  uses  the  pseudo 
shape  and  reflectance  as  conservative  initial  estimates  of 
the  actual  shape  and  reflectance,  to  compute  initial  esti¬ 
mates  for  the  albedo  matrix  P  and  the  kernel  matrix  K. 
The  computed  P,  K,  and  the  pseudo  facets  Fp  are  then 
inserted  in  equation  10  to  obtain  the  next  estimate  of  the 
actual  facets.  This  estimate  of  the  surface  is  expected  to  be 
more  concave  than  the  previous  one  and  is  used  in  the  next 
iteration  to  obtain  an  even  better  estimate.  The  algorithm 
may hence be written as:

$$F^{k+1} = (I - P^k K^k)\, F_p \qquad (16)$$

where $F^0 = F_p$.

In the above equation, $P^k = P(F^k)$ and $K^k = K(F^k)$.
Note  that  each  estimate  of  F  provides  estimates  of  both 
shape  and  reflectance.  With  each  iteration,  more  accurate 
estimates  of  shape  and  reflectance  are  obtained  and  the 
result  finally  converges  to  the  actual  shape  and  reflectance. 


In implementing the algorithm, the surface is assumed
to be continuous. The interreflection kernel depends not
only on the orientations of individual facets but also on
their relative positions. Therefore, a depth map of the
scene must be reconstructed (by integration) from the orientation
map computed in each iteration of the algorithm.
The continuity assumption is necessary to ensure integrability
of the orientation maps.

Several  experiments  were  conducted  on  real  surfaces  and 
the  results  indicate  that  the  algorithm  is  robust  and  accu¬ 
rate  [Nayar  90].  The  convergence  properties  of  the  algo¬ 
rithm  are  discussed  in  detail  in  [Nayar  90]. 
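
The iteration of equation 16 can be sketched as follows. This is a hedged illustration rather than the algorithm of [Nayar 91] itself: we assume that each row of a facet matrix is a unit normal scaled by that facet's albedo term, so that the diagonal of P can be read off as the row norms, and we leave the geometric kernel as a user-supplied callback (the hypothetical argument `kernel`), since computing it requires the depth-map integration described above.

```python
import numpy as np

def recover_facets(F_p, kernel, iterations=10):
    """Iterate F^{k+1} = (I - P^k K^k) F_p, starting from F^0 = F_p.

    F_p    : (n, 3) pseudo facet matrix.
    kernel : callable mapping a facet matrix to an (n, n) kernel matrix K(F);
             its construction from a depth map is outside this sketch.
    """
    I = np.eye(F_p.shape[0])
    F = F_p.copy()
    for _ in range(iterations):
        # Assumption: the albedo term of each facet is the norm of its row.
        P = np.diag(np.linalg.norm(F, axis=1))
        K = kernel(F)
        F = (I - P @ K) @ F_p
    return F

# Synthetic check with a fixed (shape-independent) kernel, purely for illustration.
rng = np.random.default_rng(0)
normals = rng.normal(size=(5, 3))
normals /= np.linalg.norm(normals, axis=1, keepdims=True)
F_actual = 0.3 * normals
K_fixed = 0.05 * (np.ones((5, 5)) - np.eye(5))
P_actual = np.diag(np.linalg.norm(F_actual, axis=1))
F_pseudo = np.linalg.solve(np.eye(5) - P_actual @ K_fixed, F_actual)
F_recovered = recover_facets(F_pseudo, lambda F: K_fixed, iterations=10)
```

With a weak kernel such as the one above the update is a strong contraction, so a handful of iterations suffices; for real surfaces the convergence behaviour is the one analyzed in the cited work, not something this toy example establishes.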

We now extend the algorithm to recover true shapes and
colors of colored surfaces. First, equation 15 is used to estimate
the three pseudo shapes and the pseudo color of the
surface using photometric stereo. Then, the actual shape
and color of the surface are iteratively recovered from the
pseudo estimates. Here, the recovery algorithm given by
equation 16 is applied to each of the three pseudo facet
matrices $F_p^R$, $F_p^G$, and $F_p^B$. The recovery algorithm for
colored surfaces may be written as:

$$F^{R,k+1} = (I - P^{R,k} K^k)\, F_p^R \qquad (17)$$

$$F^{G,k+1} = (I - P^{G,k} K^k)\, F_p^G \qquad (18)$$

$$F^{B,k+1} = (I - P^{B,k} K^k)\, F_p^B \qquad (19)$$

where:

$$F^{R,0} = F_p^R, \quad F^{G,0} = F_p^G, \quad F^{B,0} = F_p^B \qquad (20)$$


Note that the recovery algorithm is applied independently
to the three pseudo facet matrices $F_p^R$, $F_p^G$, and $F_p^B$ to
recover the facet matrices $F^R$, $F^G$, and $F^B$, respectively.
Therefore, we obtain three estimates of the actual shape of
the surface, one from each of the three pseudo
facet matrices. The three shape estimates are identical and
correspond to the true shape of the colored surface. Therefore,
the shape of the surface may be recovered from any
one of the three pseudo facet matrices. However, in order
to recover the actual color at each point of the surface, we
need to recover all three facet matrices.

The  above  algorithm  is  applicable  to  multi-colored  sur¬ 
faces,  i.e.  surfaces  that  are  comprised  of  regions  of  different 
colors.  Once  again,  the  surface  is  assumed  to  be  contin¬ 
uous  to  ensure  integrability  of  orientation  maps  produced 
in  each  iteration.  We  also  assume  that  each  surface  point 
has  a  non-zero  albedo  value  for  all  three  wavelengths  of 
the  color  image.  This  assumption  is  not  a  severe  one  as 
most  surfaces  encountered  in  practice  have  non-zero  albedo 
values  for  all  wavelengths  in  the  visible  spectrum. 


6  Results 

The  recovery  algorithm  has  been  tested  on  a  variety  of 
gray  surfaces  and  the  results  indicate  that  the  algorithm 
is  robust  and  accurate  in  recovering  the  actual  shape  and 
reflectance  of  gray  surfaces  [Nayar  90].  Here,  we  present 
the  simulation  results  of  applying  the  algorithm  to  colored 
surfaces. The surfaces are three-dimensional with translational
symmetry in one direction. The interreflection
kernel for the translational symmetry case was derived
by Forsyth and Zisserman [Forsyth 89] and is given in
Appendix A.1.

The example shown in Figure 5 is a bucket-shaped surface
that has three faces with different colors, namely, red,
white, and blue. The red face of the surface reflects more
red light than any other wavelength in the visible spectrum.
Note that the red region does reflect green and blue
light but in smaller amounts relative to red light. We assume
that the color images are obtained using three narrow-band
filters that pass red light, green light, and blue light. Each
color image therefore has three bands, namely, the red
band, the green band, and the blue band. The actual color of
the surface is shown by plotting the three albedo functions
(along the surface cross-section) for the three wavelengths
(red, green, and blue). Within each of the three regions
(A, B, and C) on the surface, the albedo functions are constant
since all surface points within a region have the same
color.

Photometric stereo is used to compute a pseudo shape
for each band of the color image. The three pseudo shapes
are shown in Figure 5. Note that the pseudo shapes produced
by the three bands are different. The pseudo shape
and albedo produced by the green band are symmetric since
the actual shape and the actual albedo function for green
light are symmetric. On the other hand, the pseudo shapes
and albedo functions for the red and blue bands are asymmetric
since the actual albedo functions of the surface for
red and blue light are asymmetric. The pseudo albedo values
are often greater than the actual albedo values, sometimes
exceeding unity. This results from the fact that interreflections
always increase the radiance of a surface point.

At the bottom of Figure 5 we show the results of applying
the recovery algorithm given by equations 17 to 19. For
all three bands the algorithm successfully recovers the actual
shape and albedo of the surface. Note that the algorithm
must be applied to all three bands to recover the
true color of the surface. To recover the shape, however,
the algorithm need be applied to only one of the three
bands.

Figures 6 and 7 show results obtained for other multi-colored
surfaces. In all the results included here and numerous
other unreported results, the algorithm converged
(with near zero error) in about 8 iterations.

Figure 5: Results: Multi-colored surface with a bucket-shaped cross-section.


7  Conclusion 

We  have  presented  an  algorithm  for  recovering  the  shape 
and  reflectance  of  colored  surfaces  in  the  presence  of  dif¬ 
fuse  interreflections.  The  surfaces  may  be  of  arbitrary  but 
continuous  shape,  and  with  possibly  varying  and  unknown 
color. 

We showed that for a Lambertian surface the shape and
color extracted by shape-from-intensity methods are incorrect
in the presence of interreflections. Interreflections can
cause the color of one region on a surface to bleed onto a
neighboring region. In general these effects are difficult to
analyze in a multi-dimensional color space. Noting that
the three different bands of a color image correspond to
different wavelengths of light, we decomposed the interreflection
process into three independent processes. We
then showed that the pseudo shape and color extracted
using photometric stereo, though incorrect, can be related
to the actual shape and color of the surface. The pseudo
estimates are shown to have interesting invariance properties;
for any given Lambertian surface there exists a single
pseudo shape and pseudo color that are invariant to the
source directions used to recover them.

Using  the  properties  of  the  pseudo  surface,  we  developed 
an  algorithm  that  recovers  the  actual  shape  and  color  of 
the  surface  from  the  pseudo  estimates.  The  algorithm  was 
applied  to  a  variety  of  multi-colored  test  surfaces  and  was 
shown  to  be  robust  and  accurate.  Motivated  by  the  re¬ 
sults  presented  in  this  paper,  we  are  currently  conducting 
experiments  on  real  multi-colored  surfaces. 

A  Appendix 

A.1 Kernel for Translational Symmetry Case


Figure 8: Cross-sectional view of two planar facets that
are infinite in the x direction.

Forsyth and Zisserman [Forsyth 89] have derived the interreflection
kernel for the special case of two planar facets
that have translational symmetry in a single direction. Figure
8 shows a cross-sectional view of two such facets that
are infinite in the x direction. The kernel $K_{ij}$ is derived
[Forsyth 89] by integrating, along the x and y directions,
the contribution of all points on facet j to the radiance of
a point on facet i:


$$K_{ij} = [\,\ldots\,]\left[\,1 - \frac{c + u^{*}\cos\alpha}{\left(c^{2} + 2\,c\,u^{*}\cos\alpha + u^{*2}\right)^{1/2}}\,\right] \qquad (21)$$

where α is the angle between the surface normal vectors
of  the  two  facets,  and  the  parameter  u*  represents  the 
cross-sectional  length  of  the  facet  j.  Since  both  facets  are 
infinite  in  length,  the  same  kernel  is  valid  for  all  points  on 
the  facet  i.  Therefore,  under  the  translational  symmetry 
assumption,  the  kernel  need  only  be  evaluated  for  points 
along  the  cross-section  of  the  surface.  Note  that  the  above 
kernel  is  valid  only  for  surfaces  that  are  infinite  in  the 
direction  of  symmetry.  However,  the  kernel  serves  as  a 
good  approximation  [Forsyth  89]  for  points  that  lie  around 
the  middle  of  a  surface  that  is  long  though  finite  in  the 
direction  of  symmetry. 
Abstract

Robust  high  breakdown  point  techniques  are 
gaining  increasing  popularity  in  computer  vi¬ 
sion  since  they  can  tolerate  up  to  half  the  data 
being  severely  corrupted.  The  least  median  of 
squares  (LMedS)  estimator  is  the  best  known 
example  of  such  techniques.  We  show  that  the 
attractive  properties  of  the  LMedS  estimator 
do  not  hold  when  all  the  data  is  corrupted  by 
zero mean noise. We propose a new “consensus
by decomposition” (CBD) algorithm which
preserves the properties of LMedS up to low
signal-to-noise ratios while achieving a significant
speed-up relative to LMedS. The CBD estimator
uses a different paradigm than LMedS.
The data is decomposed in both the spatial and
parameter domains. A separate distribution is
built for every parameter and the distribution
is analyzed with a new, enhanced mode detection
procedure. The superiority of the CBD
estimator is proved by extensive simulations.

1  Introduction 

The  riddle  of  the  “chicken  and  egg”  is  a  well  known 
teaser.  Which  one  was  first?  Since  each  one  has  a  causal¬ 
ity  relation  with  the  other  neither  of  them  can  be  chosen 
as primal. In computer vision similar situations are often
met. For example, consider the problem of segmentation.
To segment an image we must delineate regions
homogeneous  under  a  given  measure.  The  boundaries 
of  the  regions  are  the  discontinuities  in  the  homogeneity 
measure.  Most  of  the  methods  of  detecting  homogeneity, 
however,  cannot  handle  data  containing  discontinuities. 
To  avoid  erroneous  results  we  would  like  to  analyze  the 
data  only  far  away  from  discontinuities.  But  the  loca¬ 
tions  of  the  discontinuities  are  not  available  since  this  is 
one  of  the  reasons  why  we  want  to  segment  the  image! 
See  [6]  for  a  discussion  of  the  segmentation  problem. 

We conclude that the “chicken and egg” problem is
caused by the inability of our methods to handle piecewise
data, i.e., data containing discontinuities. The
weighted least squares (WLS) estimator is often a basic
computational module of such methods. The limitations
of the WLS estimator and its properties that are relevant
to computer vision are discussed in Section 2.

The support of the Air Force Office of Scientific Research
under Grant AFOSR-86-0092 is gratefully acknowledged.

In the last two decades, to avoid the pitfalls of least
squares, robust estimation techniques were developed by
statisticians. Recently these robust techniques have become
popular in computer vision. See [5] for a survey
of robust techniques as applied in computer vision. The
class of high breakdown point robust operators can eliminate
the above mentioned “chicken and egg” situation.
These  operators  can  tolerate  a  discontinuity  and  recover 
the  model  corresponding  to  the  absolute  majority  of  the 
data.  The  least  median  of  squares  (LMedS)  estimator, 
proposed  by  Rousseeuw  in  1984,  is  the  most  frequently 
used  in  computer  vision.  However,  the  LMedS  estimator 
was  developed  for  statistical  applications  and  its  utility 
in  solving  computer  vision  problems  cannot  be  taken  for 
granted.  In  Section  3  we  describe  LMedS  and  show  its 
limitations  for  noisy  data. 

In  Section  4  we  propose  a  new  high  breakdown  robust 
algorithm  which  we  call  “consensus  by  decomposition” 
(CBD).  The  CBD  estimator  retains  the  desirable  prop¬ 
erties  of  both  the  least  squares  and  the  least  median  of 
squares  estimators  while  eliminating  most  of  their  draw¬ 
backs.  The  algorithm  was  developed  for  computer  vision 
applications  and  its  characteristics  are  analyzed  through 
simulation  studies.  Some  of  the  questions  raised  by  the 
CBD estimator, its relation to the other robust techniques
used in computer vision, and the consensus vision
paradigm  are  discussed  in  Section  6. 

2  The  Weighted  Least  Squares 
Estimator 

The weighted least squares (WLS) estimator is a frequently
used tool in computer vision. We only mention
here the properties used later in the paper. For proofs
and/or more detailed discussions see a textbook on linear
estimation. Let $z_{1,1} \ldots z_{n,m}$ be the data and assume
that it can be modeled by a linear model characterized
by p parameters $\beta_k$, $k = 0 \ldots (p-1)$. Thus

$$z_{i,j} = \sum_{k=0}^{p-1} \beta_k\, x_k(i,j) + v_{i,j} \qquad (1)$$

where the p regressor variables $x_k(i,j)$ are specified for
every sampling site $i = 0 \ldots n$; $j = 0 \ldots m$; and
$v_{i,j}$ represents the deviation from the model, i.e., the
noise. The problem of non-linear models and of uncertainty
about the regressor variables is beyond the scope
of this paper.

2.1  Properties  of  WLS 

In matrix notation (1) becomes

$$z = X\beta + v \qquad (2)$$

where z is an nm x 1 vector of data, X is an nm x p matrix
of regressor variables, β is the unknown p x 1 vector of
regression coefficients, and v is an nm x 1 noise vector. To
find the weighted least squares estimate of the parameter
vector, we must minimize the quadratic form

$$(z - X\beta)^t\, W\, (z - X\beta) \qquad (3)$$

where W is a symmetric, positive definite nm x nm
weight matrix, and the superscript t means transpose.

The minimization of (3) yields a reliable estimate only
if the noise has zero mean, i.e., E[v] = 0. (E[ ] is the
expectation operator.) The noise, however, can be correlated,
having the covariance matrix cov[v] = V. The
weighted least squares estimate is obtained by computing
the expression

$$\hat\beta = (X^t W X)^{-1} X^t W z. \qquad (4)$$

If the noise has zero mean the estimate (4) is unbiased:

$$E[\hat\beta] = \beta. \qquad (5)$$

This very important property is satisfied for any weight
matrix W. If we choose $W = V^{-1}$, $\hat\beta$ is also the
minimum-variance estimate among all unbiased estimates
linearly related to the measurements. In this case

$$\mathrm{cov}[\hat\beta] = (X^t V^{-1} X)^{-1}. \qquad (6)$$

If in addition v has a multivariate normal distribution,
$\hat\beta$ is also the maximum likelihood estimate.
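
Expressions (4) and (6) translate directly into a few lines of linear algebra. The following Python/NumPy sketch is illustrative only; the function names `wls_estimate` and `wls_covariance` and the small example design matrix are our own assumptions.

```python
import numpy as np

def wls_estimate(X, z, W):
    """Weighted least squares estimate (X^t W X)^{-1} X^t W z, as in (4)."""
    XtW = X.T @ W
    # Solve the normal equations rather than forming the inverse explicitly.
    return np.linalg.solve(XtW @ X, XtW @ z)

def wls_covariance(X, V):
    """Covariance (X^t V^{-1} X)^{-1} of the estimate when W = V^{-1}, as in (6)."""
    return np.linalg.inv(X.T @ np.linalg.solve(V, X))

# Hypothetical example: a line z = 2 + 0.5 x sampled at three sites, no noise.
X = np.array([[1.0, 0.0],
              [1.0, 1.0],
              [1.0, 2.0]])
beta_true = np.array([2.0, 0.5])
z = X @ beta_true
W = np.diag([1.0, 2.0, 3.0])      # any symmetric positive definite weights
beta_hat = wls_estimate(X, z, W)
cov_iid = wls_covariance(X, np.eye(3))
```

With noise-free data the estimate reproduces the true parameters for any admissible W, which is a degenerate instance of the unbiasedness property (5).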

2.2  WLS  in  Computer  Vision 

Minimum-variance estimation, however, may not suffice
in computer vision applications in which a local operator
deals with small sample sizes. For example, let us
estimate the coefficients of a planar fit (p = 3) in the
presence of i.i.d. noise ($\mathrm{cov}[v] = \sigma^2 I$) for data defined
in an odd-sized square window on a dense sampling grid.
The model is then of the form

$$z_{i,j} = \beta_0 + \beta_1 i + \beta_2 j, \qquad i, j = -n \ldots n. \qquad (7)$$

Since we have assumed i.i.d. noise we must perform
unweighted least squares for minimum-variance estimation,
and it is immediate to obtain from (6)

$$\mathrm{cov}[\hat\beta] = \sigma^2 \begin{pmatrix} \frac{1}{(2n+1)^2} & 0 & 0 \\ 0 & \frac{3}{n(n+1)(2n+1)^2} & 0 \\ 0 & 0 & \frac{3}{n(n+1)(2n+1)^2} \end{pmatrix} \qquad (8)$$



The  covariance  matrix  is  diagonal  and  thus  the  esti¬ 
mates  of  the  three  parameters,  the  intercept  and  the  two 
slopes,  are  uncorrelated.  This  is  due  to  the  fact  that  the 
column  vectors  of  X  are  orthogonal.  It  is  important  to 
notice  that  for  a  higher  degree  polynomial  surface  model, 
or  for  data  defined  on  a  sparse  grid,  the  column  vectors 
are  not  necessarily  orthogonal  and  some  of  the  estimate 
vector  components  may  become  correlated. 

The estimate is unbiased and we can use as a simple
quality measure the ratio between the correct parameter
value $\beta_k$ and the standard deviation of its estimate
taken from (8). We will consider an estimate reliable if
the ratio is larger than 5. (The more accurate measure
of a confidence interval involves additional assumptions
about the noise.) Thus we can obtain the normalized
upper bound for the noise standard deviation yielding
reliable slope estimates (k = 1 or 2). In Table 1 the
values of this bound are shown for several window sizes.

Table 1: Dependence of the normalized noise upper
bound on window size.

n    Window Size    Upper Bound
1    3 x 3          0.49
2    5 x 5          1.41
3    7 x 7          2.80
4    9 x 9          4.65
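
The diagonal form of (8) and the resulting bound can be checked numerically. The sketch below is a hedged illustration: the helper names are hypothetical, and the bound is taken to be the largest noise standard deviation, normalized by the slope value, for which the ratio between the slope and the standard deviation of its estimate stays above 5.

```python
import numpy as np

def planar_design(n):
    """Regressor matrix of model (7) on a (2n+1) x (2n+1) window, columns [1, i, j]."""
    idx = np.arange(-n, n + 1)
    i, j = np.meshgrid(idx, idx, indexing="ij")
    return np.column_stack([np.ones(i.size), i.ravel(), j.ravel()])

def slope_variance_factor(n):
    """Slope entry of (8) divided by sigma^2."""
    return 3.0 / (n * (n + 1) * (2 * n + 1) ** 2)

def noise_upper_bound(n, ratio=5.0):
    """Largest sigma / beta_k for which beta_k / std(beta_k estimate) >= ratio."""
    return 1.0 / (ratio * np.sqrt(slope_variance_factor(n)))

for n in range(1, 5):
    X = planar_design(n)
    cov_over_sigma2 = np.linalg.inv(X.T @ X)   # unweighted LS, cov[v] = sigma^2 I
    print(2 * n + 1, np.round(np.diag(cov_over_sigma2), 5), round(float(noise_upper_bound(n)), 2))
```

The orthogonality of the columns of X on the symmetric window is what makes the computed covariance diagonal; on an asymmetric or sparse grid the off-diagonal terms would in general not vanish.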

In images the slope values have the order of magnitude
of units. For example, if the plane is slanted at 45 degrees
($\beta_1 = 1$) a 5 x 5 window is able to handle only $\sigma < 1.41$,
not far from the range of quantization noise. Similar
bounds can be obtained for the coefficients of higher degree
polynomial surfaces. We conclude that the small
windows usually employed in computer vision for local
estimation of polynomial surfaces cannot return reliable
parameter estimates even in the presence of negligible
noise. Only the intercept estimate is reliable, a fact used
in image smoothing operators applied at every pixel.

The implicit assumption behind the model (1) is that
the entire data set can be characterized by only one parameter
vector β. For piecewise data, a case frequently
met in computer vision but less common in statistics,
this is not true. In piecewise data a model discontinuity
is also present and (at least) two parameter vectors are
required to describe the entire data set. For example,
changes in reflectance or depth yield piecewise data.

It is well known that least squares estimation cannot
handle piecewise data. This is caused by the violation of
the zero mean noise assumption. Indeed, if we assume
the validity of (1) we must take one of the models as
fitting all the data. Then data not represented by this
model has to be regarded as corrupted by an additional
non-zero mean noise process. It is shown in Section 3
that model estimation for piecewise data can be achieved
only by high breakdown point robust techniques which
tolerate non-zero mean noise.

Least squares polynomial surface estimation of piecewise
constant data (i.e., containing a step edge) has an
important property of which we make use later. Let
the step edge in the data be of height h and corrupted
with noise having standard deviation σ. We estimate
the surface in a window centered on the discontinuity.
The simplest, degree-0 model is that of a constant plane
(p = 1). The least squares estimated value yields a
plane lying between the two original horizontal surfaces
that form the step edge. When a planar, degree-1 model
(p = 3) is used, the estimated plane lies across the edge
and intersects both surfaces.


Figure 1: Comparison between the normalized standard
deviation estimates computed with least squares (LS)
estimators. Left histogram: constant surface. Right
histogram: step edge. a) Degree-0 LS, SNR = 25. b)
Degree-0 LS, SNR = 8. (Vertical axes: freq.)

Both fits are erroneous and do not correspond to either
of the original surfaces. However, up to moderate signal-to-noise
ratios (SNR), when the standard deviation of
the residuals (the noise's standard deviation estimate) is
computed the degree-0 model yields an increase relative
to σ while the degree-1 model may not. Thus, when a
window operator performing degree-0 least squares estimation
(averaging) slides along piecewise constant noisy
data, the standard deviation estimate increases in the
neighborhoods of step edges. The increase is significant
enough to be used for coarse detection of the presence of
a model discontinuity (edge).


To illustrate the effect, the standard deviation of the
residuals was computed in a 3 x 3 window for constant
and planar models. The data was either a two-dimensional
step edge with amplitude h or a horizontal
surface, both corrupted with i.i.d. Gaussian noise
$N(0, \sigma^2)$. The signal-to-noise ratio is defined as
$\mathrm{SNR} = (h/\sigma)^2$. To facilitate comparison the estimated standard
deviation is normalized by σ.


Figure 2: Comparison between the normalized standard
deviation estimates computed with least squares (LS)
estimators. Left histogram: constant surface. Right
histogram: step edge. a) Degree-1 LS, SNR = 25. b)
Degree-1 LS, SNR = 8. (Vertical axes: freq.)

In Figures 1 and 2 the results are presented as histograms of
the normalized standard deviation estimates computed
from 5000 trials for SNRs of 25 and 8. Each figure contains
two overlaid histograms. The peak at the left always
corresponds to the window containing the horizontal
surface, and the peak at the right to the window
containing the step edge. When the signal-to-noise ratio
is high (SNR = 25) and the degree-0 model (p = 1)
is used, the two histograms are clearly separated (Figure
1a). Thus a reliable indication of the presence of a
discontinuity is obtained. This result is widely applied
for adaptive filtering in the signal processing literature.
For an application in computer vision see [10].

If a degree-1 model (p = 3) is used a slight overlap
of the two histograms appears (Figure 2a) which
reduces the probability of discontinuity detection. A
similar overlap appears with the degree-0 model only at
SNR = 8 (Figure 1b). At such signal-to-noise ratios the
degree-1 model already yields almost completely overlapping
histograms (Figure 2b). Note that for constant
surfaces the degree-0 least squares standard deviation estimates
are unbiased, i.e., the peaks of their histograms
are always close to one. (The degree-1 model yields underestimates
since it absorbs some of the noise variations
into the model.) The use of larger (say 5 x 5) windows
decreases the spread of the degree-0 estimates but also
shifts the histogram for the edge downward.
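
The experiment behind these histograms is easy to re-create in outline. The following Python/NumPy sketch is not the original simulation code; it rests on several assumptions of ours: the step edge in the 3 x 3 window is placed between the first and second columns, the residual variance is divided by N - 1 for both models, and the function names and seed are arbitrary.

```python
import numpy as np

def design(degree, size=3):
    """Regressors for a degree-0 (constant) or degree-1 (planar) fit on a size x size window."""
    idx = np.arange(size) - size // 2
    i, j = np.meshgrid(idx, idx, indexing="ij")
    cols = [np.ones(i.size)]
    if degree == 1:
        cols += [i.ravel(), j.ravel()]
    return np.column_stack(cols)

def normalized_residual_std(windows, degree, sigma):
    """Least squares fit of every window; returns residual std / sigma per trial."""
    X = design(degree, windows.shape[1])
    Z = windows.reshape(windows.shape[0], -1).T          # shape (N, trials)
    beta, *_ = np.linalg.lstsq(X, Z, rcond=None)
    R = Z - X @ beta
    # Assumed normalization: N - 1 in the denominator for both models.
    return np.sqrt((R ** 2).sum(axis=0) / (Z.shape[0] - 1)) / sigma

def simulate(snr, trials=5000, sigma=1.0, seed=0):
    rng = np.random.default_rng(seed)
    h = sigma * np.sqrt(snr)                              # SNR = (h / sigma)^2
    step = np.zeros((3, 3))
    step[:, 1:] = h                                       # assumed edge position
    noise = rng.normal(0.0, sigma, size=(trials, 3, 3))
    data = {"flat": noise, "edge": step + noise}
    return {(name, d): normalized_residual_std(w, d, sigma)
            for name, w in data.items() for d in (0, 1)}

results = simulate(snr=25)
summary = {key: float(values.mean()) for key, values in results.items()}
```

Histogramming `results[("flat", 0)]` against `results[("edge", 0)]` gives the kind of two-peak picture described for the degree-0 model; with the N - 1 normalization assumed here the degree-1 estimates on the flat surface come out below one, in line with the remark about absorbed noise.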

To  conclude  this  section  we  emphasize  two  observa¬ 
tions: 

• When all the data belongs to one model corrupted
with zero mean noise the least squares estimates are
unbiased but may have a large standard deviation.

• At moderate signal-to-noise ratios the degree-0 least
squares estimate of the noise’s standard deviation
can be used as a warning about the presence of a step
edge in the data.

These  two  properties  will  be  of  importance  for  the  robust 
CBD  estimator  proposed  in  Section  4. 


3  The  Least  Median  of  Squares 
Estimator 

In this section we discuss a robust estimator which can
handle weakly corrupted piecewise data. For convenience
we describe the one-dimensional case; extension
to multi-dimensions is immediate.

3.1  Mean,  Median  and  Mode 

The ability of an estimator to handle severely corrupted
data is captured by its breakdown point $\epsilon^*$. This is the
smallest fraction of corrupted data which can drive the
estimate to arbitrary values. For example, the degree-0 least
squares estimate (the mean of n data points, $z_i$) has
$\epsilon^* = 1/n$, since one large valued erroneous point already
compromises the result. The asymptotic breakdown point of
the degree-0 LS estimator thus is 0. The zero breakdown
point property is common to all least squares estimators
[2, p. 328].
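The breakdown of the mean can be checked numerically. The following is a minimal sketch in Python with NumPy; the sample values and the helper name `corrupt_one` are made up for illustration and do not come from the paper. One of n = 9 points is replaced by an increasingly large value, and the mean is compared with the median.

```python
import numpy as np

# Illustrative data only: n = 9 samples of a constant surface near 100.
z = np.array([99.0, 101.0, 100.0, 98.0, 102.0, 100.0, 99.0, 101.0, 100.0])

def corrupt_one(data, value):
    """Return a copy of data with a single point replaced by value."""
    out = data.copy()
    out[0] = value
    return out

for bad in (1e3, 1e6, 1e9):
    zc = corrupt_one(z, bad)
    # The mean follows the single bad point without bound (breakdown 1/n);
    # the median stays within the range of the uncorrupted points.
    print(f"bad={bad:.0e}  mean={zc.mean():.3e}  median={np.median(zc):.1f}")
```

A single erroneous point is enough to move the mean arbitrarily far, while the median is unaffected as long as fewer than half of the points are corrupted.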


Assumptions: The model is defined by the parameters
$\beta_k$, $k = 0 \ldots (p-1)$. The data contains n points with
a fraction $\epsilon < 0.5$ of outliers.

1. Randomly select a p-tuple $\Pi$ from the data.

2. For this p-tuple compute the parameter values
$\hat{\beta}_k(\Pi)$ by solving the p equations. Note that if a
linear model is assumed the p-tuple must yield a
full rank coefficient matrix for the system.

3. Retain all parameters but $\hat{\beta}_0(\Pi)$. Project the data
into the $\beta_0$ subspace by computing

$$a_i(\Pi) = z_i - \sum_{k=1}^{p-1} \hat{\beta}_k(\Pi)\, x_k(i), \qquad i = 1 \ldots n. \qquad (9)$$

4. Find the mode of the $a_i(\Pi)$ sequence and allocate
it to $\hat{\beta}_0(\Pi)$. Store the corresponding shortest
half-window size

$$s(\Pi) = \left[ \min_{\hat{\beta}_0(\Pi)} \operatorname{med}_i \, r_i^2 \right]^{1/2}. \qquad (10)$$

5. Repeat Steps 1 to 4 q times. The final LMedS estimates
are from the p-tuple yielding $s = \min_{\Pi} s(\Pi)$.

Figure 3: The LMedS Algorithm

Figure 4: Piecewise constant data. a) Uncorrupted data.
b) Noisy data, SNR = 8.

Figure 5: Ordered sequence of the pixel values on the
noisy piecewise data.


Two types of severely corrupted data can be distinguished.
Outliers are data points with values j which
[...] piecewise data can always be described by one model and
a set of outliers. For example, one of the planes forming
a roof edge is taken as the model and the points belonging
to the other are then regarded as outliers relative to
that model. Once zero mean noise is also present, however,
the dichotomy between the model and the outliers
may not be easy to establish. (We return to this subject
in Section 3.3.)

Leverage points are data points corresponding to outlying
regressor variable values $x_k(i,j)$. These points are
not necessarily “bad” since their values may be close to
the value predicted by the model. However, if a data
point is far away from the rest and cannot be accounted
for by the model, it may exert an increased influence on
the estimate. The sparse data in some computer vision
applications (for example, stereo) may contain leverage
points which compromise the result in spite of the fact
that their value is within the range of the data. Detection
of leverage points requires special techniques. See
for example [14] for a method using an estimator similar
to the one to be discussed in Section 3.2.

The breakdown point of an estimator cannot exceed
0.5 without making use of a priori information about the
data. Indeed, at least half of the data should be represented
by the model for it to be the unique solution of the
estimation. The median, the degree-0 least absolute
deviation estimator (belonging to the $L_1$ family of
estimators), tolerates close to half the data being severely
corrupted. At the limit, the median has $\epsilon^* = 0.5$. It can
be shown, however, that higher degree surface fitting in
$L_1$ is sensitive to leverage points and it has $\epsilon^* = 0$ [2,
p. 328].

Another degree-0 estimator is the mode of the
data. For a continuous probability distribution function
(p.d.f.) the mode is the most probable variable value,
i.e., the maximum of the p.d.f. (Without loss of generality
we can consider a unimodal p.d.f. for the moment.)
Assume that n outcomes were obtained from the p.d.f.
and they are ordered by increasing values. We first compute
the half-length of a window containing N < n/2
points in all the n - N + 1 positions along the ordered
sequence. (The half-length is defined as half the difference
between the data point values at locations i and
i + N - 1 in the ordered sequence.) The midpoint of the
shortest window is taken as the mode since the maximum
of the p.d.f. implies the most outcomes in an interval.

Let the window size equal half the data size, i.e.,
$N = \lfloor n/2 \rfloor$, where $\lfloor \cdot \rfloor$ is the floor function. Then it can
be shown that the mode $\hat{\beta}_0$ minimizes the median of the
squared residuals, i.e., satisfies

$$\min_{\beta_0} \operatorname{med}_i \, (z_i - \beta_0)^2 \qquad (11)$$

and that the value of (11) is the squared half-length of
the window yielding the mode [13, p. 166]. The use of
squares instead of absolute values assures the uniqueness
of the solution for n an even number [13, p. 170]. The
presence of the median in (11) makes the mode a robust
degree-0 estimator with breakdown point close to 0.5.
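The shortest-window search described above can be sketched as follows. This is a minimal illustration in Python with NumPy, not the authors' implementation; the function name `shortest_window_mode` and the example values are assumptions introduced here.

```python
import numpy as np

def shortest_window_mode(z, N=None):
    """Mode estimate: midpoint of the shortest window of N ordered points.

    Returns (mode, half_length). With the default N = floor(n/2) this is
    the degree-0 estimator discussed in the text.
    """
    z = np.sort(np.asarray(z, dtype=float))
    n = z.size
    if N is None:
        N = n // 2                       # window size N = floor(n/2)
    if N < 2 or N > n:
        raise ValueError("need 2 <= N <= n")
    # Window i covers ordered points i .. i+N-1; there are n-N+1 positions.
    lengths = z[N - 1:] - z[:n - N + 1]
    i = int(np.argmin(lengths))          # first of the shortest windows
    mode = 0.5 * (z[i] + z[i + N - 1])
    half_length = 0.5 * lengths[i]
    return mode, half_length

# Hypothetical data: five clustered values and three gross outliers.
mode, half = shortest_window_mode([10, 11, 12, 13, 14, 50, 55, 60])
print(mode, half)
```

After sorting, the search is a single pass over the n - N + 1 window positions, so the cost is dominated by the O(n log n) sort.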


3.2  Properties  of  LMedS 

The minimization problem (11) can be generalized to an
arbitrary model $\beta$ generating the residuals $r_i$

$$\min_{\beta} \operatorname{med}_i \, r_i^2. \qquad (12)$$

The expression (12) defines the least median of
squares (LMedS) estimator introduced in statistics by
Rousseeuw in 1984. The already mentioned book of
Rousseeuw and Leroy [13] gives an excellent practical
analysis of the estimator. The finite sample breakdown
point of the LMedS estimator is

$$\epsilon^*_n = \frac{\lfloor n/2 \rfloor - p + 2}{n} \qquad (13)$$

and asymptotically yields 0.5.

Since (12) has no analytical solution, a numerical
technique, projection pursuit, is employed to find the
LMedS estimates.

The search procedure (Step 5) has the goal of finding
at least one p-tuple containing only data points represented
by the model, i.e., inliers. Projection of such
a p-tuple into the $\beta_0$ space (Step 3) can yield at most
$\epsilon n$ outliers. The mode detection procedure (Step 4) then
assures the recovery of $\hat{\beta}_0(\Pi)$, and that the entire
parameter vector satisfies (12).
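The five steps can be sketched for the simplest non-trivial case, a line $z = \beta_0 + \beta_1 x$ (p = 2). This is an illustrative sketch in Python with NumPy under that assumption, not the authors' code; the names `shortest_half` and `lmeds_line` and the example data are hypothetical.

```python
import numpy as np

def shortest_half(a):
    """Midpoint and half-length of the shortest window of floor(n/2) points."""
    a = np.sort(a)
    n = a.size
    N = n // 2
    lengths = a[N - 1:] - a[:n - N + 1]
    i = int(np.argmin(lengths))
    return 0.5 * (a[i] + a[i + N - 1]), 0.5 * lengths[i]

def lmeds_line(x, z, q, rng):
    """LMedS fit of z = b0 + b1*x using q randomly drawn 2-tuples."""
    x = np.asarray(x, dtype=float)
    z = np.asarray(z, dtype=float)
    best = None
    for _ in range(q):
        i, j = rng.choice(x.size, size=2, replace=False)   # Step 1
        if x[i] == x[j]:
            continue                      # rank-deficient tuple, skip it
        b1 = (z[j] - z[i]) / (x[j] - x[i])                 # Step 2
        a = z - b1 * x                                     # Step 3, projection
        b0, s = shortest_half(a)                           # Step 4, mode
        if best is None or s < best[2]:
            best = (b0, b1, s)                             # Step 5, keep min s
    return best

# Hypothetical data: 12 points on z = 1 + 2x and 4 gross outliers.
x = np.arange(16.0)
z = 1.0 + 2.0 * x
z[12:] = [100.0, -50.0, 70.0, 3.0]
print(lmeds_line(x, z, q=200, rng=np.random.default_rng(0)))
```

Only the slope is taken from the 2-tuple; the intercept comes from the mode of the projected data, which is what allows the fit to ignore the minority of outliers.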

The random sampling (Step 1) reduces the complexity
of the LMedS algorithm to a manageable amount of
computation. Mode computation per p-tuple requires
$O(n \log n)$ operations and there are $O(n^p)$ possible
p-tuples. Let the tolerated probability of error in random
sampling be $Q \ll 1$. Then the probability that for q
independently selected p-tuples at least one contains only
inliers must satisfy

$$1 - [1 - (1 - \epsilon)^p]^q \geq 1 - Q. \qquad (14)$$

From (14) q can easily be computed. For example, when
a degree-1 model (p = 3) is fit to piecewise data with
$\epsilon = 0.45$, and the tolerated probability of error is Q =
0.01, choosing q = 26 different 3-tuples suffices to find
the three LMedS estimates. Note that q does not depend
on the data size. Since the LMedS estimates correspond
to one of the p-tuples, the only characteristic of interest
about the structure of the data is $\epsilon$.
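Solving (14) for q gives $q \geq \log Q / \log[1 - (1 - \epsilon)^p]$. A short Python sketch of this computation follows; the function name `num_samples` is an assumption introduced here, and the formula presumes $0 < \epsilon < 1$ and $0 < Q < 1$.

```python
import math

def num_samples(eps, p, Q):
    """Smallest q such that 1 - (1 - (1 - eps)**p)**q >= 1 - Q."""
    good = (1.0 - eps) ** p      # probability that a p-tuple has only inliers
    return math.ceil(math.log(Q) / math.log(1.0 - good))

# The example from the text: degree-1 model, eps = 0.45, Q = 0.01.
print(num_samples(0.45, 3, 0.01))
```

For the values in the text, $(0.55)^3 \approx 0.166$ and $\log 0.01 / \log 0.834 \approx 25.3$, which rounds up to the quoted q = 26.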

A “hidden” assumption behind the LMedS estimator
is that the inliers are only weakly corrupted by i.i.d.
Gaussian noise. That is, the parameter values obtained
from a p-tuple containing only inliers are close to the
correct ones. The LMedS algorithm also returns a robust
estimate of the noise’s standard deviation:

$$\hat{\sigma} = 1.4826 \left[ 1 + \frac{5}{n - p} \right] s \qquad (15)$$

where $s$ is the half-window size (10), the term 5/(n - p)
is the finite sample size correction [13, p. 202] and 1.4826
is the correction factor for median based Gaussian noise
standard deviation estimates. Once the noise corrupting the
model is defined (zero mean normal with variance $\hat{\sigma}^2$) the
data can be classified into inliers and outliers. The weights

$$w_i = \begin{cases} 1 & \text{if } |r_i| < 2.5\,\hat{\sigma} \\ 0 & \text{otherwise} \end{cases} \qquad (16)$$

mark the inliers with about a 0.95 confidence level (the
uncertainty is due to the finite data size). If the nature
of the noise is known a priori other correction factors and
thresholds can be computed. The LMedS algorithm and
the inlier/outlier dichotomy procedure, however, become
unreliable when significant noise corrupts the data. This
problem is the subject of the next section.
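The scale estimate and the inlier weights can be sketched as below. The sketch assumes the standard form given by Rousseeuw and Leroy [13], namely 1.4826 (1 + 5/(n - p)) times the square root of the median squared residual, with a 2.5 threshold; the function name `robust_scale_and_weights` and the residual values are hypothetical.

```python
import numpy as np

def robust_scale_and_weights(r, p):
    """Robust noise scale and 0/1 inlier weights from residuals r.

    p is the number of model parameters. The formula is the assumed
    Rousseeuw-Leroy form described in the lead-in.
    """
    r = np.asarray(r, dtype=float)
    n = r.size
    s = np.sqrt(np.median(r ** 2))       # root of the median squared residual
    sigma_hat = 1.4826 * (1.0 + 5.0 / (n - p)) * s
    w = (np.abs(r) < 2.5 * sigma_hat).astype(int)   # 1 = inlier, 0 = outlier
    return sigma_hat, w

# Hypothetical residuals of a degree-0 fit (p = 1): seven small, two gross.
sigma_hat, w = robust_scale_and_weights([1, -1, 2, -2, 1, -1, 0, 30, -40], p=1)
print(sigma_hat, w)
```

Because the scale is derived from the median, the two gross residuals do not inflate it, and they are the only points to receive weight 0.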

An optimization principle similar to the one used in
the LMedS algorithm was already proposed for computer
vision applications in [1] as the Random Sample Consensus
(RANSAC) paradigm. In RANSAC the model
derived from a p-tuple determines the set of data points
agreeing with it within a given tolerance limit. If the
number of points in the set exceeds a threshold, no more
p-tuples are drawn. Thus RANSAC requires the a priori
information of a tolerance limit and a cardinality threshold.
LMedS automatically selects the model representing
at least half the data by minimizing a quality measure of
the fit, the median of the squared residuals. For a more
detailed comparison of the two techniques see [5].
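The difference between the two quality measures can be made concrete with a small Python sketch. The function names `ransac_score` and `lmeds_score`, the tolerance value, and the residuals are assumptions used only for illustration.

```python
import numpy as np

def ransac_score(residuals, tol):
    """Size of the consensus set: points within the given tolerance limit."""
    return int(np.sum(np.abs(residuals) <= tol))

def lmeds_score(residuals):
    """Median of the squared residuals; smaller is better, no tolerance."""
    return float(np.median(np.asarray(residuals, dtype=float) ** 2))

# Hypothetical residuals of one candidate model.
r = [0.5, -0.2, 0.1, 10.0, -12.0]
print(ransac_score(r, tol=1.0), lmeds_score(r))
```

In a RANSAC loop, sampling stops once `ransac_score` exceeds the cardinality threshold; in an LMedS loop, the candidate with the smallest `lmeds_score` over all sampled p-tuples is retained.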


Figure 6: LMedS: $\hat{\beta}_0$ vs. $\sigma$.


3.3  LMedS  in  Computer  Vision 

The high breakdown point of the LMedS estimator
makes it an ideal operator for handling piecewise data.
The estimator identifies the model corresponding to the
majority of the points in the set and discriminates the
points not belonging to it, i.e., the outliers. The LMedS
algorithm (as presented above) was applied to several
computer vision problems. Kim et al. [3], Roth and
Levine [12], Sinha and Schunck [15] have used it in surface
reconstruction; Kumar and Hanson [4] in camera
position estimation; Tirumalai et al. [17] in dynamic
stereo.

We have already emphasized the necessary condition
for obtaining “good” LMedS estimates, the existence of
at least one p-tuple from which reliable model parameters
can be extracted. When significant zero mean noise
corrupts all the data points such a p-tuple may not exist.
In this case a p-tuple containing only noisy inliers yields,
like the least squares estimator (see Section 2.2), unbiased
estimates with large variance. The noise can also
erase the clear dichotomy between data points belonging
to the model and outliers.

Figure 7: Histograms of LMedS estimators. a) $\sigma$ = 17
(SNR = 8). b) $\sigma$ = 25 (SNR = 4).

Consider again piecewise constant data containing a
step edge of amplitude h ($\epsilon$ close to 0.5) corrupted with
zero mean noise taking values in the range $(-a, a)$. This
definition is for convenience only; for Gaussian noise a
could be, say, $3\sigma$. Suppose also for the moment that
a < h/2.

First we use the degree-0 LMedS estimator, i.e., compute
the mode of the data. Assume that we have succeeded
in recovering the uncorrupted value of the horizontal
surface corresponding to the majority of the pixels.
Then the squared residuals are distributed as follows:
slightly more than half in the interval $(0, a^2)$ and
the rest in the interval $[(h - a)^2, (h + a)^2]$. The estimate
$\hat{\sigma}$ of the model noise standard deviation is proportional
to the square root of the median of squared residuals,
(10) and (15), and thus it will exceed a. But a is the
range of the noise and therefore $\hat{\sigma}$ is a strong
overestimate. A large $\hat{\sigma}$ loosens the threshold in the
weights (16) and outliers may be classified as inliers.

When a degree-1 LMedS estimator is applied to the
noisy piecewise constant data the large variance of the
3-tuple based estimates makes any plane orientation possible.
Assume that we have chosen two 3-tuples, one
generating the correct horizontal fit, the other yielding a
plane tilted at about 45 degrees across the edge. The latter
fit is similar to what a least squares estimator would
have recovered from the window. If significant noise is
present (a is close to h/2), most of the squared residuals
for the tilted plane fit are between 0 and $a^2$. The median
of this residual sequence is significantly lower than the
median of the sequence obtained when the horizontal fit
is employed (see above). The LMedS criterion (12) seeks
the minimum and therefore the tilted plane is preferred.
Note that tilted planes are always obtained from some
3-tuples since any sample containing pixels from both
surfaces can yield one. The estimator will choose the
correct, horizontal plane only when the amplitude of the
corrupting noise is very small relative to the amplitude
of the step, that is, the SNR is very high. The correct
fit, however, implies an overestimate of $\sigma$ as was shown
above.

Figure 8: LMedS: $\hat{\sigma}$ vs. $\sigma$.

Figure 9: LMedS: $\hat{\beta}_1$ vs. $\sigma$.

For noisy data with a discontinuity the mode detection
procedure can also fail. Such data yields a bimodal
distribution. Figure 4b shows noisy piecewise constant
data having SNR = 8 derived from the ideal data in Figure 4a.
The ordered sequence of pixel values is given in
Figure 5. Since the resolution of the drawing does not
allow distinct representation for every data point, larger
circles stand for more points with similar values.

The value of the constant surface corresponding to
the majority of the pixels in the uncorrupted data (Figure 4a)
is marked ‘C’ in the sequence. The significant
noise level spreads the values across the entire range of
the data and no clear bimodality can be seen. The mode
detection procedure finds that the shortest window corresponds
to a value between the two surfaces forming the
original step edge. The location of the mode detected for
the noisy data is marked ‘M’. The estimate obtained is
very similar to what a degree-0 least squares estimator
would recover.

To illustrate the severity of the artifacts of the LMedS
estimator for piecewise constant data we have performed
a simulation study. A step edge of amplitude h = 50
was defined on a dense square grid in a 9 x 9 region. Four
columns had the value 128 and five had 178. Gaussian
noise (i.i.d.) with standard deviation from $\sigma$ = 0 to
$\sigma$ = 35 (SNR = 2) was added to the data. For each $\sigma$ value
100 trials were run. The original data thus has $\epsilon$ = 0.44,
and a high breakdown point estimator should recover
from the noisy piecewise constant data an estimate close
to 178.
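The experimental setup can be reproduced approximately with the sketch below, written in Python with NumPy. It is not the authors' code: the function names, the random seed, and the selection of $\sigma$ values are assumptions, and the printed statistics will differ from the published curves in detail.

```python
import numpy as np

def make_step_edge(sigma, rng, h=50.0, low=128.0, size=9, n_low_cols=4):
    """9 x 9 step edge: four columns at 128, five at 178, plus Gaussian noise."""
    img = np.full((size, size), low + h)
    img[:, :n_low_cols] = low
    if sigma > 0:
        img = img + rng.normal(0.0, sigma, img.shape)
    return img

def degree0_lmeds(values):
    """Mode of the data: midpoint of the shortest window of floor(n/2) points."""
    z = np.sort(np.ravel(values))
    n = z.size
    N = n // 2
    lengths = z[N - 1:] - z[:n - N + 1]
    i = int(np.argmin(lengths))
    return 0.5 * (z[i] + z[i + N - 1])

rng = np.random.default_rng(0)           # seed chosen arbitrarily
for sigma in (0.0, 5.0, 17.0, 25.0):
    est = [degree0_lmeds(make_step_edge(sigma, rng)) for _ in range(100)]
    # Mean and standard deviation of the mode estimate over 100 trials.
    print(f"sigma={sigma:5.1f}  mean={np.mean(est):7.2f}  std={np.std(est):6.2f}")
```

With 36 of the 81 pixels at 128, the outlier fraction is 36/81 = 0.44, as stated in the text.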

First the behavior of the degree-0 LMedS estimator,
i.e., mode detection, is analyzed. In Figure 6 the dependence
of the estimated mode value, $\hat{\beta}_0$, on $\sigma$, the noise’s
standard deviation, is shown. The average value of 100
trials as a function of $\sigma$ is drawn with a continuous line.
As a measure for the “stability” of the estimate its standard
deviation for the trials was also computed. The
dashed lines in Figure 6 show one standard deviation
distance from the average as a function of $\sigma$.

The estimated mode value is satisfactory for high
signal-to-noise ratios but deteriorates for $\sigma$ > 14 (SNR <
13). Higher noise more often yields erroneous mode detection
and the average decreases. Two histograms of
$\hat{\beta}_0$ for a given $\sigma$ based on 500 trials are shown in
Figure 7a ($\sigma$ = 17, SNR = 8) and Figure 7b ($\sigma$ = 25,
SNR = 4). The estimates have a wide distribution and
the complete failure of the degree-0 LMedS estimator at
such noise levels is clearly seen.

The effect of overestimation of $\sigma$, the noise’s standard
deviation, by the degree-0 LMedS estimator is shown in
Figure 8 where the dependence of $\hat{\sigma}$ on $\sigma$ is given using
the same conventions as in Figure 6. The linear increase
for high SNR is already an overestimation since it has
slope 1.5 instead of 1. At significant noise levels the
failure of the degree-0 LMedS estimator to provide the
correct model yields incorrect $\hat{\sigma}$ estimates.

We have shown above that for noisy piecewise constant
data the degree-1 LMedS estimator (planar fit) will favor
a tilted plane over the correct horizontal surface. In
Figure 9 the dependence of the slope estimate $\hat{\beta}_1$ on $\sigma$
is shown in the same experimental conditions as before.
In the presence of a very low noise level ($\sigma$ = 4, SNR =
150) the estimate already deviates from the correct value
of $\beta_1$ = 0. The deterioration of the performance is very
fast; at $\sigma$ = 7 (SNR = 50) $\hat{\beta}_1$ already reaches the
plateau corresponding to a plane tilted across the edge.

We have shown in this section that application of
LMedS to noisy piecewise constant data with a large
fraction of outliers can result in failure.

•  The  high  breakdown  point  property  of  the  LMedS 
estimator  no  longer  is  satisfied  and  performance 
similar  to  least  squares  is  often  obtained. 

•  The  LMedS  estimator  cannot  recover  the  correct 
model  when  this  is  misspecified,  i.e.,  non-essential 
parameters  are  introduced. 

•  The  inlier/outlier  dichotomy  can  become  erroneous 
because  of  overestimation  of  the  noise’s  standard 
deviation. 

In the next section we propose a new robust estimation
technique which extends the high breakdown point
property to piecewise data with low signal-to-noise ratios.
It also achieves a significant speedup relative to
LMedS while reducing the above mentioned deficiencies.

Figure 10: The censored mean distribution of the pixel
values on the noisy piecewise data. The mode ‘M’ is
detected with the enhanced procedure.


4  The  Consensus  by  Decomposition 
Estimator 

The  principle  behind  the  LMedS  estimator  (and  the 
RANSAC  paradigm)  can  be  described  as  follows: 

Compute  a  candidate  model  based  on  a  randomly 
chosen  small  subset  of  the  data.  Apply  this  model 
to  all  the  data.  Compute  a  global  quality  measure 
for  the  model.  Optimize  the  quality  measure  by 
repeating  the  procedure  several  times. 

To have reliable LMedS estimates it is necessary that
at least one candidate model carry the correct parameter
values. We have shown that when this condition
is not satisfied the LMedS estimator loses most of its
attractive properties. The high breakdown point of an
estimator can be interpreted as seeking the consensus
in the data about the assumed model. Consensus in
LMedS is achieved by minimizing the criterion (12) and
in RANSAC by maximizing the number of inliers.

However, consensus can be achieved in other ways. In
this section we describe a new, high breakdown point
robust technique, which we call the consensus by
decomposition (CBD) estimator. The concept of sampling
structure, defined as the set required to index the data,
is of importance for the CBD estimator. For example,
the sampling structure of two-dimensional spatial data
described in Cartesian coordinates is the set of (i, j)
addresses of the points at which data is available. The
principle behind the CBD estimator differs from that of
LMedS:

Compute a candidate model for a small neighborhood
of every data point in the sampling structure.
Define a separate estimated image for each of the
model parameters. Estimate the parameters from
their images by recursively applying an enhanced
mode detection procedure.

The  necessary  condition  for  a  reliable  CBD  estimate  is 
that  the  majority  of  the  candidate  models  carry  unbiased 
parameter  estimates.  This  condition  is  much  weaker 
than  the  one  required  by  LMedS  and  makes  possible  the 
elimination  of  the  latter’s  artifacts.  The  CBD  estimator 
was  described  briefly  in  [8]. 


Assumptions: The data is a step edge ($\epsilon$ close to 0.5)
homogeneously corrupted by zero mean noise with standard
deviation $\sigma$. The size of the neighborhood used for
local processing allows an absolute majority of unbiased
estimates.

1. (Smoothing.) Using a degree-0 LS estimator, compute
the local mean and standard deviation estimates
for small neighborhoods around every data
point in the sampling structure.

2. Using a degree-0 LMedS estimator, estimate the
noise’s standard deviation $\sigma$ as the mode of the standard
deviation distribution. Mark the outliers of
this distribution.

3. (Censoring.) Eliminate the points found as standard
deviation outliers from the mean estimate distribution.

4. Using a degree-0 LMedS estimator, estimate the
value of the surface model $\beta_0$ as the mode of the
censored mean distribution.

5. Define the inlier/outlier dichotomy of the censored
mean distribution. Map it into the data. Eliminate
possible inconsistencies on the sampling structure.
Extend the dichotomy to the entire set of data.

Figure 11: The EMD Algorithm

4.1  Mode  Detection  with  Smoothing  and 
Censoring 

The failure of mode detection (the degree-0 LMedS
estimator) for noisy piecewise constant data and the problem
of overestimating the noise’s standard deviation
were discussed in Section 3.3. Since mode detection
is the computational module assuring the robustness of
LMedS, we must first improve its performance if we want to
robustly estimate noisy piecewise data.

Consider piecewise constant data containing a step
edge ($\epsilon$ close to 0.5) corrupted with zero mean noise having
standard deviation $\sigma$. We assume a homogeneous
noise process, that is, the same process corrupts both
sides of the edge. This assumption is a weak one and is
practically always satisfied in computer vision applications.
A small neighborhood is defined on the sampling
structure around every data point. (Border effects can
be eliminated by using only a subset of the available
data.) In each of these neighborhoods a degree-0 least
squares (LS) estimator is applied and the estimates for
the local mean and standard deviation of the data (and
also the noise!) are computed.

Figure 13: Histograms of EMD estimates. a) $\sigma$ = 17. b)
$\sigma$ = 25.

In Section 2 we have shown that for neighborhoods
not containing the edge both LS estimates are unbiased.
In neighborhoods that cross the edge (thus having data
from both surfaces) the mean estimate has an intermediate
value and the standard deviation estimate increases.
To obtain the distribution of each of the two LS estimates
their values are ordered into two sequences. We
want the unbiased estimates to become the absolute majority
in these sequences. Since the two LS estimates are
defined on the same sampling structure as the data, we
achieve our goal if the size of the data is large enough
relative to the size of the neighborhood.

The distribution of the standard deviation estimates
is unimodal with the mode close to the correct value $\sigma$.
Estimates from neighborhoods containing the edge are
placed at the upper tail of the ordered sequence. The
mode detection procedure described in Section 3.1 can
always cope with such a distribution and thus the mode
is a reliable estimate of the corrupting noise’s standard
deviation. The mode detection procedure also returns a
robust measure for the spread of the distribution (15).
By using the weights (16) we can mark the estimates
which are far away from the correct standard deviation
value, i.e., the outliers. Most of the outliers from the
upper tail correspond to the data points with neighborhoods
containing the edge.

Like the ordered sequence of the original data, the
distribution of the local mean estimates is bimodal. The
two modes correspond to the values of the two original
surfaces. Note that we must have the majority of LS estimates
in the mean sequence unbiased but not all of these
estimates have to represent the same surface. The replacement
of the data by a local mean reduces the spread
of the estimates around the two modes. The data points
where the averaging yields incorrect (biased) results are
near the edge and were already marked as outliers in the
standard deviation sequence. These points thus can be
removed and the two modes of the remaining (censored)
mean sequence become well separated. The mode detection
is now reliable and the mode corresponding to
the majority of the data points is recovered. Compare
the censored mean sequence in Figure 10 with the sequence
corresponding to the original data in Figure 5.
The new, enhanced mode estimate (marked ‘M’) is close
to the correct surface value (marked ‘C’).

The  enhanced  mode  detection  (EMD)  procedure  for 
noisy  piecewise  constant  data  is  given  in  Figure  11. 
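Steps 1 to 4 of the procedure can be sketched as follows, in Python with NumPy (version 1.20 or later for `sliding_window_view`). This is an illustrative reading of the steps, not the authors' implementation: the function names are hypothetical, the spread measure assumes the Rousseeuw and Leroy form of (15) with p = 1, a non-strict comparison is used so that noise-free data (zero spread) is handled, and Step 5 (mapping the dichotomy back to the data) is omitted.

```python
import numpy as np
from numpy.lib.stride_tricks import sliding_window_view

def shortest_half(v):
    """Mode (midpoint) and half-length of the shortest window of floor(n/2) points."""
    v = np.sort(np.ravel(v))
    n = v.size
    N = n // 2
    lengths = v[N - 1:] - v[:n - N + 1]
    i = int(np.argmin(lengths))
    return 0.5 * (v[i] + v[i + N - 1]), 0.5 * lengths[i]

def emd(img, win=3):
    """Enhanced mode detection, Steps 1 to 4, on a 2-D array."""
    # Step 1 (smoothing): local LS mean and standard deviation. Only
    # complete windows are used, which sidesteps border effects.
    patches = sliding_window_view(img, (win, win))
    mean = patches.mean(axis=(2, 3)).ravel()
    std = patches.std(axis=(2, 3), ddof=1).ravel()
    # Step 2: mode of the standard deviation distribution and its outliers.
    sigma_hat, s = shortest_half(std)
    spread = 1.4826 * (1.0 + 5.0 / (std.size - 1)) * s
    keep = np.abs(std - sigma_hat) <= 2.5 * spread
    # Step 3 (censoring): drop means whose local std was marked an outlier.
    censored = mean[keep]
    # Step 4: mode of the censored mean distribution.
    beta0_hat, _ = shortest_half(censored)
    return beta0_hat, sigma_hat

# Hypothetical test image: the 9 x 9 step edge of Section 3.3 with noise.
rng = np.random.default_rng(0)           # seed chosen arbitrarily
img = np.full((9, 9), 178.0)
img[:, :4] = 128.0
print(emd(img + rng.normal(0.0, 5.0, img.shape)))
```

On a 9 x 9 image with 3 x 3 neighborhoods there are 49 complete windows, of which 14 straddle the edge; these are the ones the censoring step is meant to remove.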

The experiments described in Section 3.3 analyzing
the behavior of the degree-0 LMedS estimator were repeated
for the EMD procedure. The local neighborhood
was of size 3 x 3. In Figure 12 the dependence of the
estimated mode value, $\hat{\beta}_0$, on $\sigma$, the noise’s standard
deviation, is shown. The performance of the EMD procedure
no longer deteriorates for $\sigma$ > 14 (SNR < 13) as that of
LMedS did (Figure 6). The standard deviation of $\hat{\beta}_0$ for
the 100 trials, however, appears to be larger than in the
case of LMedS.

Figure 14: Graphs of $\hat{\beta}_0$ vs. $\sigma$ with the lowest 12 percent
of the values trimmed. a) EMD. b) LMedS.

The phenomenon is explained using the histograms of
500 $\hat{\beta}_0$ estimates for a given $\sigma$. They are shown in
Figure 13a ($\sigma$ = 17, SNR = 8) and Figure 13b ($\sigma$ = 25,
SNR = 4). In the EMD procedure most of the trials
yield a $\hat{\beta}_0$ estimate close to the correct value $\beta_0$ = 178
(compare with Figure 7). The estimates no longer spread
over the entire range of the data, but in a few experiments
mode values in the vicinity of the other surface
($\beta_0$ = 128) are detected. These artifacts are caused by
the probabilistic nature of the censoring process. The
noisy data points derived from the two surfaces are almost
equally represented ($\epsilon$ = 0.44). After the censoring
process the majority in the mean sequence may “flip” toward
the second surface and the mode detection returns
the corresponding value. Note in Figure 13a the distribution
of “flipped” estimates close to the other surface’s
value. The outlying mode values dramatically increase
the standard deviation of $\hat{\beta}_0$ for the trials. In Figure 12
only two $\sigma$ values yielded 100 trials without artifacts
where small standard deviations were obtained ($\sigma$ = 9
and $\sigma$ = 15).

In most computer vision applications the existence of
such a “flip” is of no importance since the inlier/outlier
dichotomy of the data obtained at Step 5 of the EMD
procedure is correct albeit reversed. Cooperative processes
among adjacent (or overlapping) data sets will
thus always converge to the same final result. To remove
the effects of the artifacts we have trimmed the lowest
12 percent of $\hat{\beta}_0$ values from the trials. The dependence
of the trimmed $\hat{\beta}_0$ on $\sigma$ (Figure 14a) has practically the
same mean as before but with a reduced standard deviation.
Note that the probability of a “flip” is very low
since discarding the lowest 12 percent eliminated most of
the “flips” over the entire range of $\sigma$. Applying the same
trimming method to the LMedS trials has no significant
effect on the curve. Compare Figures 6 and 14b.

At Step 2 of the enhanced mode detection procedure
the estimate $\hat{\sigma}$ of the noise’s standard deviation is also
obtained. Its value as a function of $\sigma$ is shown in Figure 15.
The dependence is now linear over the whole
range and has slope one, i.e., the estimate is always correct.
Compare with Figure 8.

We conclude from these experiments that the enhanced
mode detection procedure succeeds in preserving
the high breakdown point properties for piecewise constant
data up to low signal-to-noise ratios. The procedure
also returns a correct estimate for the noise’s standard
deviation. We can now proceed to develop a technique
for the general case of piecewise data.

4.2  Robust  Estimation  of  Noisy  Piecewise 
Data 

The performance improvement of the enhanced mode
detection method relative to the original procedure is
made possible through computing mostly unbiased estimates
and robustly analyzing their distributions. For
piecewise constant data the neighborhoods not containing
a discontinuity (the step edge) yield the unbiased estimates.
Unfortunately, for the general piecewise model
this is not true for all the parameters. The local neighborhood
computations can be regarded as translating a
small window across the data. We are interested in the
set of parameters invariant under translation over data
coming from the same model. Only these parameters
yield unbiased estimates and therefore only their distribution
pooled together from several neighborhoods is
meaningful. The discussion for arbitrary regressor variables
$x_k(i,j)$ and/or sparse sampling structures is beyond
the scope of this paper. In what follows we restrict
ourselves to the case

=  0,1....  (17) 


Assumptions: The piecewise data ($0 < \varepsilon < 0.5$) is homogeneously
corrupted by zero mean noise with standard
deviation $\sigma$. The model has the parameters $\beta_k$,
$k = 0, 1, \ldots, (p - 1)$. The size of the neighborhood used
for local processing allows an absolute majority of unbiased
estimates for the translation invariant parameters.

1.  (Spatial  decomposition.)  Using  an  LS  estimator, 
compute  a  candidate  model  for  a  small  neighbor¬ 
hood  around  every  data  point  in  the  sampling  struc¬ 
ture. 

2. (Parameter decomposition.) For each parameter $\beta_k$ define a
separate  parameter  image  on  the  sampling  struc¬ 
ture. 

3.  (Consensus  seeking.)  Apply  the  enhanced  mode  de¬ 
tection  procedure  to  the  translation  invariant  pa¬ 
rameter  images.  The  resulting  mode  values  are 
the  final  CBD  estimates  for  this  subset  of  parame¬ 
ters. 

4.  Mark  on  the  sampling  structure  the  points  that  are 
inliers  in  all  the  translation  invariant  parameter  im¬ 
ages  processed  in  Step  3. 

5. Project the data into the subspace of the remaining
parameters  by  using  a  relation  similar  to  (9).  Define 
the  corresponding  reduced  model. 

6.  Repeat  Steps  1  to  5  for  the  marked  points  until  no 
parameter  is  left  unestimated. 


Figure 16: The CBD Algorithm

Figure 17: CBD: $\hat{\beta}_1$ vs. $\sigma$.

Figure 18: Histograms of CBD estimates of $\beta_1$. a) $\sigma = 17$. b) $\sigma = 25$.


Model  (1)  is  then  that  of  a  polynomial  surface  and  we 
perform  robust  surface  fitting  on  a  dense  Cartesian  grid. 

The linear model (1) used by local unweighted least
squares estimation is correct when no discontinuity is
present. Since the estimation is unweighted the covariance
matrix of the estimates (6) becomes
$\mathrm{cov}[\hat{\beta}] = \sigma^2 (X^T X)^{-1}$. The matrix $X$ is defined by the values of
the regressor variables. We want $X$ to be an orthogonal
matrix, that is, to satisfy the property $X^T = X^{-1}$. In
this case the covariance matrix (6) is diagonal and the
estimates are uncorrelated.


In our case orthogonality of $X$ is achieved if the model
is defined in terms of a polynomial base instead of the
regressor variables (17). That is, we use linear combinations
of the monomials (17). These basis polynomials are
known as Chebyshev polynomials and their properties in
the context of computer vision are discussed in [7]. The
Cartesian product of the Chebyshev polynomials serves
as the base for the two-dimensional model. The coordinate
variables $i$ or $j$ belong to the Chebyshev base polynomials.
This explains why the planar model (7) yields
the diagonal covariance matrix (8). For a quadratic surface
the second degree terms must be introduced by the
corresponding Chebyshev polynomials and not by $i^2$ or
$j^2$ alone. In what follows we assume that the least squares
estimated parameters are uncorrelated.
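The effect of the polynomial base on the covariance structure can be checked numerically. The following is a minimal sketch, assuming a centered square window and assuming that the degree-two base polynomial is the squared coordinate minus its mean over the window (the usual discrete orthogonal polynomial on symmetric points). The function names are hypothetical, and the columns are mutually orthogonal but not normalized, so $X^T X$ is diagonal rather than the identity.

```python
import numpy as np

def window_coords(half_width):
    """Local coordinates (i, j) of a centered square window, flattened."""
    coords = np.arange(-half_width, half_width + 1, dtype=float)
    i, j = np.meshgrid(coords, coords, indexing="ij")
    return i.ravel(), j.ravel()

def monomial_design(half_width):
    """Design matrix with raw monomial regressors 1, i, j, i^2, ij, j^2."""
    i, j = window_coords(half_width)
    return np.column_stack([np.ones_like(i), i, j, i**2, i * j, j**2])

def orthogonal_design(half_width):
    """Same model, with the second degree terms centered by their window mean."""
    i, j = window_coords(half_width)
    c = np.mean(i**2)  # mean of the squared coordinate over the window (assumed base)
    return np.column_stack([np.ones_like(i), i, j, i**2 - c, i * j, j**2 - c])

def is_diagonal(matrix, tol=1e-9):
    """True when all off-diagonal entries are numerically zero."""
    return np.allclose(matrix, np.diag(np.diag(matrix)), atol=tol)

X_mono = monomial_design(2)      # 5 x 5 window
X_orth = orthogonal_design(2)
gram_mono = X_mono.T @ X_mono    # inverse covariance up to the factor sigma^2
gram_orth = X_orth.T @ X_orth

print("planar part diagonal:    ", is_diagonal(gram_mono[:3, :3]))
print("raw quadratic diagonal:  ", is_diagonal(gram_mono))
print("centered quadratic diag.:", is_diagonal(gram_orth))
```

The planar regressors are already orthogonal on a centered window, while the raw $i^2$ and $j^2$ columns correlate with the constant column until they are centered.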

Since the estimates are uncorrelated the estimation
process can be decomposed into independent procedures,
one for each parameter, each having a partial model of the
form $\beta_k x_k(i,j)$ [9, p. 137]. The relation between the regressor variables
defined in two different neighborhoods can be found by
performing the translation transformation $i \leftarrow (i - i_0)$
and $j \leftarrow (j - j_0)$. A parameter estimate $\hat{\beta}_k$ is invariant
to translation if its partial model remains unchanged
after the transformation. It is immediate to observe
that only the parameters $\beta_k$ multiplying the highest degree
terms remain unchanged in the translated model.
The rest of the polynomial coefficients in the translated
model become linear combinations of the original $\beta_k$'s.
For example, when a planar (degree-1) model is used
only the two slope parameters are invariant under translation;
the intercept is not.
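The planar example can be written out explicitly. As a worked illustration (the primed local coordinates $i' = i - i_0$, $j' = j - j_0$ and the use of plain monomials are notation assumed here), substituting $i = i' + i_0$ and $j = j' + j_0$ gives:

```latex
\begin{align*}
z(i,j) &= \beta_0 + \beta_1 i + \beta_2 j \\
       &= \beta_0 + \beta_1 (i' + i_0) + \beta_2 (j' + j_0) \\
       &= \underbrace{(\beta_0 + \beta_1 i_0 + \beta_2 j_0)}_{\text{translated intercept}}
          + \beta_1 i' + \beta_2 j' .
\end{align*}
% A second degree term behaves the same way: only its leading coefficient survives.
\begin{align*}
\beta_3 i^2 = \beta_3 (i' + i_0)^2
            = \beta_3 i'^2 + 2 \beta_3 i_0\, i' + \beta_3 i_0^2 .
\end{align*}
```

The slopes $\beta_1$, $\beta_2$ and the leading coefficient $\beta_3$ are unchanged, while the lower degree coefficients pick up terms that depend on the window position $(i_0, j_0)$.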

The  uncorrelatedness  of  the  estimates  allows  us  to  de¬ 
fine  for  each  parameter  a  separate  parameter  image  on 
the  sampling  structure.  Thus  after  performing  the  local 
LS  estimations  at  every  data  point,  p  parameter  images 
are  defined.  The  fundamental  observation  behind  the 
consensus  by  decomposition  (CBD)  estimator  is  that  the 
translation  invariant  parameters  of  piecewise  data  yield 
parameter  images  which  are  blurred  noisy  step  edges. 
Indeed,  for  data  belonging  to  the  same  model  the  trans¬ 
lation  invariant  parameters  are  unbiased  estimates,  i.e., 
constants  corrupted  with  noise.  Neighborhoods  contain¬ 
ing  data  from  both  models  yield  incorrect  estimates,  in¬ 
troducing  blur  at  the  step  discontinuity.  Note  that  the 
noise  is  correlated  since  the  estimation  is  performed  in 
overlapping  neighborhoods.  Thus  the  piecewise  constant 
data  case  discussed  in  the  previous  section  is  an  ideal¬ 
ization  of  a  translation  invariant  parameter  image. 

The  consensus  by  decomposition  algorithm  (CBD)  is 
given  in  Figure  16. 
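Steps 1 and 2 of the algorithm, and the observation that a translation invariant parameter image is a blurred step edge, can be sketched on synthetic data. This is a minimal sketch assuming a planar model, 3 × 3 neighborhoods, and a noise-free step so that the values are easy to verify. The function name `parameter_images` is hypothetical, and the consensus step (enhanced mode detection) is not implemented here.

```python
import numpy as np

def parameter_images(image, half_width=1):
    """Local planar LS fit around every interior pixel (Steps 1 and 2).

    Returns an array of shape (3, rows - 2h, cols - 2h) holding the
    intercept, the slope along i (rows) and the slope along j (columns).
    """
    h = half_width
    coords = np.arange(-h, h + 1, dtype=float)
    di, dj = np.meshgrid(coords, coords, indexing="ij")
    X = np.column_stack([np.ones(di.size), di.ravel(), dj.ravel()])
    solver = np.linalg.pinv(X)  # maps a flattened window to (b0, b1, b2)
    rows, cols = image.shape
    out = np.empty((3, rows - 2 * h, cols - 2 * h))
    for r in range(h, rows - h):
        for c in range(h, cols - h):
            window = image[r - h:r + h + 1, c - h:c + h + 1]
            out[:, r - h, c - h] = solver @ window.ravel()
    return out

# Noise-free vertical step edge: 0 on the left, 10 on the right.
step = np.zeros((8, 8))
step[:, 4:] = 10.0
intercept, slope_i, slope_j = parameter_images(step)

# Fraction of neighborhoods whose slope across the edge is unbiased (zero).
unbiased_fraction = np.mean(np.isclose(slope_j, 0.0))
print(slope_j[0])           # one row of the slope parameter image
print(unbiased_fraction)
```

Only the neighborhoods straddling the discontinuity return a nonzero slope, so the majority of the entries in the slope image carry the correct value, which is what the consensus step relies on.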


Figure 19: CBD: $\hat{\beta}_2$ vs. $\sigma$.


The CBD algorithm seeks a consensus on each model
parameter by separately analyzing its underlying distribution
with the enhanced mode seeking procedure. First
the algorithm estimates the coefficients of the highest degree
terms of the polynomial surface which are the translation
invariant parameters. For a quadratic surface, for
example, the coefficients are those of $i^2$, $ij$ and $j^2$. These
estimates are then used to compute the data values projected
into the subspace of the remaining parameters.
The projected data is represented by a reduced model.
In the case of the quadratic example, the reduced model
is planar. The reduced model uses the same regressor
variables. The coefficients of the currently highest degree
terms are the new set of translation invariant parameters.
In our example they are the two slope components
of the plane. Note that the projection propagates the
estimation errors into the remaining parameters.

Figure 20: CBD: $\hat{\beta}_0$ vs. $\sigma$. a) All the values. b) The
lowest 12 percent trimmed.
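The projection onto the reduced model, and the way an error in a highest degree coefficient propagates into the remaining parameters, can be illustrated with a short sketch. It assumes plain monomial regressors on a single centered 5 × 5 window rather than the Chebyshev base, and it does not reproduce relation (9). All function names and coefficient values are hypothetical.

```python
import numpy as np

def fit_plane(z, i, j):
    """Unweighted LS fit of z = b0 + b1*i + b2*j; returns (b0, b1, b2)."""
    X = np.column_stack([np.ones_like(i), i, j])
    beta, *_ = np.linalg.lstsq(X, z, rcond=None)
    return beta

def project_out_quadratic(z, i, j, b_ii, b_ij, b_jj):
    """Subtract the estimated second degree terms, leaving planar data."""
    return z - (b_ii * i**2 + b_ij * i * j + b_jj * j**2)

coords = np.arange(-2.0, 3.0)
i, j = (a.ravel() for a in np.meshgrid(coords, coords, indexing="ij"))

# Synthetic quadratic surface with illustrative (made-up) coefficients.
z = 178.0 + 1.5 * i - 0.5 * j + 0.3 * i**2 + 0.1 * i * j - 0.2 * j**2

# Reduced model fitted after projecting with exact second degree estimates.
exact = fit_plane(project_out_quadratic(z, i, j, 0.3, 0.1, -0.2), i, j)

# Same, with the i^2 coefficient overestimated by 0.1.
biased = fit_plane(project_out_quadratic(z, i, j, 0.4, 0.1, -0.2), i, j)

intercept_error = biased[0] - exact[0]
print(exact, biased, intercept_error)
```

On this symmetric window the error in the $i^2$ coefficient leaves the slopes untouched and shifts only the intercept, by the error times the mean of $i^2$ over the window.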

Only the inliers, i.e., data points belonging to the
model, are taken into account in the projected data.
It is possible that at Step 4 the intersection of the set
of inliers in the different translation invariant parameter
images yields only a few points. This can happen
when the inlier/outlier dichotomy is “flipped” in some
of the parameter images. As we have seen in Section 4.1
the probability of a “flip” is low. Cooperative processes
among the parameter images help to eliminate the artifacts.
After projection the local LS estimates use the
reduced model and the consensus for the new translation
invariant parameter set is sought. The entire cycle
is then repeated until $\beta_0$ is finally estimated.

The behavior of the degree-1 CBD algorithm for piecewise
constant data is analyzed next. The same experimental
conditions were used as for the analysis of degree-0
estimators. The dependence of the slope across the
edge, $\hat{\beta}_1$, on $\sigma$, the standard deviation of the corrupting
noise, is shown in Figure 17. Compare it with the
behavior of LMedS shown in Figure 9. The CBD algorithm
correctly returns a value close to $\beta_1 = 0$ up to
low signal-to-noise ratios. The histograms of $\hat{\beta}_1$ shown in
Figure 18a ($\sigma = 17$, SNR = 8) and Figure 18b ($\sigma = 25$,
SNR = 4) illustrate the spread of the estimates. Note
that no “flips” occur since for piecewise constant data
the slopes have a unimodal distribution. The dependence
of $\hat{\beta}_2$ on $\sigma$ is shown in Figure 19. As expected the
estimates are always close to the correct value $\beta_2 = 0$.
The dependence of $\hat{\beta}_0$ on $\sigma$ is shown in Figure 20a. The
correct value is $\beta_0 = 178$. The $\beta_0$ parameter image is
piecewise constant and therefore the “flips” discussed in
Section 4.1 may appear. When the lowest 12 percent of
the $\hat{\beta}_0$ values in the 100 trials are trimmed, however, the
dependence is as expected (Figure 20b). The error propagation
discussed above slightly increases the standard
deviation relative to the use of the degree-0 estimator
(compare with Figure 14a).

It  is  important  to  notice  that  we  have  eliminated  the 
problem  of  non-essential  model  parameters  discussed  in 
Section 3.3. In spite of using a degree-1 model for piecewise
constant data we were able to recover the horizontal
surface corresponding to the majority of the pixels.
Given  the  data  and  a  family  of  models  (e.g.,  polynomial 
surfaces),  the  model  order  problem  deals  with  the  se¬ 
lection  of  the  model  best  fitting  the  data.  There  is  no 
unique  solution  for  the  problem  and  it  is  an  active  re¬ 
search  area  both  in  the  context  of  classic  [9,  chapter  7] 
and  robust  [2,  pp.  366-7]  statistical  techniques.  Without 
claiming  a  theoretical  advance,  the  CBD  estimator  may 
constitute  a  practical  solution  at  least  for  the  polynomial 
surface  fitting  case  discussed  here. 

Another important observation is that of the speed-up
achieved by the CBD estimator relative to the LMedS estimator
employing random sampling. When $q$ $p$-tuples
are chosen randomly from $n$ data points, the computational
complexity of LMedS is $O(qn \log n)$. Due to the
decomposition used by the CBD estimator the computational
complexity decreases to $O(pn \log n)$. Since $q$ is not
a linear function of $p$ (14) the speed-up is by an order
of magnitude for planar surfaces ($p = 3$) and can reach
two orders of magnitude for quadratic surfaces ($p = 6$).
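The growth of $q$ with $p$ can be made concrete with a small calculation. Equation (14) is not reproduced in this section, so the sketch below assumes it is the usual random sampling bound, in which $q$ is the smallest number of $p$-tuples such that at least one is free of outliers with probability $P$. The function names and the default values of $\varepsilon$ and $P$ are assumptions for illustration.

```python
import math

def num_samples(p, eps=0.5, prob=0.99):
    """Smallest q with 1 - (1 - (1 - eps)**p)**q >= prob (assumed form of (14))."""
    good_tuple = (1.0 - eps) ** p          # chance that a p-tuple has no outliers
    return math.ceil(math.log(1.0 - prob) / math.log(1.0 - good_tuple))

def speedup(p, eps=0.5, prob=0.99):
    """Ratio q / p of the O(qn log n) and O(pn log n) complexities."""
    return num_samples(p, eps, prob) / p

for p in (3, 6):
    print(p, num_samples(p), round(speedup(p), 1))
```

Under these assumptions $q = 35$ for $p = 3$ and $q = 293$ for $p = 6$, so the ratio $q/p$ grows quickly with the model order. The exact figures depend on the chosen $\varepsilon$ and $P$.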

The  CBD  estimator  is  superior  to  LMedS  for  most 
computer  vision  applications  since  it:  1)  can  handle 
noisy  piecewise  data;  2)  can  tolerate  misspecified  models; 
3)  is  faster. 

The only type of data for which our implementation
of the CBD estimator may fail is data corrupted with
impulse noise. When $\varepsilon$ is close to 0.5 and the corrupted
data points are uniformly distributed on the sampling
structure, most of the local LS estimates become erroneous
and unbiased estimate majorities cannot be obtained.
The impulse noise, however, can be removed
by preprocessing. The preprocessing can be either local
LMedS interpolation, if no significant zero mean noise is
present, or M-estimator based local smoothing (without
guaranteed good performance!) if such noise is present.

Figure 21: a) A “wedding cake”. b) A “wedding cake”
with Gaussian noise (SNR = 11).

5  Applications 

In  this  section  we  briefly  present  two  applications  that 
use the CBD estimator: image smoothing and edge detection.
The use of robust techniques in low-level vision
tasks is of importance since an improved output lessens
the feature definition uncertainty in high-level vision.
When  appropriate  we  compare  the  results  with  those 
obtained  by  substituting  LMedS  for  the  CBD  module. 

The  CBD  algorithm  is  applied  here  to  the  case  of 
robust  polynomial  surface  fitting  on  a  dense  sampling 
structure.  To  discuss  some  of  the  implementational 
problems  and  to  illustrate  the  performance  of  the  tech¬ 
nique  on  real  data,  the  CBD  algorithm  was  used  as 
a  computational  module  in  two  low-level  vision  tasks. 
The  image  smoothing  application  shows  the  superiority 
of  CBD  to  LMedS  when  the  data  is  corrupted  by  zero 
mean  noise.  The  simultaneous  step/roof  edge  detection 
application  makes  use  of  the  capacity  of  the  CBD  algo¬ 
rithm  to  correctly  estimate  piecewise  constant  data  with 
degree-1 estimators.

Figure 22: a) CBD based smoothing. b) LMedS based
smoothing.

Figure 23: Results of two level consensus based smoothing.

5.1 Image Smoothing

In centered window based image smoothing methods the
value of a pixel is updated with the value of the intercept
$\hat{\beta}_0$ since the pixel has local coordinates (0,0).
High breakdown point estimators return the model corresponding
to the majority of the pixels and this can
yield artifacts. At corners the window center does not
necessarily belong to the surface representing the majority
of the pixels. Similarly, if the inlier/outlier dichotomy
is reversed (for example, a “flip” occurs in the CBD estimation)
the smoothed pixel gets an incorrect value. The
latter case can appear only at low signal-to-noise ratios.

Figure 21a is a wire-frame display of a synthetic “wedding
cake” and Figure 21b is a display of the “wedding
cake” with additive Gaussian noise (SNR = 11). The
wire-frame representation was selected to better display
the effects of the smoothing. Although the data looks
very noisy in this display, the amount of noise that was added to
the original data seems harmless enough in a gray level
representation.

Figures 22a and 22b show the results of running a
centered window based smoothing operator on the noisy
data using CBD and LMedS, respectively. In both cases
a 9 × 9 window was used, and 3 × 3 windows were used
as local neighborhoods for CBD. Note how the CBD operator
better preserves the edges while LMedS smooths
them. Some artifacts appear as discussed above.
These  artifacts  are  eliminated  if  consistency  among  ad¬ 
jacent  windows  processing  overlapping  data  is  assured. 
Most  of  the  windows  recover  the  correct  local  structure 
and  thus  a  consensus  based  decision  allocates  the  cor¬ 
rect  smoothed  value  to  the  pixel.  The  result  of  such 
a  procedure  can  be  seen  in  Figure  23.  Note  that  this 
consensus is hierarchically at a higher level than the one
detected  in  the  CBD  estimator.  Cooperation  among  win¬ 
dows  is  also  helpful  to  keep  the  size  of  the  windows  using 
the  CBD  algorithm  relatively  small.  Diagonally  oriented 
discontinuities  require  larger  windows  in  order  to  have 
the  majority  of  local  estimates  unbiased.  However,  this 
worst  case  situation  appears  only  when  the  discontinu¬ 
ity  passes  through  the  center  of  the  window.  Thus  the 
windows  containing  a  discontinuity  with  an  offset  still 
recover  the  correct  local  structure. 

The same experiment was run on a synthetic roof (Figure 24a)
with added noise having $\sigma = 4$ (Figure 24b).
The  results  are  shown  in  Figure  25.  A  consensus  at 
a  higher  level  is  not  needed  since  a  “flip”  occurring  in 
a  window  centered  on  the  rooftop  does  not  change  the 
smoothed  value  as  it  does  at  a  step  discontinuity,  and  a 
“flip”  is  less  likely  to  occur  in  other  windows. 

The  results  clearly  show  the  superiority  of  the  CBD 
estimator  over  LMedS.  It  is  highly  probable  that  due  to 
the  noise,  the  LMedS  search  procedure  will  result  in  an 
LMedS  value  that  corresponds  to  a  model  with  differ¬ 
ent  parameters  from  the  original.  Thus,  the  roof  top  is 
smoothed  by  the  LMedS  procedure.  The  CBD  estima¬ 
tor  considers  only  two  alternatives  for  parameters  per 
window  and  thus  obtains  the  correct  parameters. 

As  for  running  times,  the  CBD  based  smoothing  pro¬ 
cedure  ran  for  45  seconds  on  a  DECstation  while  the 
LMedS procedure took over 11 minutes on the same machine.
The CBD estimator thus achieves a speedup of
15  over  the  LMedS  estimator.  This  agrees  with  the  com¬ 
putational  complexity  comparison  since  p  =  3  and  50 
samples  were  taken  in  the  LMedS  procedure. 
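The agreement between the measured and predicted speed-up can be checked with the numbers quoted above. As a short worked calculation (taking "over 11 minutes" as a lower bound of 660 seconds, and writing $T$ for running time, which is notation introduced here):

```latex
\[
\frac{T_{\mathrm{LMedS}}}{T_{\mathrm{CBD}}}
  \;\gtrsim\; \frac{11 \times 60\ \mathrm{s}}{45\ \mathrm{s}} \approx 14.7 ,
\qquad
\frac{q\, n \log n}{p\, n \log n} = \frac{q}{p} = \frac{50}{3} \approx 16.7 .
\]
```

Both ratios are close to the reported factor of 15.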


Figure 24: a) A synthetic roof. b) A synthetic roof with
additive noise ($\sigma = 4$).

Figure 25: a) CBD based smoothing. b) LMedS based smoothing.


5.2 Simultaneous Step/Roof Edge Detection

When a robust [...] piecewise polynomial [...] the boundary
between the inlier and outlier regions is [...] delineation of a
discontinuity. The delineated boundary [...] invariant to a shift
of the processing window [...] discontinuity is within the window.
In real images the delineation of the boundary is not [...]
model inadequacy. [...] boundary candidate information from a set
of overlapping [...], however, can [...] allocate to every pixel in
the image a probability of [...] presence of a discontinuity at
that pixel. A pixel that [...] as an edge candidate in most
windows is more likely [...] pixel that only a minority [...]
reasonable output. [...] selection and [...] ratio still remains
[...] that in cases such as Figure 22, when LMedS [...] recover
the correct parameters for the fit, the [...] boundary between
[...] and outlier regions [...]. A complete description and analysis
of the edge detection algorithm is in preparation.

In Figure 26a a [...] range image is shown. Note the presence
[...] step and roof edges which are correctly delineated in
the output of the edge detector [...] range image edge de[...]


Figure 26: Simultaneous step/roof edge detection example.
a) Ring-on-steps range image. b) Edge output.

Figure 27: Simultaneous step/roof edge detection example.
a) Polyhedral object range image. b) Edge output.

6 Discussion

[...] the spatial and [...] decomposition yields [...] distributions
which are analyzed separately. Finding the extrema of a
multidimensional [...] criterion by proceeding along one
(parameter) direction at a time is a frequently used numerical
technique (see [11] for examples). These [...] are known by
different names: univariate search, [...], direction set, etc. [...]
and the decomposition is performed only in the space of the
parameters. The condition for convergence and the influence of
[...]. In the next phase of research we intend to apply the results
obtained by the above mentioned numerical methods within the
framework [...] estimator.

Recently Taylor [16] [...] Cooperative Independent Parameter [...]
(CIPS) approach for improving the performance of the Hough
transform in reconstructing range data with geometric models. In the CIPS
approach,  as  in  our  CBD  algorithm,  the  parameters  are 
also  analyzed  in  separate  subspaces.  The  parameters  are 
extracted  with  a  multiwindow  procedure,  but  fewer  spa¬ 
tial  subsets  are  used  than  in  the  CBD  algorithm.  The 
parameters  are  represented  in  discrete  accumulators  and 
considerable  computational  effort  is  required  to  reduce 
the  artifacts  of  this  quantization.  In  CBD  the  mode 
detection  process  uses  the  ordered  parameter  estimates 
sequence  without  quantization.  The  final  estimate  is  ob¬ 
tained  in  CIPS  by  combining  local  estimates  in  a  voting 
procedure.  The  CIPS  approach  and  the  CBD  algorithm, 
while  developed  independently,  have  some  common  prin¬ 
ciples.  However,  the  CIPS  is  tailored  toward  a  specific 
application  (object  reconstruction)  and  several  heuristic 
procedures  are  incorporated  into  it.  The  CBD,  on  the 
other  hand,  was  derived  from  the  class  of  high  break¬ 
down  point  robust  estimators  and  makes  extensive  use 
of  estimation  theory  concepts. 

The  CBD  algorithm  belongs  to  the  consensus  vision 
paradigm.  This  paradigm  assumes  that  the  data  can  be 
represented  by  a  model  of  known  structure  but  unknown 
parameters.  The  applicability  of  the  consensus  vision 
paradigm  is  contingent  upon  the  existence  of  estimates 
which  become  unbiased  whenever  local  agreement  be¬ 
tween  the  data  and  the  model  exists.  The  homogeneity 
conjecture  states  that  the  majority  of  the  local  estimates 
represent  agreement  with  the  model,  i.e.,  the  density  of 
discontinuities  in  the  data  is  not  extremely  high.  Then  a 
robust  analysis  of  their  global  distributions  can  be  used 
to  obtain  the  final  parameter  estimates  with  increased 
accuracy,  and/or  to  select  the  reliable  local  processes. 
In  this  paper  our  model  was  that  of  the  piecewise  poly¬ 
nomial  surface  structure  and  the  locally  LS  estimated 
parameters  are  the  unbiased  carriers  of  the  agreement 
with  the  model.  For  an  application  of  the  consensus  vi¬ 
sion  paradigm  to  edge  preserving  image  smoothing  see 
[10]. 

The  optimization  criterion  of  the  CBD  algorithm  is 
not  that  of  least  median  squared  residuals  (12).  The 
degree-0  LMedS  in  the  enhanced  mode  seeking  procedure 
minimizes  the  one-dimensional  criterion  (11)  for  each 
censored  mean  sequence  separately.  Thus  the  breakdown 
point  of  CBD  is  not  expressed  by  (13)  and  is  influenced 
by  the  reliability  of  the  censoring  process,  i.e.,  by  the 
signal-to-noise  ratio.  Our  results  show  that  in  the  range 
of  normal  SNR  the  CBD  algorithm  has  a  breakdown 
point  close  to  the  optimum  0.5.  The  absolute  majority 
detected  in  the  censored  mean  sequence  may  map  only 
into  a  relative  majority  in  the  data  (e.g.,  the  largest  set 
of  points  has  less  than  50  percent).  Such  cases  appear 
when  the  data  contains  two  discontinuities  and  thus  we 
may have a “virtual” breakdown point for the CBD algorithm
exceeding 0.5.

An  estimator  similar  to  least  median  of  squares  is  least 
trimmed  squares  (LTS),  in  which  the  sum  of  the  first  h 
ordered  squared  residuals  is  minimized  [13].  When  h 
equals  half  the  data  size  the  breakdown  point  of  LTS 
is  close  to  0.5.  The  presence  of  integration  (the  sum) 
in  the  LTS  optimization  criterion  makes  it  numerically 
more  stable  and  thus  more  suitable  for  applications  than 


LMedS  [14].  In  the  CBD  algorithm,  the  degree-0  LMedS 
module  could  easily  be  replaced  by  the  equivalent  LTS 
estimator.  Our  experiments  with  LTS  in  the  presence  of 
zero  mean  noise,  however,  did  not  show  a  significantly 
better  performance. 
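The two degree-0 criteria compared above can be sketched for one-dimensional data. This is a minimal sketch using the standard exact algorithms on sorted data: the LMedS location estimate is the midpoint of the shortest interval containing $h$ points, and the LTS location estimate is the mean of the contiguous $h$-subset with the smallest sum of squared deviations. The choice $h = \lfloor n/2 \rfloor + 1$ and the function names are assumptions for illustration, not the implementation used in the experiments.

```python
import numpy as np

def lmeds_location(values):
    """Degree-0 LMedS: midpoint of the shortest interval covering h points."""
    x = np.sort(np.asarray(values, dtype=float))
    n = x.size
    h = n // 2 + 1                         # assumed coverage, just over half
    widths = x[h - 1:] - x[:n - h + 1]     # width of every h-point interval
    start = int(np.argmin(widths))
    return 0.5 * (x[start] + x[start + h - 1])

def lts_location(values):
    """Degree-0 LTS: mean of the h-subset with least sum of squared deviations."""
    x = np.sort(np.asarray(values, dtype=float))
    n = x.size
    h = n // 2 + 1
    best_mean, best_cost = None, np.inf
    for start in range(n - h + 1):         # the optimal subset is contiguous
        subset = x[start:start + h]
        mean = subset.mean()
        cost = np.sum((subset - mean) ** 2)
        if cost < best_cost:
            best_mean, best_cost = mean, cost
    return best_mean

# Six inliers near 10 and four gross outliers (40 percent contamination).
data = [10.0, 10.2, 9.8, 10.1, 9.9, 10.0, 50.0, 55.0, 60.0, 52.0]
print(np.mean(data), lmeds_location(data), lts_location(data))
```

Both estimators return the inlier level while the sample mean is pulled toward the outliers. The LTS criterion averages over the retained points, which is the integration referred to above.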

We  conclude  by  recapitulating  the  advantages  and  dis¬ 
advantages  of  the  CBD  estimator  relative  to  LMedS.  As 
advantages  we  can  enumerate  better  tolerance  of  zero 
mean  noise,  less  sensitivity  to  inadequate  model  order, 
and  significant  speed-up.  The  disadvantages  are  the  re¬ 
quirement  for  larger  processing  windows  when  CBD  is 
used in window operators, and possible artifacts caused
by decisions taken on the censored mean sequences
instead of on the original data. Note that the
disadvantages  are  of  an  implementational  nature  while 
the  advantages  are  theoretical.  This  observation  makes 
us believe that CBD can become a valuable tool in computer
vision.

Abstract

The problem of chromatic aberration arises
because each wavelength of light is refracted
differently by the elements of a lens. Unfortunately,
this means that the image will be blurred
and distorted. In color imaging these distortions
cause measurable differences between the
images. Recent research has proposed an approach
for dealing with these aberrations by actively
controlling the optics of the imaging system.
This paper addresses the same problem,
but  instead  of  adapting  the  optics,  we  adapt 
the  geometry  of  the  (already  obtained)  images; 
we  do  chromatic  aberration  correction  by  image 
warping.  We  briefly  discuss  the  image  restora¬ 
tion/reconstruction  techniques  used,  since  they 
are  non-standard.  This  is  followed  by  a  discus¬ 
sion  of  the  techniques  used  to  define  the  chro¬ 
matic  aberration  correcting  warp.  The  tech¬ 
nique  is  demonstrated  and  analyzed  on  two  test 
cases  and  is  directly  compared  to  the  active  op¬ 
tics  approach. 

1  Introduction 

The first stage of an imaging system is the
lens, which refracts the incoming light to focus
it on the image plane. It has long been known
that the refraction of light depends upon the
wavelength: a ray refracted at the lens surface
becomes a small spectral fan. According to geometric
optics for a simple lens, if a point
object is placed in front of a lens, there would
be a plane at some focal distance where the image
of that point would be in focus, see Fig. 1.
Unfortunately, because refraction is wavelength
dependent, this focal distance is also wavelength
dependent. The difference, due to wavelength,
between the ideally focused image and actual
image is called chromatic aberration. Chromatic
aberration is generally broken up into two categories:
axial aberrations and lateral aberrations
[Slama et al., 80]. In axial aberrations, the focal
plane for one wavelength, say red, will be
displaced along the optic axis from the focal
plane for another wavelength, say blue. In lateral
aberrations, the image of a feature point in one
color will be displaced laterally within the
image plane with respect to its image in another color.
There are two parts to this lateral displacement.
In general the largest component is a difference
in the magnification, which causes a generally
radial translation of image features. Secondly,
there may be differences in the optic axis for the
different wavelengths.
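The two components of lateral displacement can be pictured with a toy model. The sketch below assumes the relative magnification between two color bands acts as a uniform scaling about an image center, and the difference in optic axis as a constant offset. This is only an illustration of the geometry described above, not the warp used later in the paper, and the function name and parameter values are hypothetical.

```python
import numpy as np

def lateral_displacement(points, center, magnification_ratio, axis_offset=(0.0, 0.0)):
    """Displacement of feature points between two color bands (toy model).

    points: (N, 2) pixel coordinates in the reference band.
    magnification_ratio: magnification of the other band relative to the reference.
    axis_offset: assumed constant shift caused by a different optic axis.
    """
    points = np.asarray(points, dtype=float)
    center = np.asarray(center, dtype=float)
    offset = np.asarray(axis_offset, dtype=float)
    mapped = center + magnification_ratio * (points - center) + offset
    return mapped - points

# Points at 0, 100 and 200 pixels from the center of a 512 x 512 image.
pts = [[256.0, 256.0], [356.0, 256.0], [456.0, 256.0]]
disp = lateral_displacement(pts, center=(256.0, 256.0), magnification_ratio=1.004)
print(disp)
```

With a 0.4 percent magnification difference the displacement is zero at the center and grows linearly with the distance from it, which is the radial translation mentioned in the text.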

The fact that some parts of a color image
are more blurred than others, or that there are
wavelength dependent distortions, may not seem
too important at first. If our algorithms were
truly robust this might be true, but increased
accuracy is always desirable, especially if it does
not require expensive new equipment.

For  example,  consider  a  technique  which  re¬ 
quires  clustering  in  color  space.  The  effects  of 
chromatic  aberration  can  be  very  important,  es¬ 
pecially  if  the  clustering  makes  assumptions  on 
the  shape/distribution  of  the  cluster.  Examine 
the  color  space  depicted  in  figure  2  and  imag¬ 
ine  trying  to  find  a  “T”  which  is  a  process  re¬ 
quired  in  the  color  segmentation/highlight  de¬ 
tection  algorithm  in  [Klinker,  88].  Even  if  a 
color vision technique does not use the actual
values of the RGB triples, but rather just uses
edges detected separately in each color band,
the misregistration of one or two pixels may be
significant.

Figure 1: Geometric optics interpretation of chromatic aberration caused by a
thin lens. Longer wavelengths focus long, and are magnified more than shorter wavelengths.

Once  one  determines  that  chromatic  aberra¬ 
tion  correction  is  necessary  for  a  vision  applica¬ 
tion,  there  are  at  least  three  things  to  do.  First 
you  can  go  out  and  buy  an  expensive  lens  and 
hope  it  is  properly  corrected.*  We  briefly  dis¬ 
cuss,  in  the  next  subsection,  some  of  the  tech¬ 
niques  of  lens  designers. 

A  more  recent  development  is  the  use  of  ac¬ 
tive  lens  control  to  achieve  reduced  chromatic 
effects. This technique, developed by R. Willson
and S. Shafer at CMU [Willson and Shafer,
91a]  and  [Willson  and  Shafer,  91b],  takes  three 
separate  images  with  slightly  different  focus  and 
zoom  settings  designed  to  compensate  for  the 
optics.  This  is  discussed  in  section  1.2. 

The  final  choice  is  to  do  image  warping  for 
chromatic  correction,  as  described  in  this  pa¬ 
per.  We  will  show,  in  section  2.2  how  to  de¬ 
termine  the  warping  function  and  briefly  recall 
how  to  do  the  warping  and  the  image  restora¬ 
tion/reconstruction  needed  for  it.  Then  we  will 
demonstrate  and  analyze  the  algorithm  in  sec¬ 
tion  3.2. 

We are not the first researchers to suggest
using image warping for image registration or
correction, e.g. NASA has used image warping
in various applications, see [Green, 89], and
much of the early work on image reconstruction
centered around “digital correction”, e.g.
see [Rifman and McKinnon, 74]. We are unaware,
however, of any quantitative experimentation
studying the effectiveness of warping to
correct for chromatic aberration.

*In [Willson and Shafer, 91b] they report that they have
tested a number of lenses, including some that are supposed to
be chromatic correcting lenses, only to find significant errors
in every lens.

1.1  Chromatic  aberration  correction  by 
better  lens  design 

We now briefly discuss the “traditional” techniques
for dealing with chromatic aberration.
This review of lens design is based on the material
in [Slama et al., 80] and [Kingslake, 78].
The subject of designing good lenses is quite involved
and increasingly uses computer simulations.
While one might think the space of lens designs
has been completely explored, modern techniques
continue to produce better lenses, see
[Laikin, 91]. If the price is right, a lens can
be designed to meet particular imaging criteria.
However, most vision researchers use off
the shelf lenses with either no chromatic correction
or only simple correction. In addition, the
correction of aberrations generally becomes
more difficult as one reduces the focal length,
increases the aperture, increases the field of view,
or allows zooming. Note that wide field of view,
large aperture zoom lenses, with short minimal
focal length, are exactly what many vision researchers
use.

Figure 2: Upper left shows a test image (the blue channel). The figures on the right show 2D
histograms of color space: top is blue vs green, bottom is red vs green. If there were no chromatic
aberration, both of these plots would be a line. The original images were taken when the lens was
focused for blue light. Thus for this uncorrected lens system, the green channel shows some errors,
and the red channel much larger errors. The graph in the lower left is the plot of red vs blue for a
square window in the lower left of the image. These plots are described in more detail in section 3.

Axial chromatic aberration is usually handled
by using multiple elements, some positive, others
negative, with different optical indices. These
elements are chosen such that if the first causes
the longer wavelength (red) to focus too far, the
next causes the shorter wavelength (blue) to
focus long. A standard item, called an achromatic
doublet, uses two lenses made with different
glasses such that it can bring two wavelengths
into alignment on the axis. When there
are more elements in the system, more wavelengths
can be brought into axial alignment. For
the simple “corrected” lenses, two wavelengths
are brought into correspondence, resulting in a
quadratic-like error between these wavelengths.
In such lenses, deep violets and deep reds generally
focus short and the middle wavelengths,
like yellow, focus long, see [Slama et al., 80],
[Kingslake, 78]. Note that this correction is generally
measured on the optic axis, and in simple
systems degrades with increasing distance from
the axis. Many “three lens” systems correct on
the axis and in a zonal band some distance from
the optic axis, producing a spatially varying correction.
There is often a tradeoff between correction
of chromatic aberration and correction
of spherical aberration and coma. Still, high
quality photogrammetric lenses can be obtained
with axial chromatic errors of less than .01%.
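As a worked illustration of the two-element correction described above, the standard thin-lens achromat condition can be written out. This is a textbook result rather than something taken from the cited references, and it assumes two thin lenses in contact with powers $\phi_1$, $\phi_2$ and Abbe numbers $V_1$, $V_2$. The glass values in the example are hypothetical round numbers chosen only to make the arithmetic easy to follow.

```latex
\[
\phi_1 + \phi_2 = \phi, \qquad
\frac{\phi_1}{V_1} + \frac{\phi_2}{V_2} = 0
\quad\Longrightarrow\quad
\phi_1 = \frac{\phi\,V_1}{V_1 - V_2}, \qquad
\phi_2 = -\frac{\phi\,V_2}{V_1 - V_2}.
\]
% Hypothetical example: f = 100 mm (phi = 0.01 / mm), V_1 = 60, V_2 = 36.
\[
\phi_1 = \frac{0.01 \cdot 60}{24} = 0.025\ \mathrm{mm}^{-1}
\;(f_1 = 40\ \mathrm{mm}), \qquad
\phi_2 = -\frac{0.01 \cdot 36}{24} = -0.015\ \mathrm{mm}^{-1}
\;(f_2 \approx -66.7\ \mathrm{mm}).
\]
```

The positive and negative elements have opposite dispersion contributions ($0.025/60 = 0.015/36$), which is why only two wavelengths coincide and a residual error remains at the others.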

For simple lenses the correction for lateral
aberration is, in theory, easier. To correct for
the first order effects, all that is necessary is for
the prismatic effects of the first lens to be canceled
in the second. For simple symmetric lens
designs this is straightforward, see [Slama et al.,
80]. Unfortunately, for asymmetric lenses, e.g.
zoom and telephoto lenses, the correction is
considerably more difficult (some of these have
15+ lenses with complex internal moving parts).
Additionally, if the lens has higher order lens
effects, i.e. coma, radial and tangential geometric
distortions, these are also wavelength
dependent, complicating the correction process.

While the “older” lenses were often designed
by hand, and as a result corrected only for a
few wavelengths at a few points, modern computational
techniques have given lens designers
the ability to numerically design lenses which
have minimal aberration. Actually manufacturing
and maintaining such compensation is, however,
another issue. With increasing part count,
the chance of misalignment of an internal
lens (or shifting of the lens after manufacturing)
greatly increases. Furthermore, it is very
common in modern lenses to use coatings on
the optics to reduce reflections. Such coatings
are also wavelength dependent, and good systems
use a multi-coating technique to reduce
the wavelength dependence and decrease chromatic
aberrations. Such coatings are, however,
easily damaged or removed, severely impacting
the performance of the lens.

1.2  Chromatic  aberration  correction  by 
active  control 

Because of the emerging use of color imaging
in computer vision, and because of the expense
of well corrected optics (which are still
not perfect), researchers at CMU have recently
been investigating the use of active lens elements
to deal with chromatic aberration, see
[Willson and Shafer, 91a], [Willson and Shafer,
91b]. We describe their approach in some detail
since our experiments use their data and our
results are compared with theirs.

The CMU active optics approach to chromatic
aberration correction has three main steps:

1. determination of best focus for each color,

2. determination of a magnification factor for
red and green,

3. and determination of camera shift to align
images.

The first stage corrects for axial chromatic
aberration by varying the focal plane for each of
the colors. Note that for images with only three
wavelengths, this can exactly correct for axial errors.
For more complex images this will have
zero error for three wavelengths and, hopefully,
small error for the remaining wavelengths. This
approach can be applied even if the lens was
chromatically corrected. The technique is automated
and determines best focus using an algorithm
by E. Krotkov, see [Krotkov, 87]. This
involves computing an image sharpness measure
at a number of focal positions and searching for
the maximum of this sharpness measure.
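The focus search just described can be sketched as follows. This is only an illustration: the squared-difference sharpness measure and the exhaustive search over positions are assumptions made for brevity, not necessarily the criterion or search strategy of [Krotkov, 87], and the names `sharpness`, `best_focus` and `sweep` are hypothetical.

```python
def sharpness(image):
    """Sum of squared first differences (a simple gradient-energy measure)."""
    total = 0.0
    for r in range(len(image) - 1):
        for c in range(len(image[0]) - 1):
            dx = image[r][c + 1] - image[r][c]
            dy = image[r + 1][c] - image[r][c]
            total += dx * dx + dy * dy
    return total


def best_focus(images_by_position):
    """Return the focus position whose image has the largest sharpness."""
    return max(images_by_position,
               key=lambda p: sharpness(images_by_position[p]))


# Hypothetical focus sweep for one color band: three tiny images of one edge.
sweep = {
    10: [[20, 40, 60, 80]] * 3,    # strongly defocused
    20: [[0, 0, 100, 100]] * 3,    # in focus
    30: [[0, 33, 67, 100]] * 3,    # slightly defocused
}
print(best_focus(sweep))  # prints 20
```

In the active optics setting this search would be run once per color band, giving a separate focus setting for red, green and blue.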

The second stage begins to correct for lateral
chromatic aberration. They determine a
magnification factor for each band and use this
to actively control the zoom lens. Unlike focus
determination, this stage requires more than a
simple image-based operator; it requires some
type of geometric calibration image. They have
used subpixel detection of edges (vertical and
horizontal) to determine this zoom. They then
compute the magnification needed and actively
update the lens for each of the green and red
images.†

The final stage is to deal with differences in
the image of the optic axis. Because the lens elements
are not all perfectly aligned, it will often
occur that the image of the red optic axis is displaced
with respect to the blue optic axis. When
this difference is coupled with refocus and zooming,
the effect can become more pronounced.
Thus, using the above mentioned edge data, they
also compute a shift for red and green to bring
them into correspondence with blue. This is
done by physically shifting the camera in the
plane of the imaging sensor. Obviously, because
of the magnification effects of the lens,
very small shifts, on the order of .005 in, will
generally be used.
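To make the roles of the magnification factor and the shift concrete, here is one way such quantities could be estimated from matched edge positions. The text does not give the estimation procedure used at CMU, so the least-squares fit below is an assumption, restricted to one image axis, and the function name and data are hypothetical.

```python
def fit_scale_and_shift(band_positions, blue_positions):
    """Least-squares fit of blue = scale * band + shift for matched edges."""
    n = len(band_positions)
    mean_x = sum(band_positions) / n
    mean_y = sum(blue_positions) / n
    sxx = sum((x - mean_x) ** 2 for x in band_positions)
    sxy = sum((x - mean_x) * (y - mean_y)
              for x, y in zip(band_positions, blue_positions))
    scale = sxy / sxx
    return scale, mean_y - scale * mean_x


# Synthetic data: red edges magnified by 1.002 and shifted by -0.5 pixels
# would land on the blue edges.
red_edges = [100.0, 200.0, 300.0, 400.0]
blue_edges = [1.002 * x - 0.5 for x in red_edges]
scale, shift = fit_scale_and_shift(red_edges, blue_edges)
```

In this picture `scale` plays the role of the magnification correction applied through the zoom, and `shift` the correction applied by moving the camera.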

We feel it is worth noting that their paper
also addresses use of active optics in shape from
focus computations. In this case, the active optics
for chromatic correction adds only a minimal
incremental cost to that of shape from focus.
Because they also correct for axial aberrations (by
refocusing), they can, theoretically, correct for
blurs that would be problematic in shape-from-focus
when used on a colored world.

†Blue is assumed as the correct image. Their errors might
actually be smaller than reported in their paper had they
chosen green to be the standard and actively corrected both
red and blue.

2  Correction  Using  Image  Warping 

When we did our initial work on imaging-consistent
reconstruction algorithms, [Boult and
Wolberg, 91], we realized that one use of such
techniques would be in image warping to correct
for lens aberrations. However, at the time
we felt that the cost of image warping for
lens correction was probably unwarranted, since
we assumed (from reading [Slama et al., 80])
that the available lenses probably did not have
significant chromatic aberrations and that radial
type lens distortions could be handled by
directly mapping the location of feature points
as opposed to warping the intensity image before
feature detection. When we heard a talk
by Reg Willson about the CMU active lens
approach to chromatic aberration, we realized
we were wrong; standard CCTV lenses have
noticeable aberrations that require correction.
Furthermore, an increasing number of “physics-based”
vision algorithms use the actual radiometric
quantities measured, and hence they need
more than just a calibration of the geometric
distortions; they need to have the intensities
registered. Unlike human interpretation, these
physics-based algorithms are probably not particularly
robust with respect to violations of their
imaging assumptions.

There are two main parts of the image warping
approach to chromatic aberration correction:
how to warp images in general, and determining
what warp to apply. We will briefly
discuss each of these.

2.1  Image  warping 

Image warping has been most commonly used
in graphics, where the underlying question is
“does it look good?”, a qualitative assessment.
To use image warping in vision we need to address
the question “are the pixel values correct?”,
a more quantitative question.

For complex warps we described a technique
to increase the accuracy of the warp while maintaining
low cost, see [Wolberg and Boult, 89].
This separable image warping technique warped
the image twice and then combined the results
to increase the accuracy. Fortunately, the warps
needed to correct for most lens distortions are
not so severe and do not require such a complex
algorithm. Instead, ignoring boundary effects,
we can simply warp the image in one direction
(say x) and then the other direction (y). This
is accomplished row by row and then column by
column.‡ This allows for both efficient pipelining
(within a scan) and parallelism (each row is
warped independently).
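The row-then-column organization can be sketched as below. This is only a structural illustration: the 1D resampler here uses plain linear interpolation with point sampling as a stand-in for the restoration and resampling scheme developed in this section, and `source_x` and `source_y` are hypothetical inverse maps giving, for each output coordinate, the input coordinate to read from. The second pass ignores the coupling between the two passes, which is only reasonable for mild distortions.

```python
def resample_1d(signal, source_position):
    """out[j] = signal sampled (linear interpolation) at source_position(j)."""
    n = len(signal)
    out = []
    for j in range(n):
        t = min(max(source_position(j), 0.0), n - 1.0)  # clamp at the borders
        i = min(int(t), n - 2)
        frac = t - i
        out.append((1.0 - frac) * signal[i] + frac * signal[i + 1])
    return out


def warp_separable(image, source_x, source_y):
    """Warp every row along x, then every column of the result along y."""
    rows = [resample_1d(row, lambda j, r=r: source_x(j, r))
            for r, row in enumerate(image)]
    cols = [resample_1d(list(col), lambda j, c=c: source_y(j, c))
            for c, col in enumerate(zip(*rows))]
    return [list(row) for row in zip(*cols)]   # transpose back to row-major


image = [[1, 2, 3], [4, 5, 6], [7, 8, 9]]
# Shift each row one pixel to the right, leave columns unchanged.
shifted = warp_separable(image, lambda x, r: x - 1, lambda y, c: y)
```

Each row in the first pass, and each column in the second, depends only on its own data, which is what makes the pipelined and parallel implementations mentioned above possible.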

This leaves us with the question of how to
warp a regularly sampled 1D signal into another
1D array of pixels, with a nonlinear transformation
between their geometry. To do this we
need some approximation to the signal underlying
the input 1D signal, and some way of warping
this signal. We approximate the signal with
what we call imaging-consistent reconstruction/restoration
filters. These linear filters use a
model of the input PSF (blur) within a pixel to obtain
a functional restoration. This functional
form is then warped and reblurred according
to an output PSF, using an approach
we call the integrating resampler. An important
property of these filters is that they do not
necessarily pass through the data, but rather
when blurred according to the assumed input
PSF they return the original data. While there
are many variations on this idea, this paper
mainly considers one, called a quadratic restoration
filter with a rect type PSF, or quadratic-box
restoration for short. We define the function
here, but see [Boult and Wolberg, 91] for
more details and discussion. This filter is local,
so consider pixel $i$ with measured value $v_i$, which spans the
interval from $k_i$ to $k_{i+1}$. We assume some interpolation
technique, e.g. linear interpolation
or cubic-convolution ([Rifman and McKinnon,
74]), is used to determine values $e_i$ at $k_i$. The
value of the quadratic-box restoration function
is then given by:

$$e_i + (6v_i - 2e_{i+1} - 4e_i)\,x + 3(e_{i+1} + e_i - 2v_i)\,x^2$$

where $x = (t - k_i)/(k_{i+1} - k_i)$ is the normalized position of a point $t$
within the pixel. The integral of this quadratic
over the pixel ($x$ from 0 to 1, i.e. the interval $k_i$ to $k_{i+1}$) is exactly $v_i$, the measured
input pixel value. In this paper we consider
the use of linear interpolation as a means
of determining the endpoints because it is cheap.
We have also determined the endpoints using
cubic-convolution with numerous values of
its “magic” parameter A, with no significant
visible difference in this application.

‡Actually, for increased accuracy we might use full 2D filtering,
but this would really push up the cost of the algorithm.
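A small numerical sketch of the quadratic-box restoration may help. It assumes unit-width pixels, linear interpolation for the endpoint values, and replication of the first and last pixel values at the image border (the border rule is our assumption here, not something specified above). The function names are hypothetical.

```python
def endpoint_values(values):
    """Endpoint estimates e[0..n] at the pixel boundaries (linear interpolation)."""
    e = [values[0]]                                   # border: replicate
    for i in range(1, len(values)):
        e.append(0.5 * (values[i - 1] + values[i]))
    e.append(values[-1])                              # border: replicate
    return e


def quadratic_box(values, i, x):
    """Restored signal inside pixel i at normalized position x in [0, 1]."""
    e = endpoint_values(values)
    v, e0, e1 = values[i], e[i], e[i + 1]
    return e0 + (6 * v - 2 * e1 - 4 * e0) * x + 3 * (e1 + e0 - 2 * v) * x * x


def pixel_integral(values, i, a=0.0, b=1.0):
    """Exact integral of the restoration over x in [a, b] within pixel i."""
    e = endpoint_values(values)
    v, e0, e1 = values[i], e[i], e[i + 1]
    c1 = 6 * v - 2 * e1 - 4 * e0
    c2 = 3 * (e1 + e0 - 2 * v)

    def antiderivative(x):
        return e0 * x + c1 * x * x / 2.0 + c2 * x ** 3 / 3.0

    return antiderivative(b) - antiderivative(a)


values = [10.0, 40.0, 20.0]
print(pixel_integral(values, 1))      # integral over the middle pixel
print(quadratic_box(values, 1, 0.5))  # restored value at the pixel center
```

For the middle pixel the restoration takes the endpoint values 25 and 30, integrates to the measured value 40, and reaches 46.25 at the pixel center, illustrating that the restored function need not pass through the data.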

Given the “restoration” of the input function,
we now have a functional form to warp, which
we do using the following idea we call the integrating
resampler. Assume the n input pixels,
at locations $k_i$, $i = 1..n$, are being mapped
into m output pixels $o_j$, $j = 1..m$, according to a
warping function $g(t)$. Compute $q_j$ as the linear
approximation to the location of $g^{-1}(o_j)$, i.e.
(for unit-width input pixels)

    for (j = i = 0; j < m; j++) {
        while (g(k_{i+1}) < o_j) i++;
        q_j = k_i + (o_j - g(k_i)) / (g(k_{i+1}) - g(k_i));
    }

Now to process the data we run along the input
determining the next event; either an input
pixel will be consumed, or an output pixel will
be generated. If the next event will be the completion
of an input pixel, we compute the integral
from the location of the last event to the
end of this input pixel, adding the value to the
accumulator. If, however, the next event is the
generation of output pixel j, we use $q_j$ as an
approximation of $g^{-1}(o_j)$, which gives the location
of this event in the input space. We then
compute the integral from the location of the
last event to the location of this event, and add
this value to the accumulator.** For more details
on this process see [Wolberg and Boult, 91].
The underlying idea of this integrating resampler
can be found in [Fant, 86], which proposed a
similar algorithm for the special case of warping
a linear reconstruction of the input.
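The event-driven process described above can be sketched in a few lines. To keep the sketch short it makes several assumptions that go beyond the text: the input is restored as a piecewise-constant (box) function rather than with the quadratic-box filter, pixels have unit width with boundaries at the integers, the warp is strictly increasing, and each output value is normalized by the width of its pre-image so that a constant image stays constant. The names `inverse_locations` and `integrating_resample` are hypothetical.

```python
def inverse_locations(g_at_k, m):
    """Linear approximation q[j] of g^{-1}(j) for j = 0..m.

    g_at_k[i] is g evaluated at input pixel boundary k_i = i (i = 0..n)
    and must be strictly increasing.
    """
    n = len(g_at_k) - 1
    q, i = [], 0
    for j in range(m + 1):
        while i < n - 1 and g_at_k[i + 1] < j:
            i += 1
        t = (j - g_at_k[i]) / (g_at_k[i + 1] - g_at_k[i])
        q.append(min(max(i + t, 0.0), float(n)))   # keep inside the input
    return q


def integrating_resample(values, q):
    """Output pixel j is the average of the input over [q[j], q[j+1]]."""
    n = len(values)
    out = []
    pos = q[0]                       # input-space location of the last event
    i = min(int(pos), n - 1)         # input pixel currently being consumed
    for j in range(len(q) - 1):
        acc, end = 0.0, q[j + 1]
        while i + 1 < end:           # event: input pixel i is completed
            acc += values[i] * ((i + 1) - pos)
            pos = i + 1
            i += 1
        acc += values[i] * (end - pos)   # event: output pixel j is generated
        pos = end
        width = end - q[j]
        out.append(acc / width if width > 0 else 0.0)
    return out


# Magnify two pixels by a factor of two: g(t) = 2t sampled at k = 0, 1, 2.
q = inverse_locations([0, 2, 4], 4)
print(integrating_resample([10, 20], q))   # prints [10.0, 10.0, 20.0, 20.0]
```

Replacing the term `values[i] * (b - a)` with the exact integral of a higher-order restoration over the same sub-interval gives the corresponding resampler for that restoration.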

As mentioned before, other researchers have
used image warping in vision. In particular, researchers
at NASA have frequently used image
warping, though more often to allow better human
viewing than for the application of vision
algorithms. A standard approach is described
in [Green, 89]. This approach uses a very simple
reconstruction filter (bi-linear interpolation)
with point sampling. While an example is presented,
there is no quantitative analysis of the
approach.

**In the general case we would compute the integral with
the output PSF, in this case a rect filter.

2.2  Computing  the  distortion 

As mentioned previously, there are really two
related types of chromatic aberration, axial and
lateral. The main goal of image warping-based
chromatic aberration correction is to correct for
the lateral aberrations. To do this we start with
the same types of geometric features used to
compute the zoom and shift factors in the
active lens approach. In this case we have data
on the location of horizontal and vertical edges in
each of the R, G, and B images (with the blue
image focused). These are the same uncorrected
images used in [Willson and Shafer, 91b], and
we use the edge data computed by their algorithms.

The test target was a checker-board pattern,
see Fig. 2. For each of the sides of each checker,
the horizontal edge position provides a stable
measurement. Since the feature detector only
computes the horizontal edge position near a
vertex, each horizontal edge position has an accompanying
vertical position which, while not
stable enough to use in warping, facilitated grouping
of related horizontal edges. Edges with approximately
the same vertical location were thus
associated to form a “row” of horizontal positions
in each color band.†† A similar matching
was done to form “columns” of vertical edge position
data, using the tops and bottoms of each
checker. For the CCTV example below we obtained
270 horizontal edge points and 416 vertical
points. For the Photometrics example there
were 210 horizontal edge points and 220 vertical
edge points. We do this processing for each
of the three color bands. Using the edges in the
blue image as the desired location, we use the difference
between those positions and the computed
edge positions for red (or green) to define the
warp. To find the warping at other points we
need to fit some model. To maintain flexibility,
we choose to fit a cubic spline through the location
data in each row/column and consider the
warp to be the tensor product of these splines.

††Unfortunately, not every row had the same number of
detected horizontal edge features. To make our computation
easier, we used linear interpolation between the data points
with minimal variation to get an equal number of datapoints
per row.
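The per-row spline fit can be illustrated with a small self-contained sketch. It assumes natural end conditions (zero second derivative at the first and last data point), since the end conditions are not stated above, and it covers only the 1D fit along one row, not the tensor product. The function name and the edge positions are hypothetical.

```python
from bisect import bisect_right


def natural_cubic_spline(xs, ys):
    """Return a function evaluating the natural cubic spline through (xs, ys)."""
    n = len(xs) - 1
    h = [xs[i + 1] - xs[i] for i in range(n)]
    m2 = [0.0] * (n + 1)                   # second derivatives; ends are zero
    if n >= 2:
        diag = [2.0 * (h[r] + h[r + 1]) for r in range(n - 1)]
        rhs = [6.0 * ((ys[r + 2] - ys[r + 1]) / h[r + 1]
                      - (ys[r + 1] - ys[r]) / h[r]) for r in range(n - 1)]
        for r in range(1, n - 1):          # forward elimination (tridiagonal)
            w = h[r] / diag[r - 1]
            diag[r] -= w * h[r]
            rhs[r] -= w * rhs[r - 1]
        m2[n - 1] = rhs[n - 2] / diag[n - 2]
        for r in range(n - 3, -1, -1):     # back substitution
            m2[r + 1] = (rhs[r] - h[r + 1] * m2[r + 2]) / diag[r]

    def evaluate(x):
        i = min(max(bisect_right(xs, x) - 1, 0), n - 1)
        a, b = xs[i + 1] - x, x - xs[i]
        return ((m2[i] * a ** 3 + m2[i + 1] * b ** 3) / (6.0 * h[i])
                + (ys[i] / h[i] - m2[i] * h[i] / 6.0) * a
                + (ys[i + 1] / h[i] - m2[i + 1] * h[i] / 6.0) * b)

    return evaluate


# Hypothetical edge columns along one row: where each edge should be (blue)
# and where it was actually found in the red image.
blue_x = [20.0, 120.0, 220.0, 320.0, 420.0]
red_x = [19.6, 119.9, 220.0, 320.2, 420.5]
source_of = natural_cubic_spline(blue_x, red_x)
red_location_for_column_170 = source_of(170.0)
```

Here `source_of` maps a desired (blue) column to the column of the red image that should be moved there, which is the form of mapping a 1D resampler needs.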

Once we have computed the warping function,
we can then use the integrating resampler,
as described in section 2.1 and [Wolberg and
Boult, 91], to warp the red channel and the
green channel to their respective desired positions.
Note that other techniques might be used
to determine the warping function. In particular,
a future direction will be to consider a global
deformation model, which might be parameterized
by the lens setting for focus and zoom. We
will also examine warping all three images to
correct for geometric lens distortions as well as
chromatic aberration. We note that in [Green,
89], they used correlation between features in
each color channel, as well as a priori calibration
information. Again, however, the results
seemed to be intended for human analysis and no
quantitative analysis was given. Future work
will compare use of correlation with edge based
warp determination.

3  Experimental  Analysis 

In this section, we discuss the results
of applying the image-warping chromatic
aberration correction technique to two test images.
The test images were obtained from the
Calibrated Imaging Laboratory at CMU, and
were used in [Willson and Shafer, 91b] to describe/test
the active optics approach to chromatic
aberration correction. We first discuss
how to measure the quality of a correction, then
get into the actual data.

3.0.1 Photometrics examples

These images were also collected at CMU,
and used a Photometrics camera connected to
a Matrox frame grabber. The lens was a Fujinon
motorized zoom lens. For the color filters
they used a Hoya IR block + the same RGB
filters. The 1/2” checkerboard was imaged at
a distance of 2.03m. The original image is not
shown. The actual sensor array (384x576) is
larger than the lens’s spot size, and so the images
were clipped to 338 by 388. For each image
10 frames were averaged together. The integer
pixel values (0..4059) were converted to
the range (0..255). The color histograms of the
uncorrected data are shown in Fig. 7.


3.1 Determining the quality of a correction
and displaying the results

In their paper, [Willson and Shafer, 91b], Willson
and Shafer use the displacement of the location
of the zero-crossing edges as an error measure.
In this paper we do not use such geometric
measures but rather use colorimetric measures.
The primary reason for this change in
error measures is that we directly manipulate
the geometry of the image and felt this
would be an unfair measure to apply to our algorithm.
Secondly, the edge position does not
actually address the issue of focus in each band,
since a blurry red image might have its edge in
the same position as blue but the color properties
of the images could be poor.

Since the actual scene is a black and white
checker pattern, all pixels in the scene should
image to some shade of gray. To visualize the
errors we consider two techniques: direct plotting,
and computing an error measure. The
first, direct display, involves plotting two 2D histograms
where for each pixel its b pixel value is
used as the y coordinate of the plot while the
x coordinate is the r pixel value or the g pixel
value. In these plots, there is some ideal curve
for the camera response. If the camera is linear
the ideal would be a straight line. The important
information in these plots is the spread of
the points around their “central” tendency: the
wider the spread, the more uncorrected the images.
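A sketch of how such a 2D histogram might be built follows. It assumes integer pixel values in the range 0..255, and it uses log(count + 1) for the logarithmic display so that empty bins map to zero; that particular display mapping is our assumption. The bin count and the count that maps to black are left as parameters, and the function names are hypothetical.

```python
import math


def histogram_2d(x_values, y_values, bins=64, max_value=255):
    """hist[row][col] counts pixels whose (y, x) values fall in that bin."""
    hist = [[0] * bins for _ in range(bins)]
    scale = bins / (max_value + 1)
    for x, y in zip(x_values, y_values):
        hist[int(y * scale)][int(x * scale)] += 1
    return hist


def log_display(hist, black_count=1000):
    """Map counts to [0, 1] on a log scale; counts >= black_count clip to 1."""
    top = math.log(black_count + 1)
    return [[min(math.log(c + 1) / top, 1.0) for c in row] for row in hist]


# Three perfectly gray pixels (r == b) fall on the diagonal of the plot.
red = [0, 128, 255]
blue = [0, 128, 255]
hist = histogram_2d(red, blue)
display = log_display(hist)
```

With perfectly registered channels and a linear camera, all counts would fall along a single line; chromatic aberration shows up as counts spread away from it.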

Examples of these plots can be seen in Fig. 2,
which shows the 2D histograms for the uncorrected
images for the CCTV camera example.
Given the intensity ranges of the images, we show
only the part of the histogram containing data. On
the right of that figure are the plots for green-vs-blue
(top) and red-vs-blue for what we call
the big window. Because we did not have information
about the warp at the very edges, and
because we have no global model to allow extrapolation
of the warping function, we consider only
the window with rows [15, 465] and columns [15,
497]. Note that the active optics approach also
has its poorest performance near the boundary,
and all comparisons will consider its performance
in the same window. The 2D histograms
for the big window use intensity to encode the
logarithm of the number of items in each of the
64x64 bins, with black meaning 1000 pixels in
that bin. Note that a few bins are clipped because
they contain > 1000 pixels. On the left
of Fig. 2, we see the 2D histograms for the red
components, this time restricted to a window
in the lower left of the image, [15 65] x [15
65]. This represents a region which is, approximately,
maximally distant from the optic axis
and hence where one would expect to find maximal
artifacts. The plots are similar in nature
except that they use a log scale with 100 items
being black.

Figure 3: 2D histograms showing dispersion in color space using the big window [15 465] x [15
497]. The top row shows the corrections to green (blue-green histogram), and the bottom shows
corrections for red. The left shows the results obtainable with active optics. The right shows the
image warping results. Both techniques work extremely well on green and do a credible job on
red; for quantitative comparisons see table 1. Overall the active optics approach is better (tighter
cluster) and also more symmetric in its error. The sigmoid shape which is slightly visible in the
blue-red histogram for image warping is caused by differential blur after correcting for magnification
effects.

The remaining measures we will report are
quantitative. While it seems intuitive to consider
RMS distance to the gray-line, this has a
problem: the image values are not normalized,
and we did not have radiometric calibration information.
Thus we adopted the following approach.
Because the only information is near
the edges, we restricted our attention to this
region.‡‡ The rows in tables 1 and 3 marked
“outside mask” computed their values outside
this mask region to show the size of the error
measures outside this region of interest.

‡‡Determined as an expansion of the region where the gradient
magnitude was > 4 for the CCTV example and > 2
for the Photometrics examples. (Obtained by the KBVision
sequence FastGauss(2), Gradim, Normlm, Thrshim followed
by a MorphOps with “d8 5e8”.)

Figure 4: 2D histograms showing dispersion in color space using different windows. The figure shows the
results on the box from the lower left of the image, [15 65] x [15 65]. In this case the active optics
(left) does not do as well, probably because of geometric lens distortions affecting the image. (The
open nature of the plot is indicative of a magnification error.) The image warping approach, lower
right, does reasonably well. It too has more problems in the corners, possibly because of inaccuracies
in the warping, plus focus effects. These observations are also supported by the quantitative results
in table 1.

Table 1: Table of error for CCTV examples. As can be seen, the active optics approach produces
quantitatively better values for all examples except the one in the lower left part of the image. Note
the differences in the blur related measures on the vertical vs horizontal windows in the images.

Algorithm      Region              Gray-line  BW-RGB  BW-R   BW-G   BW-B
                                   error      error   error  error  error
Uncorrected    outside mask        237.234    0.096   0.073  0.069  0.076
Uncorrected    [15 465] [15 497]   568.381    0.363   0.300  0.261  0.260
Active Optics  [15 465] [15 497]   378.378    0.353   0.284  0.257  0.260
Image Warping  [15 465] [15 497]   411.030    0.364   0.301  0.261  0.260
Uncorrected    [15 65] [15 65]     616.824    1.049   0.861  0.761  0.756
Active Optics  [15 65] [15 65]     557.978    1.039   0.838  0.765  0.756
Image Warping  [15 65] [15 65]     449.579    1.053   0.869  0.760  0.756
Uncorrected    [50 400] [270 300]  778.941    0.880   0.692  0.636  0.673
Active Optics  [50 400] [270 300]  [?]        0.875   [?]    0.637  0.673
Image Warping  [50 400] [270 300]  703.478    0.880   0.693  0.635  0.673
Uncorrected    [235 260] [50 430]  554.214    0.831   0.692  0.594  0.592
Active Optics  [235 260] [50 430]  363.741    0.800   0.641  0.585  0.592
Image Warping  [235 260] [50 430]  472.019    0.833   0.696  0.595  0.592

Figure 5: 2D histograms showing dispersion in color space using only edges of one type. The upper
plots show the performance of the two algorithms (active optics on the left) when applied to a vertical
window with extents [50, 400] x [270, 300]. This window contains only horizontal edges. In this
case the active optics does extremely well (it is best on or near axis). As can be seen in the
upper right, the image-warping technique has a sigmoid shaped bias. This is caused by the difference
in focus between the red and blue channels. The bottom row uses the horizontal box [235, 260]
x [15, 430], which contains only vertical edges. Again, active optics is right on the money. The
image-warping technique does not show the differential blur as in the vertical case, but this time
shows some residual magnification error (mostly near the outer edges of the image). We are not sure
about the exact cause of this directionally selective behavior, but have found a similar difference
in the blur of uncorrected images.

Given that we do not have radiometric calibration
information, we use a simple heuristic
to compute approximate information. We know
that the calibration target was to be black and
white. We computed a set of reference values
for every 10 x 10 window in the input. These
reference values are obtained by considering
only pixels outside the mask described above
but within a window of 60x80 pixels for the
CCTV (25 x 60 for the Photometrics) centered
around the current point. Within this window
we average all those above 130 to get
a “white” level and all those below 100 to get
a “black” level. This was done separately for
each color band. Using these reference values,
we define five error measures. The first, which
we call Gray-line error, is the average distance
from an RGB triple to the line defined by the
reference values. This error measure relates to
the color shift of a pixel. The remaining measures
were meant to be more sensitive to blur
within the image. Note that the Gray-line error
could be made small by overly blurring the image,
because everything would approach gray!
The second quantitative error measure, which
we call BW-RGB error, is defined as the mean
pointwise distance from each RGB triple to the
nearer of the reference triples. Since the underlying
image was supposed to be black and white,
this measure should be small. If the images are
locally blurred this measure will grow. If there
is excessive blurring, the heuristic calibration
process will cause this measure to become too
small. The remaining measures were meant to
be sensitive to blur separately in each color band.
The third measure, which we call BW-R error,
is the distance between the R value of a pixel
and the closer of the reference values for R. The
measures BW-G error and BW-B error are defined
similarly.
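The five error measures defined above can be sketched per pixel as follows. This is a minimal illustration under stated assumptions: Euclidean distance is used throughout, the reference levels are taken from a flat list of window pixel values, and no attempt is made to reproduce whatever scaling underlies the tabulated numbers, so outputs are not directly comparable with Table 1. All function names are hypothetical.

```python
import math


def reference_levels(values, white_above=130, black_below=100):
    """Mean of the dark and of the bright pixels in a window: (black, white)."""
    blacks = [v for v in values if v < black_below]
    whites = [v for v in values if v > white_above]
    return sum(blacks) / len(blacks), sum(whites) / len(whites)


def distance(p, q):
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(p, q)))


def gray_line_distance(rgb, black, white):
    """Distance from an RGB triple to the line through black and white."""
    d = [w - b for w, b in zip(white, black)]
    p = [c - b for c, b in zip(rgb, black)]
    t = sum(x * y for x, y in zip(p, d)) / sum(x * x for x in d)
    return math.sqrt(sum((pi - t * di) ** 2 for pi, di in zip(p, d)))


def bw_rgb_distance(rgb, black, white):
    """Distance from an RGB triple to the nearer reference triple."""
    return min(distance(rgb, black), distance(rgb, white))


def bw_band_distance(value, black_level, white_level):
    """Distance from one band's value to the nearer reference level."""
    return min(abs(value - black_level), abs(value - white_level))


def mean_error(per_pixel_errors):
    return sum(per_pixel_errors) / len(per_pixel_errors)


black, white = (10, 10, 10), (200, 200, 200)
pixels = [(100, 100, 100), (10, 10, 13)]
gray_line_error = mean_error([gray_line_distance(p, black, white) for p in pixels])
bw_rgb_error = mean_error([bw_rgb_distance(p, black, white) for p in pixels])
bw_r_error = mean_error([bw_band_distance(p[0], black[0], white[0]) for p in pixels])
```

The mid-gray pixel in this example has zero gray-line distance but a large BW-RGB distance, which is the behavior that motivates reporting the blur-sensitive measures alongside the Gray-line error.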

Under  ideal  imaging,  each  of  the  error  mea- 


Figure 6: 2D histograms showing dispersion in color space using the big window and the red channel.
Here we see the 4 different image warping results using different image reconstruction techniques.
The upper left is the quadratic-box using cubic-convolution based edge values with a value of
A = -1. This results in a correction which is barely different from using linear interpolation to get
the edges. The upper right shows the results using just linear interpolation for image reconstruction.
While it still does well, it is not as tight a cluster as the quadratic-box restoration approach. In the
lower right we see the results of applying the integrating resampler using cubic-convolution with
A = -1. While it may be a superior image reconstruction filter, it has significant problems in this
application. We also tried cubic-convolution as it was originally intended (i.e. doing point sampling
reconstruction), and the results were even worse. On the lower left we show the result of another
restoration filter. This one is also a quadratic spline, but the point spread function used to define
it was a 4 piece cubic approximation to a Gaussian.


sures  would  be  zero,  and  for  non-ideal  imag¬ 
ing  smaller  error  measures  are  better.  Unfortu¬ 
nately,  interpretation  of  these  measures  is  com¬ 
plicated by the fact that they do not have intuitive
units of measure, so it's not clear how important
a difference of 10 units is in the gray-line
error.  Additionally,  because  the  images  have 
noise  and  also  because  our  approximated  cali¬ 
bration  data  is  not  perfect,  the  error  measures 
have  an  offset  so  that  it  is  unlikely  they  would 


ever  attain  the  value  zero.  When  we  report  the 
quantitative  results,  we  will  also  report  the  error 
measures  in  the  area  outside  the  aforementioned 
edge  mask.  In  these  regions  the  image  should 
have  little  chromatic  aberration,  and  the  resid¬ 
ual  error  measure  gives  some  indication  of  the 
base level of each error measure.

3.2  Analysis  of  results 

The  analysis  is  mostly  the  data  itself,  pre¬ 
sented  in  figures  3-8  and  tables  1-3.  The  cap- 
Figure 7: Here we see some examples using a Photometrics camera and a Fujinon lens. The figure
shows 2D histograms of color space, with blue vs green on the left, and red vs blue on the right.
The lens was supposed to have been corrected for chromatic aberration. Obviously it was not
completely corrected. Note that these plots are on a different scale (black = 100) than the CCTV
examples (black = 1000).


Figure 8: Here we see the corrected versions of the Photometrics example. Green is on top, red
on the bottom. On the left is the active optics approach. On the right is the image warping
approach. Qualitatively, we did comparably well; for a quantitative comparison see table 3. While
the active optics approach reduced the RMS error for the red channel, it did increase the size of
the envelope in color space.
Algorithm                            Gray-line    BW-RGB   BW-R         BW-G     BW-B
                                     error        error    error        error    error

Uncorrected                          [illegible]  0.363    [illegible]  0.261    0.260

Active Optics                        [illegible]  0.353    [illegible]  0.257    0.260

Quadratic-box Restoration, edges     411.030      0.364    0.301        0.261    0.260
from linear interpolation

Quadratic-box Restoration, edges     412.021      0.363    0.301        0.261    0.260
from Cubic Convolution, A=-1

Quadratic-gauss restoration, edges   411.744      0.364    0.301        0.261    0.260
from Cubic Convolution, A=-1

Bi-linear Interpolation              453.921      0.366    0.303        0.264    0.260

Cubic Convolution, A=0               581.295      0.367    0.305        0.266    0.260

Cubic Convolution, A=-1              585.868      0.363    0.300        0.260    0.260

Table 2: Table of error for different reconstruction algorithms applied to CCTV example. There
is a slight sharpening when using A = -1, but because the warp is rather small, the sharpening
is not very significant. Note that the difference between the new reconstruction methods and
linear interpolation is about the same as the difference between active optics and the new methods.
Finally, cubic convolution seems worse than the uncorrected image, although the qualitative results
looked like some improvement. We are still investigating this behavior.


Algorithm       Region               Gray-line   BW-RGB   BW-R         BW-G     BW-B
                                     error       error    error        error    error

Uncorrected     [outside mask]       306.499     0.154    0.113        0.119    [illegible]

Uncorrected     [15 323] [15 323]    694.093     0.391    [illegible]  0.288    0.301

Active Optics   [15 323] [15 323]    456.365     0.392    0.301        0.291    0.301

Image Warping   [15 323] [15 323]    562.715     0.399    0.304        0.305    0.301

Uncorrected     [illegible]          805.603     0.969    0.731        0.670    [illegible]

Active Optics   [15 65] [15 65]      546.352     0.968    0.709        0.694    [illegible]

Image Warping   [15 65] [15 65]      567.845     1.027    0.774        0.775    0.791

Table 3: Table of error for Photometrics example


tions  in  the  figures  are  meant  to  be  almost  self 
contained,  and  present  most  of  the  commen¬ 
tary.  The  2D  histograms  are  meant  to  pro¬ 
vide  a  qualitative  measure  of  performance,  and 
clearly show the largest errors. The tables provide
a more quantitative comparison, with an unweighted
sum-of-distances type mixing of large
and small errors. The references to the active
optics approach refer to [Willson and Shafer,
91b], who were kind enough to supply us with
their data. References to quadratic-box restoration
refer to the method described in section 2.1,
[Boult and Wolberg, 91] and [Wolberg and Boult,
91], with edge values determined with linear interpolation.
Details on cubic convolution, which
is used in figure 6 and table 2, can be found in
the 2 previous references as well as [Rifman and
McKinnon, 74] and [Park and Schowengerdt,
83]. We note that the figures use cubic convolution
in the integrating resampling approach. We


also  tested  cubic  convolution  using  point  sam¬ 
pling,  and  its  performance  was  slightly  worse. 

3.2.1  CCTV  camera  based  examples 

These images were collected at CMU, and
used in [Willson and Shafer, 91b]. They used
a General Imaging camera connected to a Matrox
frame grabber. The lens was a Cosmicar motorized
zoom lens (12.5-75mm), with a minimum
focus distance of 1.2m. For imaging they used a
Corion IR block + filters for R, G and B.*** The
1/2” checkerboard was imaged at a distance
of 1.5m. The original blue channel and initial
color histograms were shown in Fig. 2.

3.3  Improvement  to  the  image  warping 
technique 

If  the  original  images  had  been  taken  at  a 
point  of  maximal  focus  for  yellow  (or  “white”), 

***They used Wratten filters: #25 + 0.9ND for red, #58 +
0.6ND for green, and #47B for blue.
then the results of the image warping techniques
would probably have been even better. A significant
part of our error is due to the differential
blur between wavelengths, which is not
corrected by warping. In the CCTV case this
difference should be maximal when focused on
an extremal wavelength (e.g. blue), as was the
case here. Note that focusing on yellow or an
unfiltered image should reduce the error in the
uncorrected images as well, but should not affect
the results of the active optics approach.

The  image  warping  technique  would  also  likely 
benefit  from  a  denser  set  of  feature  points,  espe¬ 
cially  near  the  edges  where  the  lens  distortions 
are changing most quickly. Future work will examine
the calibration techniques to correct both
for  chromatic  aberration  and  radial/tangential 
lens  distortion.  Further  we  will  be  looking  into 
determining  a  functional  form,  parameterized 
by  zoom,  focus  and  aperture  settings,  for  the 
correction  functions. 

As  discussed  below,  one  of  the  main  advan¬ 
tages  of  the  active  optics  approach  is  the  ability 
to  correct  for  axial  aberrations,  i.e.  wavelength 
dependent  blur.  An  interesting  approach  would 
be  to  use  active  focus  for  each  channel,  then 
use  image  warping  on  the  results  to  correct  for 
lateral  chromatic  aberration.  This  should  pro¬ 
vide  a  good  tradeoff  between  fidelity  and  cost  of 
implementation  since  it  would  require  only  ac¬ 
tive  focus  control.  In  fact,  the  focus  differences 
might  be  implemented  directly  in  an  RGB  sen¬ 
sor,  e.g.  by  beam  splitting  and  imaging  each 
separately  with  a  slightly  different  focal  length. 

4  Critical  Comparison 

We now critically compare the two techniques,
in a relative fashion. We show a + when the
image warping technique has the advantage, a
- when the active optics approach has the advantage,
and a ± when the advantage might
depend on the application.

+ Warping can be applied to images taken with
an “RGB” camera where each frame is collected
simultaneously. Thus it has the potential
for use in color sequences.

+ Warping does not require specialized equipment.
While the use of motorized zoom/focus
is growing, the ability to precisely translate
the camera between images is less common.*

+ Warping can handle significant chromatically
varying geometric distortions even if no focus/zoom
could correct for them (such as
higher order lens distortions).

+ The warping approach holds the potential to
correct for other geometric and radiometric
aberrations at the same time it corrects for
chromatic aberrations. This, however, would
need more calibration information.

± The warping approach can be successfully
applied to lenses that have undergone “chromatic
aberration correction”. Because of the
complex nature of the chromatic aberrations
on such lenses, image warping may, depending
on your error criterion, do better than
active optics.

± Because image warping, as described here, is
local in nature, errors in the localization of
calibration features will have a local effect.
Such errors, however, will not be mitigated,
except in spatial extent, by the number of
features. If we used a global model for the
distortion then we might offset feature localization
errors by overconstraining the model.

- For simple (uncorrected) lenses, the active
optics approach yields better overall results
since it can refocus the channels. We cannot
say quantitatively how much active optics
gains, since the image warping results should
improve if the images are taken focused with
yellow or unfiltered light.

- The active optics approach does not need the
same amount of calibration information for
operation on a number of focus planes. It
would only need to compute the change in zoom
and the shifts for each different depth. The current
image warping would need a full cubic-spline
mesh for each depth, though a more
global model might be developed.

5  Conclusions  and  future  work 

This  paper  demonstrated  the  idea  of  image 
warping  for  the  correction  of  chromatic  aberra¬ 
tion  using  images  from  two  different  camera  / 

*  We  were  not  able  to  determine  the  importance  of  this  last 
stage  of  the  CMU  active  optics  approach. 
lenses. The method compared reasonably well,
both qualitatively and quantitatively, to the CMU
active optics approach [Willson and Shafer, 91b].
The proposed warping methods used recently
developed image reconstruction/restoration methods
[Boult and Wolberg, 91], which were shown
to outperform other techniques.
Abstract

Algorithms  to  detect  pairs  of  edges  which  could 
be  ends  of  a  straight  homogeneous  generalized 
cylinder  (SHGC)  are  proposed.  Geometrical 
constraints  for  the  ends  of  SHGC  are  utilized 
to  group  edgels  and  edge  segments  in  a  com¬ 
plex  image.  Based  on  the  analysis  of  compu¬ 
tational  complexity,  it  is  expected  that  certain 
restrictions  to  object  shapes  or  a  priori  infor¬ 
mation are needed to avoid enormous computation.
Two methods are investigated. The first
algorithm is for a subset of SHGCs where scaling
factors of the cross-section at the ends of
an  SHGC  are  the  same.  The  second  algorithm 
is  for  any  SHGC.  However,  a  modified  version 
is  implemented  to  reduce  computation;  given  a 
reference  end  edge,  it  finds  the  edges  possibly 
paired  with  it.  Several  examples  of  ends  ex¬ 
tracted  from  real  images  are  reported  to  show 
the  feasibility  and  limitation  of  the  algorithms. 

1  Introduction 

Although  shape  recovery  of  three-dimensional  forms 
from  a  two-dimensional  image  is  underconstrained,  it  can 
be  simplified  when  the  objects  belong  to  a  generic  ob¬ 
ject  class.  For  curved  objects,  a  popular  class  for  object 
representation is the class of generalized cylinders [Binford
71]. A generalized cylinder is the solid obtained by
sweeping a planar region, called its cross-section, along a
space curve, called its spine or its axis [Agin and Binford
73]. The class of straight homogeneous generalized
cylinders (SHGCs) is a subset of generalized cylinders
where the solids are obtained by an arbitrary scaling
transform of an arbitrary cross-section along a straight
axis [Shafer and Kanade 83] [Ponce et al. 89].

Effort  has  gone  into  extraction  of  the  axes  and  limbs. 
Binford  developed  an  algorithm,  called  projection,  and 
Nevatia  and  Binford  used  it  to  extract  the  axis  and  the 

pairs of the limb from the contours of object boundaries
[Nevatia and Binford 77]. Also, algorithms have been de¬
veloped  to  find  so-called  ribbons,  smoothed  local  symme¬ 
tries,  and  skewed  symmetries,  which  are  two-dimensional 
projections of some types of generalized cylinders [Brady
and Asada 84] [Brooks 81] [Ponce 90] [Sumanaweera et
al.  88].  These  algorithms  are  based  on  the  local  symme¬ 
try  of  projected  edges  /  limb  contours  and  the  continuity 
of  their  axes.  Ponce  et  al.  clarified  invariant  properties 
of  SHGCs  and  developed  algorithms  to  extract  the  axes 
of  SHGC  from  their  contours  [Ponce  et  al.  89]. 

In  this  paper,  we  report  results  from  algorithms  to 
detect  pairs  of  edges  which  could  be  from  the  ends  of 
an  SHGC  part.  Since  ends  /  cross-sections  and  limbs  / 
meridians  are  complementary  and  essential  information 
to  recover  the  three  dimensional  shape  of  an  SHGC  ob¬ 
ject,  either  finding  the  ends  or  finding  the  limbs  can  be 
the  initial  step  to  generate  partial  descriptions  of  objects 
in  an  image  [Rao  and  Nevatia  87].  Especially  for  indus¬ 
trial  objects  such  as  pipes  or  valves,  ends  are  often  more 
apparent  and  tend  to  give  more  useful  information  than 
limbs. 

It  is  often  difficult  to  recognize  objects  in  a  real  image 
because  (1)  the  image  contains  noise  edges  from  back¬ 
ground,  marks,  shadows,  and  specular  faces,  and  (2)  the 
object  itself  consists  of  many  parts,  and  the  ambiguity 
in  matching  a  large  number  of  edges  causes  combina¬ 
torial explosion. This work was initiated because little
work has been done on the detection of ends or cross-sections,
even though strong constraints exist for the ends
of SHGCs. Finding the ends of SHGCs raises the level of
description from edges to object parts and could make
hypothesis generation for objects feasible, since it significantly
reduces the number of descriptions by applying
the strong geometric constraints imposed on the ends of
SHGCs.

Preliminary results were presented at the IU Workshop,
September ’90. Other researchers show similar results
using B-Spline representation of edges in detecting
parallel symmetry [Saint-Marc and Medioni 90]. In
the recovery of three-dimensional shapes from their two-dimensional
projections, it is shown that an SHGC can
be recovered with a few degrees of freedom when its
cross-section and limbs are given. However, most of the
previous work shows experiments using computer-synthesized
images only [Gross and Boult 90] [Ulupinar
and Nevatia 90]. As mentioned there, detection and
computation of symmetries, which could be the limbs
and ends of SHGCs, is a difficult task in real images.

The  primary  problem  for  finding  the  ends  of  SHGCs 
Figure 1: Straight homogeneous generalized cylinders.


is  computational  complexity  for  testing  geometrical  con¬ 
straints.  To  combat  complexity,  we  introduce  restric¬ 
tions  on  object  shapes  or  a  priori  information.  The 
first  algorithm  is  for  cylindrical  ends  SHGCs,  a  subset 
of  SHGCs  where  the  scaling  factors  at  the  ends  are  the 
same. The second algorithm is for any SHGC. However,
a  modified  version  is  implemented  to  reduce  the  com¬ 
putation;  given  a  reference  end  edge,  it  finds  the  edges 
possibly  paired  with  it.  (It  is  extensible  to  the  no  a 
priori  information  case  at  the  cost  of  execution  time  by 
giving  it  every  edge  as  a  reference  edge.)  Except  for  the 
above  restrictions,  the  algorithms  are  designed  to  avoid 
any  implicit  limitations  to  applicable  object  shapes;  they 
permit  arbitrary  cross-section  shapes,  arbitrary  angles 
between  the  axis  and  the  cross-sections,  and  arbitrary 
sweeping  rules  in  the  second  algorithm.  Also  noise  edges 
from  background,  marks,  and  specular  faces  may  be  in¬ 
cluded  in  input  images.  The  rest  of  the  paper  is  orga¬ 
nized as follows:

In  section  2,  we  clarify  the  properties  of  the  ends  of 
an  SHGC  to  make  the  paper  self-contained,  though  more 
precise  definitions  and  proofs  can  be  found  in  the  pre¬ 
vious  work  mentioned  above.  In  section  3,  we  review 
the  previously  reported  algorithms  from  the  viewpoint 
of  computational  complexity.  In  section  4,  we  present  an 
algorithm  to  find  the  pairs  of  edge  segments  which  could 
be  from  two  equally  scaled  ends  of  an  SHGC.  It  shows 
substantial ability to group edges for an SHGC part. In
section  5,  we  propose  an  algorithm  which  can  detect  the 
ends  of  any  SHGC.  Several  experimental  results  of  the 
implemented  algorithms  for  various  objects  are  shown. 
In  this  paper,  all  image  data  used  in  the  experiments  are 
acquired  through  TV  cameras,  and  are  processed  by  an 
edge  detector  which  gives  sufficiently  good  measurement 
of  orientation  and  position,  and  links  continuous  edgels 
into  edge  segments. 

2  The  properties  of  the  ends  of  straight 
homogeneous  generalized  cylinders 

In  this  section,  we  clarify  the  properties  of  the  ends 
of  SHGC  to  make  this  paper  self-contained,  though 
they  can  be  found  in  some  previous  work  like  [Rao  and 
Medioni 88] [Shafer and Kanade 83]. In Ponce et al.’s
definition,  a  straight  homogeneous  generalized  cylinder 
is  the  solid  swept  by  a  planar  cross-section  as  it  is  trans¬ 
lated  and  scaled  along  a  straight  axis  (figure  1).  Cross- 
sections  are  arbitrary  shapes  and  are  not  necessarily  or¬ 
thogonal  to  the  axis.  The  scaling  is  arbitrary  and  is  gov¬ 
erned  by  a  function  along  the  axis,  called  its  sweeping 
rule.  In  this  paper,  we  assume  the  imaging  projection  is 
orthographic. 

According to [Ponce et al. 89], the contour of an
SHGC in the image, including the case of an oblique SHGC,
is given by:

$$\mathbf{x} = \rho(\theta)\, r(z)\, \mathbf{d} + z\, \mathbf{b},$$

$$\mathbf{b} = \sin\beta_0 \sin(\alpha_0 - \alpha)\, \mathbf{w} + [\cos\beta_0 \sin\beta - \cos\beta \sin\beta_0 \cos(\alpha_0 - \alpha)]\, \mathbf{u},$$

$$\mathbf{d} = \sin(\theta - \alpha)\, \mathbf{w} - \cos\beta \cos(\theta - \alpha)\, \mathbf{u},$$

where $\mathbf{x}$ is a position in the image; $(\mathbf{w}, \mathbf{u})$ is the unit basis
of the imaging plane; $\rho(\theta)$ is a cross-section curve; $r(z)$
is a sweeping rule; $(\alpha, \beta)$ is a viewing angle; and $(\alpha_0, \beta_0)$
is the orientation of the axis of the SHGC (figure 2). Thus,
the parallels or the cross-sections, the contours caused
by a constant z value, are observed as follows:

$$F_i(\theta) = a_i f_0(\theta) + t_i,$$

by letting $f_0(\theta)$ be $\rho(\theta)\mathbf{d}$, $a_i$ be $r(z_i)$, and $t_i$ be $z_i \mathbf{b}$,
where i indicates the i-th parallel. In the image, $a_i$ corresponds
to the scaling factor of the i-th parallel; $f_0$ is
the projected cross-section; $t_i$ corresponds to the parallel
transfer vector of the i-th parallel.

Figure 2: Coordinate system including oblique case.

In  this  paper,  we  use  the  term  end  pair  or  ends  for 
the  two  ends  of  an  SHGC,  i.e.,  the  first  and  last  parallel; 
co-meridian  edgel  pair  for  a  pair  of  edgels  in  the  ends 
which  lie  in  a  meridian  curve;  corresponding  co-meridian 
edgel  pairs  for  co-meridian  edgel  pairs  which  form  an 
end  pair;  origin  of  scaling  for  the  point  at  which  all  the 
extensions  of  the  lines  which  pass  through  each  edgel 
pair  in  corresponding  co-meridian  edgel  pairs  intersect; 
and  scaling  ratio  for  the  ratio  of  scaling  factors  for  a 
co-meridian  edgel  pair  (figure  3). 

(I)  A  pair  of  edgels  which  form  a  co-meridian  edgel 
pair  have  the  same  orientation. 

In terms of the first derivative of cross-section curves,

$$\frac{\partial w}{\partial\theta} = F'(\theta)\cos\theta - F(\theta)\sin\theta = a_i\,(f'(\theta)\cos\theta - f(\theta)\sin\theta),$$

$$\frac{\partial u}{\partial\theta} = F'(\theta)\sin\theta + F(\theta)\cos\theta = a_i\,(f'(\theta)\sin\theta + f(\theta)\cos\theta),$$

where $F(\theta) = |F_i(\theta)|$ and $f(\theta) = |f_0(\theta)|$. Thus, the orientations
of the tangents at the same $\theta$ points are the same regardless
of their scaling factors.

Figure 3: Origin of scaling and co-meridian edgel pair.

Figure 4: The ends are linear scaling from the origin of
scaling.

(II) The extension lines passing through each edgel
pair in corresponding co-meridian edgel pairs intersect
at a point, their origin of scaling.

Let the points in the extension line be $G_{ij}(\theta, l)$. Then,

$$G_{ij}(\theta, l) = \frac{F_i(\theta) - F_j(\theta)}{|F_i(\theta) - F_j(\theta)|}\, l + F_j(\theta)$$

$$= \left( \frac{a_i - a_j}{|F_i(\theta) - F_j(\theta)|}\, l + a_j \right) f_0(\theta) + \frac{t_i - t_j}{|F_i(\theta) - F_j(\theta)|}\, l + t_j,$$

where $l$ is the length of extension; $a_i, a_j$ are the scaling
factors of the i-th and j-th cross-sections; $t_i$ and $t_j$ are the
translation vectors of the i-th and j-th cross-sections. Thus,
we get the position of the origin of scaling, $t_0$, and the
length of the extension to it, $l_0$, regardless of $\theta$:

$$l_0 = -\frac{a_j}{a_i - a_j}\, |F_i(\theta) - F_j(\theta)|,$$

$$t_0 = G_{ij}(\theta, l_0) = -\frac{a_j}{a_i - a_j}\,(t_i - t_j) + t_j.$$

(III)  The  scaling  ratio  of  each  edgel  pair  in  correspond¬ 
ing  co-meridian  edgel  pairs  is  the  same  (figure  4). 

The vector from the origin of scaling to the i-th cross-section,
$V_{oi}(\theta)$, and that to the j-th cross-section, $V_{oj}(\theta)$,
are:

$$V_{oi}(\theta) = F_i(\theta) - t_0 = a_i \left( f_0(\theta) + \frac{t_j - t_i}{a_j - a_i} \right),$$

$$V_{oj}(\theta) = F_j(\theta) - t_0 = a_j \left( f_0(\theta) + \frac{t_j - t_i}{a_j - a_i} \right).$$

Thus, the cross-sections are linear scalings of the original
cross-section from the origin of scaling, and the ratios
of the corresponding lengths are the same at any $\theta$.

(IV)  The  vector  between  each  edgel  pair  in  correspond¬ 
ing  co-meridian  edgel  pairs  is  the  same  when  its  scaling 
ratio  is  equal  to  1. 

A vector between a co-meridian edgel pair, $V_{ij}(\theta)$, is
calculated as:

$$V_{ij}(\theta) = V_{oj}(\theta) - V_{oi}(\theta) = a_i\,(C - 1)\, f_0(\theta) + T_{ij},$$

where $C = a_j / a_i$; $T_{ij} = (t_j - t_i)$ is the translation vector
between the i-th and j-th cross-sections. Thus, the vectors
are the same regardless of $\theta$ when the value of C is 1.

(V)  The  curvature  at  a  point  in  an  end  is  proportional 
to  the  reciprocal  of  its  scaling  factor. 

Curvature in an end, $\kappa$, is calculated as:

$$\kappa = \frac{2F'(\theta)^2 + F(\theta)^2 - F''(\theta)F(\theta)}{\left(F'(\theta)^2 + F(\theta)^2\right)^{3/2}}.$$

Thus, the curvature is proportional to the reciprocal of
the scaling factor a. By using the scaling ratio $C = a_j/a_i = \kappa_i/\kappa_j$
and the positions of a corresponding pair,
the position of the origin of scaling can be calculated.
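
Properties (II) and (III) can be checked numerically with a short sketch. Everything here is illustrative: the cross-section f0 is a hypothetical curve, the function names are ours, and the origin of scaling is computed from the condition that the coefficient of f_0 vanishes, which gives t_0 = (a_i t_j - a_j t_i) / (a_i - a_j).

```python
import math

def f0(theta):
    """Hypothetical projected cross-section f_0(theta): a skewed three-lobed curve."""
    r = 1.0 + 0.3 * math.cos(3.0 * theta)
    return (r * math.cos(theta), 0.5 * r * math.sin(theta))

def parallel(a, t, theta):
    """Point F(theta) = a * f_0(theta) + t on a parallel with scale a and shift t."""
    x, y = f0(theta)
    return (a * x + t[0], a * y + t[1])

def origin_of_scaling(a_i, t_i, a_j, t_j):
    """Origin t_0 = (a_i * t_j - a_j * t_i) / (a_i - a_j); requires a_i != a_j."""
    d = a_i - a_j
    return ((a_i * t_j[0] - a_j * t_i[0]) / d,
            (a_i * t_j[1] - a_j * t_i[1]) / d)

def check_pair(a_i, t_i, a_j, t_j, theta):
    """Return (cross product, length ratio) of the vectors from t_0
    to the two edgels of a co-meridian pair at parameter theta."""
    t0 = origin_of_scaling(a_i, t_i, a_j, t_j)
    p_i = parallel(a_i, t_i, theta)
    p_j = parallel(a_j, t_j, theta)
    v_i = (p_i[0] - t0[0], p_i[1] - t0[1])
    v_j = (p_j[0] - t0[0], p_j[1] - t0[1])
    cross = v_i[0] * v_j[1] - v_i[1] * v_j[0]  # zero when collinear with t_0
    ratio = math.hypot(*v_j) / math.hypot(*v_i)  # equals a_j / a_i
    return cross, ratio
```

For any theta the cross product should vanish (the extension line passes through the origin of scaling) and the length ratio should equal a_j / a_i, independent of the cross-section shape.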

3  Computational  complexity  of  finding 
symmetries 

In this section, we overview the previous work on finding
symmetries. Though that work analyzes the limbs of generalized
cylinders, the algorithms used there are similar
to the ones shown in sections 4 and 5 because these types
of algorithms find pairs of edgels which satisfy certain
geometrical constraints. We do not refer to the precise
geometrical constraints here since they are different for
each type of symmetry, but consider the computational
complexity of the algorithms.

First, the difficulty of the problem can be classified
according to whether or not the verification of the constraints
requires only local measurements, such as position,
orientation, and curvature, of a pair of edgels. In
the former case, the simplest implementation of the algorithm
has two steps: generate every pair of edgels, and
then verify the geometrical constraints for each pair.
This algorithm’s complexity is O(n^2), where n is
the number of edgels, as shown in Brady’s algorithm for
finding smoothed local symmetries [Brady and Asada
84]. (They also showed another algorithm which fits
curves to the edges and finds symmetries analytically.)
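
The two-step brute-force scheme can be sketched as follows. This is a minimal illustration, not the algorithm of [Brady and Asada 84]: the local test used here (equal undirected orientation within a tolerance) is a stand-in for whatever geometrical constraint a given symmetry requires, and the edgel layout (x, y, orientation) and the function names are assumptions.

```python
import math
from itertools import combinations

def orientation_difference(a, b):
    """Smallest difference between two undirected orientations, in radians."""
    d = abs(a - b) % math.pi
    return min(d, math.pi - d)

def candidate_pairs(edgels, tol=math.radians(5.0)):
    """Generate every pair of edgels and keep those passing a local test.

    Each edgel is a tuple (x, y, orientation). All n(n-1)/2 pairs are
    visited, so the cost grows as O(n^2) in the number of edgels n.
    """
    pairs = []
    for e1, e2 in combinations(edgels, 2):
        if orientation_difference(e1[2], e2[2]) <= tol:
            pairs.append((e1, e2))
    return pairs
```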

Secondly, the computational complexity depends on
the complexity of the input data: a simple curve / edge,
such as a boundary of an object, or multiple curves /
edges. In the former case, the complexity of the algorithm
can be reduced to O(nk) by using the projection
technique [Nevatia and Binford 77] (where k is the number
of discretized orientations). A rough description of
the algorithm is: discretize the possible directions of the
correspondence between an edgel pair; for each of these
directions, project all edgels into buckets, and verify the
conditions for edgel pairs within the same bucket; lastly,
group the resulting pairs into symmetries. Ponce also
used this method for finding skewed symmetries [Ponce
90].
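
The projection step can be illustrated with a small sketch. It is not the implementation of [Nevatia and Binford 77]; the bucket layout, the parameter names and the `verify` callback are assumptions made for the example. Edgels whose perpendicular offsets fall in the same bucket lie approximately on a common line parallel to the projection direction, so only those edgels are paired.

```python
import math
from collections import defaultdict

def project_into_buckets(edgels, direction, bucket_width):
    """Bucket edgels (x, y, ...) by their signed offset perpendicular to
    `direction` (radians)."""
    nx, ny = -math.sin(direction), math.cos(direction)
    buckets = defaultdict(list)
    for e in edgels:
        offset = e[0] * nx + e[1] * ny
        buckets[math.floor(offset / bucket_width)].append(e)
    return buckets

def pairs_by_projection(edgels, k=8, bucket_width=1.0,
                        verify=lambda a, b: True):
    """For each of k discretized directions, pair edgels sharing a bucket
    and keep the pairs accepted by `verify` (a placeholder for the
    geometrical constraints)."""
    found = []
    for step in range(k):
        direction = math.pi * step / k
        buckets = project_into_buckets(edgels, direction, bucket_width)
        for members in buckets.values():
            for i in range(len(members)):
                for j in range(i + 1, len(members)):
                    if verify(members[i], members[j]):
                        found.append((direction, members[i], members[j]))
    return found
```

Each edgel is projected once per direction, which is where the O(nk) behaviour comes from when buckets stay small. One limitation of this simple quantization is that two edgels just either side of a bucket boundary are never compared.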

When multiple edges are given as input data, such
as an edge image extracted from an ordinary intensity
image, the simple implementation is still applicable
with a complexity of O(n^2). This can be reduced to O(nke)
by using projection, where e is the number of edges,
assuming that only one edgel is put into each bucket for
each edge, as mentioned in [Sumanaweera et al. 88].

When it is not possible to determine whether a pair of
edgels satisfies the constraints from their local measurements
only, in other words, when the constraints include
one or more free variables to be selected globally, a hypothesis
generation and verification process like the Hough
transform [Hough 62] is required. Then its complexity
becomes, for instance, O(n^2 d) in the one free variable
case, where d is the number of hypotheses for a pair.
Ponce et al. used this type of algorithm to find axes of
SHGCs from their limb contours [Ponce et al. 89]. (They
also implemented another algorithm which uses curvature
feature points instead of all the edgels.)
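
A hypothesis-and-vote process with one free variable can be sketched as below. This is an illustration of the O(n^2 d) pattern only, not the algorithm of [Ponce et al. 89]: the free variable is taken to be a candidate scaling ratio c, the origin is solved from p_j - t_0 = c (p_i - t_0), and the accumulator cell size and all names are assumptions. A practical version would first discard pairs whose orientations differ.

```python
from collections import Counter
from itertools import combinations

def origin_for_ratio(p_i, p_j, c):
    """Origin t_0 such that p_j - t_0 = c * (p_i - t_0); requires c != 1."""
    return ((p_j[0] - c * p_i[0]) / (1.0 - c),
            (p_j[1] - c * p_i[1]) / (1.0 - c))

def vote_origins(points, ratios, cell=1.0):
    """Accumulate votes over (ratio, quantized origin) hypotheses.

    Every pair of the n points is tried with each of the d candidate
    ratios, giving O(n^2 d) work.
    """
    votes = Counter()
    for p, q in combinations(points, 2):
        for c in ratios:
            ox, oy = origin_for_ratio(p, q, c)
            votes[(c, round(ox / cell), round(oy / cell))] += 1
    return votes
```

Pairs that belong to the same two ends agree on both the ratio and the origin, so they pile up in one accumulator cell, while unrelated pairs scatter.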

In  the  above  discussion,  we  concentrated  on  the 
grouping  of  the  edgel  pairs  and  neglected  necessary 
post-processings.  Most  of  the  algorithms  require  post¬ 
processings,  such  as  selection  of  a  group  in  hypotheses, 
restoration  of  the  pairs  of  edge  segments  from  the  pairs 
of  edgels.  Some  of  the  algorithms  in  the  previous  work, 
such  as  [Brady  and  Asada  84]  [Ponce  et  al.  89]  and 
[Saint-Marc  and  Medioni  90],  use  fitting  of  curves  to 
edges  or  selection  of  edgel  points  by  certain  geometrical 
features.  We  didn’t  examine  their  computational  com¬ 
plexity,  since  these  pre-processings  themselves  are  not 
simple  and  their  complexities  don’t  depend  on  the  num¬ 
ber  of  edgels. 

It is natural that the required computational complexity
depends on both the geometrical constraints between
the edge pair to be extracted and the number of possible
edge pairs in the input data. Though it is difficult in
general to estimate computing time for an algorithm, we
can guess it for similar types of algorithms which
verify the constraints on edgels. In [Ponce 90], the reported
computing time for an O(nk) algorithm is around
30 seconds. Since the number of edgels n increases with
the number of edges e, O(nke) algorithms for multiple
edges should take much longer. The reported computing
time for an O(nke) algorithm in [Sumanaweera et
al. 88] is about 1.5 hours. And in the previous work, the
authors presented alternate algorithms for the O(n^2) algorithms.
According to these facts and our experience
in implementing the algorithms proposed in this paper,
it is expected that for finding edgel pairs by verifying
the geometrical constraints, algorithms which require
more than O(n^2) computation are often impractical for
the computational power of current sequential computers,
and that O(nke) algorithms tend to take a fairly
long time. Since accurate measurement of the curvature
values may not be available in an edge image, the computational
complexity of finding the ends of SHGCs is
roughly O(nkdc), where c is the number of partially parallel
edges, and d is the number of hypotheses for a pair
of edgels. (In this paper, we assume that a lot of edges
are given as inputs to the algorithms and the constraints
for the ends of SHGCs leave one free variable without
curvature measurement. See sections 4 and 5.) This suggests
that certain restrictions may be required to avoid
enormous execution time. In the following sections, we
investigate two possibilities: one is the restriction of the
shapes, the other is given a priori information.

4  Simplified  algorithms  in  the  case  of 
cylindrical  ends 

The algorithm to find the ends of SHGCs can be simplified
significantly for cylindrical ends, which means that the
scaling factors at both ends are the same. For example,
the center and right objects in figure 1 have cylindrical
ends. Note that ‘cylindrical’ does not imply circular
cross-sections. For cylindrical ends, the origin of scaling
is infinitely distant, and the scaling ratio is 1. Every
pair of edgels that has the same origin of scaling makes
the same projection angle. Thus, we have stricter
constraints for the ends of cylindrical SHGCs, as follows:

(i)  A  pair  of  edgels  which  form  a  co-meridian  edgel  pair 
have  the  same  orientation. 

(ii)  The  vectors  between  each  edgel  pair  in  correspond¬ 
ing  co-meridian  edgel  pairs  have  the  same  orienta¬ 
tion. 

(iii)  The  vectors  between  each  edgel  pair  in  correspond¬ 
ing  co-meridian  edgel  pairs  have  the  same  length. 

By  using  projection,  we  can  get  the  algorithm  to  detect 
cylindrical  ends,  algorithm  1,  summarized  as  follows: 

Algorithm  1 

1.  discretize  orientation  for  projection  direction,  pre¬ 
pare  buckets  for  projection, 

2.  group  edgels  according  to  their  orientation  into 
orientation-groups, 

3.  for  each  projection  direction, 

(a)  for  each  orientation-group,  project  edgels  in  the 
group  into  buckets;  (The  result  is  projection- 
orientation-groups.) 

(b)  generate  all  pairs  of  edgels  in  each  projection- 
orientation-group; 

(c)  calculate  the  distance  between  each  edgel  pair, 
and  group  the  pairs  according  to  their  distances 
into  distance-groups, 

(d)  for  each  distance-group,  restore  the  edge-pairs 
which  correspond  to  the  edgel-pairs  in  the 
group. 
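The steps of algorithm 1 can be sketched as follows. This is a minimal illustration, not the implemented program: the edgel format `(x, y, theta, edge_id)`, the function name, the bin counts and widths, and the choice to skip pairs within a single edge are all assumptions made for the example.

```python
import math
from collections import Counter, defaultdict
from itertools import combinations


def find_parallel_edge_pairs(edgels, n_dirs=8, n_orient=8,
                             bucket_width=1.0, dist_width=1.0):
    """Vote for edge pairs that satisfy constraints (i)-(iii).

    edgels: iterable of (x, y, theta, edge_id), theta in radians.
    All names and default values are illustrative assumptions.
    Returns {(direction_index, distance_bin): Counter of edge pairs}.
    """
    # Step 2: group edgels by quantized orientation (constraint i).
    orientation_groups = defaultdict(list)
    for x, y, theta, edge_id in edgels:
        o_bin = int((theta % math.pi) / math.pi * n_orient) % n_orient
        orientation_groups[o_bin].append((x, y, edge_id))

    votes = defaultdict(Counter)
    # Step 3: loop over the discretized projection directions.
    for j in range(n_dirs):
        phi = math.pi * j / n_dirs
        c, s = math.cos(phi), math.sin(phi)
        for group in orientation_groups.values():
            # Step 3(a): edgels sharing a bucket lie on one line along phi,
            # so the vector between them has orientation phi (constraint ii).
            buckets = defaultdict(list)
            for x, y, edge_id in group:
                offset = -x * s + y * c   # position across the direction
                along = x * c + y * s     # position along the direction
                buckets[math.floor(offset / bucket_width)].append((along, edge_id))
            # Steps 3(b)-(c): pair the edgels in each bucket and bin the
            # pairs by distance (constraint iii).
            for members in buckets.values():
                for (a1, e1), (a2, e2) in combinations(members, 2):
                    if e1 == e2:
                        continue  # assumption: ignore pairs within one edge
                    d_bin = round(abs(a1 - a2) / dist_width)
                    votes[(j, d_bin)][tuple(sorted((e1, e2)))] += 1
    # Step 3(d): each Counter lists the edge pairs behind a distance-group.
    return votes
```

An edge pair whose vote count in one group is a large fraction of the number of edgels in the edges would then be kept as a candidate pair of cylindrical ends.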
The computational complexity of this algorithm is
roughly O(nkc), where c is the number of edgels
which have the same orientation in a bucket, i.e., the
number of partially parallel edges. More precisely, the
grouping by constraint (i) above, the orientation of an
edgel, requires O(n) work because it requires only the local
measurement of each edgel. The grouping by
constraints (ii) and (iii) requires O(nkc) because the average
number of edgels in a projection-orientation-group is
c, and that is the average number of edgel pairs per edgel
in a projection direction. Edgel-pairs in each group are
back-projected to edges, and edge-pairs are selected if a
large part of the original edgels are recovered.

Examples  of  the  input  image  to  the  implemented  pro¬ 
gram,  which  is  the  edges  detected  in  an  intensity  image, 
and  the  results  of  the  program  are  shown  in  figure  5 
and  figure  6.  (In  these  images,  some  edges  from  back¬ 
ground  are  removed  but  it  is  not  necessary  as  shown  in 
figure  9.)  It  takes  about  15  minutes  to  group  the  possible 
end-edgel-pairs.  In  the  figures,  sets  of  parallel  edge  seg¬ 
ments  are  shown  as  examples  of  the  program’s  results. 
In  figure  5(b),  the  algorithm  grouped  three  edge  seg¬ 
ments  and  recovered  most  of  the  ends  for  a  cylinder  part, 
though  they  have  considerable  gaps  among  them.  In 
figure  5(c),  the  algorithm  detects  many  parallel  curves. 
Although  we  don’t  have  techniques  to  group  such  repet¬ 
itive  structure,  a  clue  for  that  type  of  structure  can  be 
acquired.  In  figure  5(d),  the  algorithm  detects  parallel 
straight  lines.  Though  they  are  neither  ends  nor  cross- 
sections,  the  algorithm  should  detect  parallel  straight 
lines to cope with polyhedral objects. Simple curves like
ellipses and straight lines are shown in figure 5; however,
there is no restriction on the shape of the cross-section. In
the  figures,  we  find  that  the  algorithm  has  the  ability  to 
group  the  edges  into  the  sets  of  parallel  edges,  so  that 
the  results  of  the  algorithm  could  be  used  to  generate 
hypotheses  of  SHGC  parts  for  recognition. 

5  Finding  the  ends  of  SHGCs 

In this section, we propose an algorithm to detect the
ends of any SHGC. However, to avoid enormous computation,
the algorithm is modified to work with a given
reference end edge. It extends to the case with no a priori
information simply by giving it every edge as a reference
end edge. Also, it is easy to predict the performance
of the algorithm in the case with no a priori information,
because there is no difference in the input data or
grouping functions, and the implemented algorithm
returns a subset of the full set of results. The constraints for
the ends of an SHGC are as follows:

(i)  A  pair  of  edgels  which  form  a  co-meridian  edgel  pair 
have  the  same  orientation. 

(ii)  The  extensions  of  the  lines  passing  through  each 
edgel  pair  in  corresponding  co-meridian  edgel  pairs 
intersect  at  an  origin  of  scaling. 

(iii)  The  scaling  ratio  of  each  edgel  pair  in  corresponding 
co-meridian  edgel  pairs  is  the  same. 

(iv)  When  the  scaling  ratio  is  nearly  equal  to  1,  the  dis¬ 
tance  between  each  edgel  pair  in  corresponding  co¬ 
meridian  edgel  pairs  is  nearly  the  same. 


In order to deal with cylindrical ends, the fourth constraint
is used. However, to keep the explanation simple,
we do not refer to the fourth constraint in this section.
(See section 4 for the fourth constraint for the ends.)

Availability of curvature measurement affects the computational
complexity of the algorithm to detect the ends of
SHGCs because the position of the origin of scaling and
the scaling ratio can be calculated from the positions
and curvatures of a co-meridian edgel pair. If we could
measure curvature accurately, the simplest algorithm to
detect the ends would be summarized as follows: discretize
the position in two-dimensional space, and prepare
cells for origin of scaling; for each pair of edgels,
verify the constraint of their orientation, calculate the
scaling ratio $C_i$ and position of origin of scaling $(r_i, \phi_i)$,
then put the edgel-pair into the discretized cell indexed
by $(r_i, \phi_i, C_i)$; get the group of edgel-pairs in each cell.
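As an illustration of how the scaling ratio and origin could follow from positions and curvatures, the sketch below uses two standard properties of a scaling by ratio C about a point O: a point p1 maps to p2 = O + C (p1 - O), and curvature is divided by C. The formulas are derived here from that definition and are not quoted from the implemented program; the function names and the tolerance `eps` are assumptions.

```python
import math


def scaling_from_curvature(p1, k1, p2, k2, eps=1e-9):
    """Estimate (ratio, origin) of the scaling that maps edgel 1 to edgel 2.

    p1, p2: (x, y) positions; k1, k2: curvatures at those edgels.
    Returns origin None when the ratio is 1 (origin infinitely distant,
    i.e. the cylindrical case).
    """
    if abs(k2) < eps:
        raise ValueError("straight edgel: curvature gives no scaling ratio")
    ratio = k1 / k2                      # curvature scales by 1 / C
    if abs(1.0 - ratio) < eps:
        return ratio, None
    # Solve p2 = O + C (p1 - O) for O.
    ox = (p2[0] - ratio * p1[0]) / (1.0 - ratio)
    oy = (p2[1] - ratio * p1[1]) / (1.0 - ratio)
    return ratio, (ox, oy)


def to_polar(point):
    """Convert an origin of scaling to the polar form (r, phi)."""
    return math.hypot(point[0], point[1]), math.atan2(point[1], point[0])
```

Because the ratio is a quotient of two measured curvatures, small curvature errors move the estimated origin considerably when the ratio is near 1, which is consistent with the accuracy concern raised in the text.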

The complexity of this algorithm is O(n^2), excluding
the post-processing to determine the edge-pairs which
correspond to the edgel-pairs. This can be reduced to
O(nkc) by using projection, where n is the number of
edgels, k is the number of discretized directions, and c
is the number of partially parallel edges.

Two facts cause difficulties: (a) the current
measurement of curvature is not accurate enough
to classify the edgel-pairs by their scaling ratio or to determine
the position of the origin of scaling, and (b) the position of
an origin of scaling may be infinitely distant, so that
an infinite number of cells may be required unless a non-linear
discretization is used.

To solve the former difficulty, the algorithm shown
below hypothesizes origins of scaling in the direction of
projection. Like a Hough transform, this method requires
post-processing to determine the position of the origin
of scaling.

For the latter difficulty, we select a polar coordinate
system $(r, \phi)$ and use a non-linear scaling function to
roughly normalize the error. It is selected because the
error in the angle is roughly independent of the length of
the extension, except around the origin of coordinates,
and the error in radius is magnified in proportion to the
length of the extension, i.e., the distance from an edgel to
the origin of scaling. A scaling function $a/(r + b)$, where
$a$ and $b$ are constants, is selected because it can be
approximated by $a/r$, a reciprocal function, when $r \gg b$,
and by $a/b - ar/b^2$, a linear function, when
$r \ll b$ (figure 7). Here we propose an alternate algorithm,
summarized as follows (see figure 8 also):

Algorithm  2 

1.  discretize  orientation  for  projection  direction,  dis¬ 
cretize  two-dimensional  space  for  origin  of  scaling, 
and  prepare  buckets  for  projection  and  cells  for  ori¬ 
gin  of  scaling, 

2.  group  the  edgels  according  to  their  orientation,  (the 
results  are  called  orientation-groups,) 

3.  for  each  projection  direction,  and  for  each 
orientation-group, 

(a)  project  edgels  in  the  group  into  buckets, 
(the  results  are  called  projection-orientation- 
groups,) 
(b) calculate the extension lines in the projection
direction and determine the cells $(r, \phi)$ which
each extension line passes through,

(c) put the projection-orientation-group into the
cells,

4. for each cell,

(a) calculate the scaling ratio for each edgel pair in
each projection-orientation-group in the cell,

(b) group the edgel-pairs according to their scaling
ratios and position of the cell, (the results are
called scaling-origin-groups,)

5. for each scaling-origin-group, restore the edge-pairs
corresponding to the edgel-pairs in the group.

Figure 5: Examples of the extracted parallels by algorithm 1.

Figure 6: Other examples of the extracted parallels by algorithm 1.

Figure 7: Accumulator cells for origin of scalings in algorithm 2.

Figure 8: Projection and position of origin of scaling.
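The radial part of the cell indexing used in step 1 can be sketched as below. The scaling function a/(r + b) is the one described in the text; the constants `a` and `b`, the number of cells, and the way the scaled value is mapped to an integer index are assumptions chosen only for illustration.

```python
import math


def scaled_radius(r, a=1.0, b=100.0):
    """Non-linear scaling a / (r + b) of the radius r >= 0."""
    return a / (r + b)


def radial_cell(r, a=1.0, b=100.0, n_cells=32):
    """Map a radius in [0, inf] to one of a finite number of cells.

    The scaled radius lies in [0, a/b], so infinitely distant origins
    of scaling fall into cell 0 instead of needing unbounded storage.
    """
    frac = scaled_radius(r, a, b) / scaled_radius(0.0, a, b)  # in [0, 1]
    return min(int(frac * n_cells), n_cells - 1)


def far_approximation(r, a=1.0):
    """Reciprocal form a / r, valid when r is much larger than b."""
    return a / r


def near_approximation(r, a=1.0, b=100.0):
    """Linear form a/b - a r / b^2, valid when r is much smaller than b."""
    return a / b - a * r / (b * b)
```

With these illustrative constants, cells near the coordinate origin have roughly uniform radial width, while distant cells widen in proportion to the growth of the radial error.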

The computational complexity of the above algorithm
is roughly O(nkdc), where d is the number of hypothesized
origins of scaling for an edgel-pair. Unfortunately,
this algorithm still takes a very long time. For instance,
it would take a few days on a Symbolics 3600 to get the
results from the data shown in figure 9, so it is
not practical even for laboratory use. We reduce the computation
by giving a reference edge as an end of an SHGC and
pre-selecting the candidate positions of the origin of scaling.
The above algorithm is modified by replacing step 4
with the following:

4-I select candidates for the position of origin of scaling,

4-II for each candidate cell $(r, \phi)$,

(a) for each projection-orientation-group in the
cell, generate pairs of edgels in the group which
consist of an edgel in the reference edge and
another in the other edges, then calculate the
scaling ratio for each pair,

(b) group the edgel-pairs according
to their scaling ratios and position of the cell
into scaling-origin-groups,

Hereafter we call this modified algorithm algorithm 2'.
The precise description of the pre-selection of the candidates
in step 4-I is presented in appendix A. The
complexity of this algorithm is much less than O(nkdc):
step 2 requires O(n), step 3 requires O(nkd) at most,
step 4-I requires O(nkd), and step 4-II requires O(mpc),
where m is the number of edgels in the reference edge
and p is the number of candidates for origin of scaling.
Also, the selection of the edge-pairs, step 5, which is
another time-consuming process, needs execution time
roughly proportional to the number of edgel-pairs, mpc.

Figures 9 and 10 show examples of the results of algorithm
2'. The input edges are assumed to be divided at
the peaks of curvature. In figure 9, most of the end edges
are visible, so that detection of the ends is quite stable
even among the noise edges from background, marks, or
specular faces. In figure 10, the algorithm detects many false
positives for a straight edge, because all parallel straight
edges satisfy the constraints used in the algorithm; for a
curved edge, however, it yields few false positives. The
program takes about an hour with 30 candidates for origin
of scaling. Because of discretization error and the ranges
used for grouping, it may find false positives. However, the
experimental results show only a small number of false
positives except for parallel straight lines, since the geometrical
constraints derived from the definition of SHGC
are sufficiently strict to reject most false correspondences
when collected along an edge.

6  Conclusion 

In this paper, we proposed two algorithms to detect the
pairs of edge segments that could be from the ends of
an SHGC by using geometrical constraints derived from
the shape restriction of SHGCs. Algorithm 1 is for the
subset of SHGCs where the scalings at both ends are the
same. This restriction on the object shapes gives strict
constraints and reduces the computational complexity
of the algorithm. Algorithm 2 is for any SHGC and is
intended to detect all the pairs of edges which could
be from the ends of an SHGC. However, to avoid an
enormous computation time, a modified version which
works with a given reference end edge is implemented.
Both algorithms are implemented on a Symbolics 3600 and
detect the ends of SHGCs in
a reasonable time.

The algorithms employ parameters which determine
the range of feature values for a group. They depend
on the error in local measurements of position and orientation,
and on the deformation by perspective projection in
real images. If the deformation is significant, part
of the algorithms must be changed. Under perspective
projection, the edgels that form a co-meridian edgel pair
do not have the same orientation, but their tangent lines
intersect on a line. (See appendix B for the proof.) Thus,
the second step of the algorithms is replaced with the
following: project each edgel to a bucket, generate edgel-pairs
in a bucket, calculate the intersection of tangent lines, and
use a Hough transform to group the edgel-pairs. This algorithm
requires O(n^2 d) or O(nkde) computation. Unfortunately,
this algorithm may not be practical without
some restrictions and is not implemented. Instead, we
use loose parameters in the experiments. The loose parameters
absorb the measurement error and the deformation
to some extent, so that the algorithms tend to
detect all the end pairs along with some false positives. Even with
fairly loose parameters, the detected false positives are
few in the experiments. This fact shows that even
though the geometrical constraint at a single point is weak,
the constraints collected along an edge become strong
enough to exclude most false positives.

Accurate  measurement  in  the  edge  detection  may  sig¬ 
nificantly  reduce  the  computation  time  required  to  find 
the ends of SHGCs. If we had reliable curvature values,
the computational complexity of finding the pairs of
corresponding edges would be O(nkc), and detection of
the ends of any SHGC could be completed in a reasonable
time. Though it could be reduced further by using
curvature feature points or curve fitting [Saint-Marc and
Medioni 90], reliable and efficient pre-processing may
itself be another difficult problem.

Finding the limbs or meridian edges in real images becomes
more reliable and efficient when we have the
information from the end pairs. As shown in previous
work, it is possible to recover the 3D shape of SHGCs when
we have the cross-sections and the meridians or limbs,
together with other information, such as shading or an assumption
of skewed symmetries. Thus, recovering the 3D shape of
SHGCs from the edges in real images is a natural extension
of this work [Sato and Binford 92].

The segmentation or grouping ability of the algorithms
is especially useful for hypothesis generation in
model-based image recognition systems. Since the geometrical
constraints for the ends of an SHGC exclude
most  of  the  false  edge  pairs,  the  algorithms  detect  only 
a  limited  number  of  end  pairs  for  the  SHGC  parts  in  an 
object.  In  the  context  of  the  SUCCESSOR  model-based 
vision  system,  the  detected  ends  of  SHGCs,  combined 
with  other  visual  clues  such  as  skewed  symmetry  and 
ribbons,  would  be  used  to  hypothesize  the  partial  faces 
and  volumes,  which  are  further  matched  to  the  models 
represented  in  the  whole-part  graph  of  SHGCs. 
A Pre-selection of candidate positions
for origin of scaling

Though detection of the ends of an SHGC requires
grouping the edgel-pairs by their properties, we reduce
the computation by selecting candidate positions
for the origin of scaling. The pre-selection can be done
by applying a looser necessary condition, i.e., an edgel
in the reference end edge should have parallel edgels in
the same projection bucket. Thus, the following method
is used to select the candidates.

1. for each cell for origin of scaling,

   2. for each projection-orientation-group,

      3. accumulate the number of the groups containing
         edgels in the reference edge, Ne,

      4. accumulate the number of the groups containing
         both edgels in the reference edge
         and edgels in the other edges, Nc,

   5. calculate the ratio of Nc to Ne,

6. sort the ratio and return the cell positions which
   give high scores.
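The pre-selection steps above can be sketched as follows. The data layout is an assumption for the example: each cell maps to a list of projection-orientation-groups, and each group is represented simply by the edge identifiers of its edgels. The function name and the `top` parameter are likewise hypothetical.

```python
def preselect_origin_cells(cell_groups, reference_edge, top=30):
    """Rank cells for origin of scaling by the ratio Nc / Ne.

    cell_groups: {cell: [group, ...]}, each group a list of edge ids,
                 one id per edgel in the projection-orientation-group.
    reference_edge: id of the given reference end edge.
    Returns up to `top` (cell, score) tuples, best score first.
    """
    scored = []
    for cell, groups in cell_groups.items():
        n_e = 0  # groups containing edgels of the reference edge
        n_c = 0  # groups containing reference edgels and other edgels
        for group in groups:
            edge_ids = set(group)
            if reference_edge in edge_ids:
                n_e += 1
                if len(edge_ids) > 1:
                    n_c += 1
        if n_e:  # cells never reached by the reference edge are skipped
            scored.append((n_c / n_e, cell))
    scored.sort(key=lambda item: item[0], reverse=True)
    return [(cell, score) for score, cell in scored[:top]]
```

Only the returned cells would then be examined in step 4-II, which is where the saving over the full algorithm comes from.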

B  A  constraint  for  the  ends  of  SHGC 
in  perspective  projection 

Constraint: Under perspective projection, for a pair of
points in the ends which lie on a meridian, the tangents to
the contours at these points intersect on a line if they
intersect; otherwise they are parallel to the line.

Proof: Tangents to the contours at these points are
parallel in 3D and lie on a plane defined by the line between
the points and the tangent line. Parallel lines in
3D space intersect at a point in perspective projection
unless they are parallel in the imaging plane. Suppose
a 3D plane which contains the camera position and is
parallel to the plane including an end face. The intersection
of this plane and the extension of the side face makes a
virtual cross-section which is projected to a line in the imaging
plane. The tangent lines at the corresponding points
on a meridian, namely a pair of points in the ends and a point in the
virtual cross-section, intersect at a point in the imaging
plane, and the intersection is on the line formed by the
projection of the virtual cross-section, unless they are
parallel in the image plane.

Figure 9: Examples of the extracted ends of SHGC by algorithm 2'.
Abstract

An  important  aspect  of  successful  processing 
of  image  data  into  representations  for  higher 
level  analysis  involves  preserving  such  feature 
parameters  as  position,  location,  contrast  and 
blur,  while  substantially  compressing  the  data 
into a sparse or compact form. In spite of this,
most “edge finders” eliminate this information
almost immediately in the process of converting
the continuous image into a binary edge map.

An  alternative  representation  concept  is  pro¬ 
posed  which  preserves  these  measures  through 
a  simple  nonlinear  transform.  It  should  be  ap¬ 
plicable  to  numerous  vision  problem  domains 
such  as  stereo,  motion  and  recognition.  This 
representation,  called  the  Displacement  Repre¬ 
sentation,  can  then  be  utilized  using  simple  al¬ 
gorithms  to  extract  higher  level  representations 
such as stereo disparity, optical flow, and possibly
others. A stereo algorithm is described and
tested which demonstrates the utility of the approach
as well as the integrity of the resulting
disparity measures.

1  Introduction 

The goals of early vision algorithms are usually stated as
a) to process image data in such a way as to produce new
representations which are rich in feature-related information
useful for higher level tasks such as stereo depth,
motion, and object recognition, and b) to substantially
reduce signal bandwidth. Thus, if successful, the resulting
representation should be sparse without significant
loss of feature content, while noise and spurious features
would be, by and large, eliminated.

In  usual  practice,  however,  the  extraction  of  symbolic 
representations  at  the  earliest  stages  obviates  this  goal. 
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telligence  Laboratory  of  the  Massachusetts  Institute  of  Tech¬ 
Many edge finders, for instance, take intensity steps in
scenes as features using a variety of methods [1, 2, 3, 4, 5].
The resulting representation is usually a bit map where
a pixel is marked if, by some criterion, it is deemed
to be near some image feature. By using this encoding,
pixel level acuity is imposed on such parameters as location
and orientation while others, such as contrast and
focus, are simply lost. Subsequent operations such as
grouping or filtering cannot be expected to reverse this.
At the same time, these algorithms are often prone to
marking noise as readily as edges, given the small support
of the operators [9].

An alternative approach is found in linear correlation
models which preserve the continuous image representation
while using various linear operators to attack
problems  such  as  stereo  [8]  and  recognition  [10].  These 
methods,  despite  some  significant  drawbacks,  have  often 
produced  some  promising  results.  Ultimately,  however, 
symbolic  reasoning  must  be  involved  for  the  system  to 
be  of  any  utility  in  a  recognition  scheme.  Less  progress 
has  been  made  on  this  approach,  in  general,  than  the 
strongly  symbolic  approach  of  edge  extraction  and  anal¬ 
ysis.  This  is  perhaps  due  to  the  fact  that  these  methods 
tend  to  be  intolerant  of  such  common  imaging  charac¬ 
teristics  as  varying  scene  illumination  and  foreshortening 
with  rotation. 

An  alternative  approach  is  described  here,  whereby  a 
simple  nonlinear  transform  on  the  image  produces  a  rep¬ 
resentation  which  can  be  used  in  higher  level  vision  do¬ 
mains.  The  properties  of  edge  features  such  as  position, 
orientation,  contrast  and  blur  are  preserved  in  a  sparse 
representation  with  low  noise  characteristics.  I  call  it  the 
Displacement representation. It appears applicable in a
number  of  problem  domains  such  as  motion,  stereo,  and 
object  recognition.  This  paper  will  develop  the  stereo 
model  and  show  some  results  on  real  image  pairs. 

2  The  Displacement  Representation 

The  Displacement  representation  is  intended  to  be  a  con¬ 
tinuous  representation  which,  at  any  point  in  the  image, 
compactly  encodes  1)  the  distance  to  the  nearest  contrast 
edge,  2)  the  orientation  of  that  edge,  3)  the  contrast  of 
the edge, and 4) the sharpness of the focus. Should such
a representation be computable, such things as optical
flow and stereo disparity could be obtained by such simple
linear operations on the Displacement function as
temporal differentiation or spatial subtraction. The “Displacement
function”, as used here, is defined as that
part of the representation which encodes the feature distance.

For example, taking the 1D image $I(x)$ of Figure 1,
assume that an image intensity step exists at some location
$x_0$ in the image domain. The Displacement function
$d(x)$ would encode the distance of any point to $x_0$, i.e.
it would be proportional to $x - x_0$. The representation
would also encode contrast, blur and orientation, but
these are not shown here.

If  you  had  two  such  inputs,  say  one  from  the  left  eye 
and  one  from  the  right,  and  you  subtracted  their  re¬ 
spective  Displacement  functions,  the  resulting  function 
would  represent  the  stereo  disparity  in  the  neighborhood 
of the features (see Figure 2). If you took the temporal
derivative of the Displacement function, $d(x)$, the motion
of the image features, or optical flow, could be measured.
For this reason, finite differences or derivatives
of Displacement functions such as these are called Disparity
functions.

The 2D Displacement representation will take as its input
the grey-scale image which has been convolved with
the Laplacian of a Gaussian. This is a commonly used
model of the retinal center-surround response in biological
vision. Edges are defined as image events modelable
as isolated, piecewise linear boundary discontinuities
in intensity. The degree to which edges must be
approximated as linear or isolated depends on the width
of the convolving Gaussian.

In the following section, I will derive a version of the
Displacement representation appropriate for 1D image
slices. Orientation will not be encoded by this simplified
approach. Instead of the Laplacian of Gaussian convolved
input, a simple second derivative of a 1D Gaussian
is used.

The  general  2D  version  of  this  model  is  developed 
next.  This  is  then  applied  to  the  problem  of  stereo  cor¬ 
respondence  using  a  scale  space  scheme. 

2.1 The 1D Displacement Model

The Displacement model is predicated on image functions
being composed of superposed “edges”, or step intensity
features:

$$d_0(x) = x - x_0$$
$$I(x) = a\,u(d_0(x))$$

where $u(x)$ is the unit step function, $d_0(x)$ is the distance
from any point $x$ to the edge position $x_0$, and $a$
establishes both the sign and contrast of the edge feature.
Obviously, any arbitrary discrete image can be described
using such features, but for the purposes of this model,
features are assumed to be isolated at some scale. The
question of what constitutes “isolated” and what happens
when edges are not isolated will not be dealt with in any
greater depth here.

The processing starts with the image convolved with
a Gaussian $g(x)$ of some width and the second derivative
taken:

$$I''(x) = \frac{d^2}{dx^2}\left[g(x) * I(x)\right]$$

This is a 1D approximation to the retinal operation on
the image. A second representation, $I'(x)$, is calculated,
either through numerical integration of $I''(x)$, convolution
of the image with a $g'(x)$ operator, or simply taking
the derivative of the Gaussian smoothed image. In any
case, the resulting function for the isolated image feature
$I(x)$ is:


$$I'(x) = \int_{-\infty}^{x} I''(\xi)\,d\xi = a\,g(d_0(x)) \qquad (1)$$


The Displacement function is simply the ratio of these
two representations:

$$d(x) = \frac{I''(x)}{I'(x)} \propto d_0(x)$$

In other words, by taking the ratio of these two image
functions, we can arrive at an approximation to the
distance function $d_0$. This function $d$ is the 1D Displacement
representation. The denominator, $I'(x)$, which encodes
contrast, is discussed in the next section.
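The ratio above can be checked numerically on an ideal step. The sketch below is only an illustration under stated assumptions: the kernel radius of four standard deviations, the border handling, the `min_weight` threshold, and the function names are choices made for the example. For a Gaussian of standard deviation sigma and an ideal step, the raw ratio equals $-d_0(x)/\sigma^2$, so multiplying by $-\sigma^2$ (a normalization assumed here, not stated in the text) gives the distance in pixels.

```python
import math


def gaussian_derivative_response(signal, sigma, order):
    """Convolve a 1D signal with the 1st or 2nd derivative of a Gaussian."""
    radius = int(math.ceil(4 * sigma))        # assumed truncation radius
    norm = 1.0 / (sigma * math.sqrt(2.0 * math.pi))
    offsets = range(-radius, radius + 1)
    kernel = []
    for k in offsets:
        g = norm * math.exp(-k * k / (2.0 * sigma ** 2))
        if order == 1:
            kernel.append(-k / sigma ** 2 * g)                          # g'(k)
        else:
            kernel.append((k * k / sigma ** 4 - 1.0 / sigma ** 2) * g)  # g''(k)
    n = len(signal)
    out = []
    for i in range(n):
        acc = 0.0
        for k, w in zip(offsets, kernel):
            acc += w * signal[min(max(i - k, 0), n - 1)]  # replicate borders
        out.append(acc)
    return out


def displacement_1d(signal, sigma, min_weight=1e-6):
    """Return (ratio, weight): ratio[i] = I''/I' and weight[i] = |I'|.

    ratio[i] is None where the denominator is too small to be trusted.
    """
    first = gaussian_derivative_response(signal, sigma, 1)    # I'(x)
    second = gaussian_derivative_response(signal, sigma, 2)   # I''(x)
    ratio = [s / f if abs(f) > min_weight else None
             for f, s in zip(first, second)]
    weight = [abs(f) for f in first]
    return ratio, weight


# A step of contrast 2 lying between samples 49 and 50.
step = [0.0] * 50 + [2.0] * 50
ratio, weight = displacement_1d(step, sigma=4.0)
distance_at_52 = -4.0 ** 2 * ratio[52]   # close to 52 - 49.5 = 2.5
```

Note that the contrast `a` cancels in the ratio, while `weight` retains it, which matches the roles the text assigns to the two functions.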


2.2 1D Weighting Function

One issue is determining the reliability of the $d(x)$ estimate
when the image is corrupted by additive normal
noise. By inspection, the denominator of $d(x)$ is Gaussian
(see equation 1) and, as such, will approach zero as
$|d_0(x)|$ increases. When $|x - x_0|$ is large, then, we would
expect $d(x)$ to be a less reliable estimate of feature proximity
than when $x$ is close to the edge. The denominator
magnitude is an intuitively reasonable estimate of the
merit, or weight, of the $d(x)$ estimate. A recent study
has demonstrated the validity of this weighting function.

If one poses the problem of finding the optimal estimate
of $x_0$ in the least squares sense using the
Gaussian denominator $I'(x)$, it turns out that the solution
is the weighted average of $d(x)$ where the weight is
the denominator $I'(x)$ scaled by $g(x - x_0)$:

$$w(x) = I'(x)\,g(x - x_0) \qquad (2)$$

$$x_0 = \frac{\int w(x)\,[x - d(x)]\,dx}{\int w(x)\,dx} \qquad (3)$$


One way of effecting the $g(x - x_0)$ function is by convolving
the displacement and weighting functions with
the Gaussian $g(x)$. Although convolution introduces a
speed penalty, turning a process which is O(1) per
pixel into something like $O(\sigma)$, it renders the local
weighted estimate near the feature position $x_0$ optimal.
Since convolution is a linear operation, as are the operations
which derive Disparity from Displacement representations,
this convolution step can be deferred until
after the Disparity computation in the present model to
improve efficiency.

Figure 1: Basic 1D Displacement Function

Thus the two functions $d(x)$ and $I'(x)$ encode the distance
and contrast of proximate features in the 1D image.
Blur is encoded in the displacement representation
by its derivative: a sharp edge will result in a slope of
$\sigma^{-2}$, while blurring will produce a lesser slope. Orientation will
require the 2D model discussed later.

2.3 1D Stereo Disparity Model

As  mentioned  earlier,  linear  combinations  of  the  Dis¬ 
placement  function  yield  representations  which  should 
prove  useful  for  such  vision  tasks  as  stereo  disparity,  op¬ 
tical  flow,  and  object  recognition.  This  paper  focuses  on 
stereo  as  an  application  domain. 

In  stereo,  two  viewpoints  of  an  object  can  provide 
unambiguous  depth  information  only  if  the  correspon¬ 
dences  between  image  points  can  be  determined.  This 
correspondence  is  a  partial  function  mapping  some 
points  in  the  right  image  to  points  in  the  left.  The 
difference between these positions is the disparity, and it uniquely
determines the distance of the feature in front of or behind the
fixation point.

In  the  case  of  the  Displacement  model,  stereo  disparity 
is  determined  by  subtracting  the  left  and  right  Displace¬ 
ment  functions: 


$$
\begin{aligned}
D(x) &= d_r(x) - d_l(x) \\
     &= (x - x_r) - (x - x_l) \\
     &= x_l - x_r.
\end{aligned}
\tag{4}
$$

This  approach  imposes  the  constraint  that  the  Dis¬ 
placement  functions  for  both  eyes  at  all  x  points  in  the 
domain  represent  the  distance  to  corresponding  features. 
This  can  only  be  completely  true  when  the  disparity  is 
zero.  On  the  other  hand,  when  disparity  is  nonzero,  at 
some  scale,  the  features  are  proximate,  and  the  disparity 
measure  D  will  correspond  to  like  features  in  the  image 
and  thereby  give  the  correct  result.  This  is  the  con¬ 
cept  behind  the  scale  space  approach  used  in  the  stereo 
model.  Also,  unlike  features  will  usually  display  differ¬ 
ences  in  contrast,  contrast  sign,  focus,  and  orientation 
(in  2D).  All  of  these  can  be  tested  to  reject  false  corre¬ 
spondences. 

When matching feature Displacement functions are subtracted, the
resulting Disparity is constant in the neighborhood of the estimate
(see Figure 2). If the sharpness of the edges is mismatched, the slope
of the Disparity will be nonzero. The absolute magnitude of the slope
of the Disparity is also constrained to be less than unity between any
two features. Disparity slope, therefore, is a useful constraint on the
matching process [7].
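A small Python sketch of Equation 4 and of the slope behaviour just described. The idealised Displacement d(x) = slope * (x - feature position) is an assumption made for illustration: a slope of 1 stands in for a sharp edge and a smaller slope for a blurred one. The feature positions 6.0 and 4.5 are hypothetical.

```python
def displacement(xs, feature_pos, slope=1.0):
    """Idealised 1D Displacement d(x) = slope * (x - feature_pos)."""
    return [slope * (x - feature_pos) for x in xs]

def disparity(d_right, d_left):
    """Eq. (4): D(x) = d_r(x) - d_l(x)."""
    return [r - l for r, l in zip(d_right, d_left)]

def max_abs_slope(values, step=1.0):
    """Largest absolute finite-difference slope between neighbouring samples."""
    return max(abs(b - a) / step for a, b in zip(values, values[1:]))

xs = [float(i) for i in range(10)]
x_l, x_r = 6.0, 4.5   # hypothetical feature positions in the left and right images

# Equally sharp edges: the Disparity is the constant x_l - x_r.
matched = disparity(displacement(xs, x_r), displacement(xs, x_l))

# Right edge blurred (lesser slope): the Disparity acquires a nonzero slope.
blurred = disparity(displacement(xs, x_r, slope=0.6), displacement(xs, x_l))

print(matched[0], max_abs_slope(matched), max_abs_slope(blurred))
```

A threshold on `max_abs_slope` is one way such a mismatch could be flagged; the threshold itself would have to be chosen by the user.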

In regions of the image where little or no feature information exists,
such as areas of near constant intensity, little depth information can
be inferred without imposing (top down) constraints on the solution.
Marr and Poggio [6] argued that the image should exhibit "smoothness",
which has been exploited to handle such pathological cases. Any measure
of stereo disparity must account for these pathological situations. In
this model, I'(x) when convolved with a Gaussian is taken as the inverse
variance of the Displacement Function. Since Disparity is a linear
combination, the variance of the estimate will be the sum of the
individual input variances. This variance estimate is useful in
avoiding featureless regions in the domain.

Figure 2: Basic 1D Stereo Disparity Function

The  model,  as  described  so  far,  can  account  for  feature 
position,  contrast,  and  focus.  It  also  has  a  means  for 
rejecting  spurious  estimates.  The  full  2D  model  adds 
the  feature  orientation  measure. 

2.4  The  General  2D  Displacement  Model 

Happily, the 2D edge can be represented in such a way that the analysis
is still essentially 1D. First we redefine a primitive edge function as:

$$ d_0(x,y) = x \sin\theta + y \cos\theta - d_0 $$

$$ I(x,y) = \alpha\, u(d_0(x,y)) $$

Note that we have merely added the orientation of the edge θ to the
distance function d_0(x,y).

As with the 1D model, the image is convolved with the Laplacian of a
Gaussian (or a Gaussian convolved image is processed with a Laplacian
operator):

$$
\begin{aligned}
I''(x,y) &= \nabla^2 g(x,y) * I(x,y) \\
         &= \alpha\, g'(d_0(x,y)) \\
         &= -\alpha\, \frac{d_0(x,y)}{\sigma^2}\, g(d_0(x,y))
\end{aligned}
$$

Note we are still able to express this as a 1D Gaussian function of the
distance to the edge d_0(x,y).

Finally, instead of integrating this representation, as we did in the
1D model, we will take the gradient of the Gaussian convolved image. In
cortex, it is probable that the Gradient representation (or something
very much like it) is derived directly from the symmetric Laplacian-like
center surround representation found in the retina and the Lateral
Geniculate Nucleus. From a computational standpoint, however, this issue
is not central to this discourse. Calculation of the gradient directly
from the smoothed image is not a major computational challenge.

$$
\begin{aligned}
\mathbf{I}'(x,y) &= \nabla\big(g(x,y) * I(x,y)\big) \\
                 &= \alpha\, g(d_0(x,y))\,(\sin\theta\,\hat{\imath} + \cos\theta\,\hat{\jmath}) \\
                 &= \alpha\, g(d_0(x,y))\,\hat{\theta}
\end{aligned}
$$

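From these two representations, we can arrive at a Displacement
estimate of d_0(x,y) just as we did in the 1D case, except we can choose
between a vector and a scalar format:

$$ d(x,y) = -\sigma^2\, \frac{I''(x,y)}{\|\mathbf{I}'(x,y)\|} = d_0(x,y) $$

$$ \mathbf{d}(x,y) = -\sigma^2\, \frac{I''(x,y)\,\mathbf{I}'(x,y)}{\|\mathbf{I}'(x,y)\|^2} = d_0(x,y)\,\hat{\theta} $$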

The above vector and scalar representations of Displacement functions
can yield Disparity representations through subtraction or
differentiation, just as was done in the 1D case. With the vector form
d(x,y), the subtraction of two representations results in a Disparity
vector field where the magnitude is proportional to the shortest
distance between the two features, and the direction is normal to the
features (assuming aligned features).
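To make the 2D construction concrete, the following Python sketch evaluates the closed-form gradient and Laplacian responses of a Gaussian-smoothed ideal step edge and recovers the scalar and vector Displacements from them. It assumes the estimate has the form -σ² I'' / ||I'|| (scalar) and -σ² I'' I' / ||I'||² (vector), which is what the expressions for I'' and I' above imply; the edge parameters and the sample point are hypothetical.

```python
import math

def gaussian(d, sigma):
    """Unit-area 1D Gaussian evaluated at distance d."""
    return math.exp(-d * d / (2.0 * sigma * sigma)) / (sigma * math.sqrt(2.0 * math.pi))

def edge_responses(x, y, theta, offset, contrast, sigma):
    """Distance to the edge, gradient I'(x, y) and Laplacian I''(x, y) of a smoothed step edge."""
    dist = x * math.sin(theta) + y * math.cos(theta) - offset
    g = gaussian(dist, sigma)
    grad = (contrast * g * math.sin(theta), contrast * g * math.cos(theta))
    lap = -contrast * dist / (sigma * sigma) * g
    return dist, grad, lap

def displacement_2d(grad, lap, sigma):
    """Scalar and vector Displacement recovered from the two responses."""
    norm = math.hypot(grad[0], grad[1])
    scalar = -sigma * sigma * lap / norm
    vector = tuple(-sigma * sigma * lap * c / (norm * norm) for c in grad)
    return scalar, vector

theta, sigma = math.radians(30.0), 2.0
dist, grad, lap = edge_responses(3.0, 1.0, theta, offset=0.5, contrast=1.5, sigma=sigma)
scalar, vector = displacement_2d(grad, lap, sigma)
print(dist, scalar, vector)
```

For positive contrast the scalar form returns the signed distance to the edge, and the vector form returns that distance along the edge normal.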

This  form  of  measuring  Displacement  is  ideal  for 
matching  and  motion  in  the  2D  scene.  For  stereo,  how¬ 
ever,  we  are  interested  in  a  more  restricted  measure  of 
feature  positions  —  epipolar  Displacements. 

2.5 The 2D Stereo Displacement Model

A stereo image pair has the constraint that for any point p_l in one
image, there exists a one dimensional slice e_r of the other image which
must contain the corresponding point, assuming it exists. Conversely,
any point p_r along that corresponding line in the second image is
constrained to match a point along a line e_l in the first image through
the point p_l. This is a constraint imposed by the geometry of the
optical setup. These lines, called epipolars, also lie on a common plane
which passes through the imaged point as well as the focal points of the
cameras (see Figure 3).

Another aspect of the stereo configuration relates to the alignment of
edges to the epipolar lines. When features are normal to these basically
horizontal lines, the variance of the displacement measure along the
epipolar is as stated before: inversely proportional to the gradient
denominator I'(x). As features are rotated such that they are oblique to
the epipolars, the variance of the feature displacement estimate
increases to the point where, when features are aligned with the
epipolar, there is absolutely no way to measure the intersection point
of the feature and the epipolar.

The correct weight, or inverse variance, for an epipolar Displacement
can be calculated from the Displacement weight, I'(x,y), and the
epipolar direction vector at the image point, e:

$$
\begin{aligned}
w_e(x,y) &= \mathbf{I}'(x,y) \cdot \mathbf{e}(x,y) \\
         &= \alpha\, g(d_0(x,y))\,\big(\hat{\theta} \cdot \mathbf{e}(x,y)\big).
\end{aligned}
$$

Figure 3: Stereo Epipolar Constraint

Figure 4: Epipolar Displacement d_e

A  similar  modification  must  be  made  to  the  2D  Dis¬ 
placement  representation,  since  a  small  normal  displace¬ 
ment  d{x,  y)  will  result  in  large  displacements  along  the 
epipolar  when  the  feature  is  oblique  to  it  (See  Figure  4): 


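$$
\begin{aligned}
d_e(x,y) &= \frac{-\sigma^2\, I''(x,y)}{w_e(x,y)} \\
         &= \frac{\alpha\, d_0(x,y)\, g(d_0(x,y))}{w_e(x,y)} \\
         &= \frac{d_0(x,y)}{\hat{\theta}(x,y) \cdot \mathbf{e}(x,y)}
\end{aligned}
$$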


The epipolar Displacement, therefore, is similar to the 1D Displacement
in being a scalar function, but is based on the relationship between the
2D Displacement vector function d(x,y) and the epipolar field e(x,y).
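The effect of edge orientation on the epipolar weight and Displacement can be illustrated with a short Python sketch. It assumes the edge normal is (sin θ, cos θ), takes the gradient magnitude α g(d_0) as a given number, and returns no displacement when the edge runs along the epipolar. The function name, the tolerance `eps`, and the numeric values are hypothetical.

```python
import math

def epipolar_terms(d0, grad_mag, theta, epipolar_dir, eps=1e-9):
    """Return (w_e, d_e) for an edge with unit normal (sin theta, cos theta).

    d0 is the normal displacement, grad_mag stands for alpha * g(d0), and
    epipolar_dir is the unit epipolar direction e at the image point.
    """
    normal = (math.sin(theta), math.cos(theta))
    cos_angle = normal[0] * epipolar_dir[0] + normal[1] * epipolar_dir[1]
    if abs(cos_angle) < eps:
        return 0.0, None              # edge aligned with the epipolar: no estimate
    return grad_mag * cos_angle, d0 / cos_angle

horizontal = (1.0, 0.0)               # a "basically horizontal" epipolar

w_normal, d_normal = epipolar_terms(2.0, 0.8, math.radians(90.0), horizontal)
w_oblique, d_oblique = epipolar_terms(2.0, 0.8, math.radians(30.0), horizontal)
w_aligned, d_aligned = epipolar_terms(2.0, 0.8, 0.0, horizontal)
print(w_normal, d_normal, w_oblique, d_oblique, w_aligned, d_aligned)
```

As the edge turns away from the normal orientation the weight falls and the epipolar displacement grows, which matches the increase in variance described for oblique features.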

It  is,  perhaps,  useful  to  note  that  unlike  many  stereo 
algorithms,  this  approach  requires  no  restriction  on  the 
epipolar  configuration  or  other  imaging  alignment  as¬ 
sumptions. 


3  The  Scalespace  Stereo  Algorithm 

The above sections outlined the derivation of the stereo displacement,
i.e. the distance along any epipolar from any point on the epipolar to
the nearest feature. This is weighted by an inverse variance estimate
w_e which accounts for the uncertainty associated with oblique feature
alignments as well as low feature contrast. The blur and feature
position are encoded in the displacement estimate d_e. All that remains
is to combine them into a stereo Disparity estimator.

As  before,  the  basic  calculation  involves  simple  sub¬ 
traction  of  displacement  estimates  and  summation  of 
variance  estimates: 

$$ D'(x,y) = \frac{d_r(x_r,y_r) - d_l(x_l,y_l)}{2} + D(x,y) $$

$$ W(x,y) = \left( w_r^{-1}(x_r,y_r) + w_l^{-1}(x_l,y_l) \right)^{-1} $$

where (x_r, y_r) and (x_l, y_l) are positions along matched epipolars in
the stereo image pair (a calibration issue) displaced by any prior
Disparity estimate available, D(x,y). The choice of prior estimates of
disparity along the epipolars is where the scale space approach enters
in.
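A minimal Python sketch of this combination step at a single image point: half the difference of the two epipolar Displacements is added to the prior Disparity, and the weights are combined as inverse variances. Returning a zero weight when the two input weights do not share a sign is an assumption of this sketch (it also avoids a division by zero), and the function name and numbers are hypothetical.

```python
def combine(d_r, d_l, w_r, w_l, prior=0.0):
    """Updated disparity and combined weight from right/left displacement estimates."""
    new_disparity = (d_r - d_l) / 2.0 + prior
    if w_r * w_l <= 0.0:
        # Zero or opposite-signed weights: treat the sample as carrying no information.
        return new_disparity, 0.0
    # Variances (1 / w) add, so the combined weight is the inverse of their sum.
    weight = 1.0 / (1.0 / w_r + 1.0 / w_l)
    return new_disparity, weight

disp, weight = combine(d_r=1.0, d_l=-0.5, w_r=4.0, w_l=4.0, prior=3.0)
flat_disp, flat_weight = combine(d_r=1.0, d_l=-0.5, w_r=4.0, w_l=0.0)
print(disp, weight, flat_weight)
```

Two equally reliable inputs give a combined weight of half the individual weight, as expected when two equal variances are summed.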

As mentioned earlier, there is a need to avoid false correspondences in
any stereo algorithm. With this representation, however, we have
substantial information on hand to help avoid such mistakes. Given an
estimated Disparity, the algorithm looks up the positions along the
appropriate epipolars (x_r, y_r, x_l, and y_l) and calculates the
Disparity and the weight W above. Any or all of the following criteria
can be used to detect mismatches and, when one is found, to zero the
weight function W(x,y) over the offending intervals:

•  If the contrast signs are mismatched (the signs of the w_e's don't
match), or

•  a real edge is being matched to an illusory edge¹, or

•  the blurs are severely mismatched (resulting in a sloped D function
near the feature), or

•  the edges are misaligned, or

•  the absolute Disparity or the Disparity slope is too large.
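Three of the criteria listed above (contrast sign, absolute Disparity, and Disparity slope) can be sketched in Python along a single epipolar as follows. The function name, the list-based representation, and the thresholds are hypothetical; the illusory-edge, blur, and alignment tests are not attempted here.

```python
def filter_matches(disparity, w_right, w_left, weight, max_disparity, max_slope=1.0):
    """Return a copy of `weight` with suspected mismatches zeroed."""
    out = list(weight)
    n = len(disparity)
    for i in range(n):
        sign_mismatch = w_right[i] * w_left[i] < 0.0       # contrast signs differ
        too_large = abs(disparity[i]) > max_disparity      # absolute Disparity limit
        if sign_mismatch or too_large:
            out[i] = 0.0
    for i in range(n - 1):
        # Disparity slope between neighbouring samples (unit spacing assumed).
        if abs(disparity[i + 1] - disparity[i]) > max_slope:
            out[i] = 0.0
            out[i + 1] = 0.0
    return out

disparity = [1.0, 1.2, 1.4, 1.5, 8.0, 8.1]
w_right = [1.0, 1.0, -1.0, 1.0, 1.0, 1.0]
w_left = [1.0, 1.0, 1.0, 1.0, 1.0, 1.0]
weight = [0.5] * 6

filtered = filter_matches(disparity, w_right, w_left, weight, max_disparity=6.0)
print(filtered)
```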

Once this filtering takes place, the functions W and D' are both
convolved with a Gaussian of width σ_i. This, in effect, generates the
optimal variance estimate near the feature locations, as discussed
earlier (see Equations 2 and 3). Also, this operation effectively
interpolates between features and "fills in" where mismatches were
detected and their erroneous disparities rejected:

$$ D(x,y) = \frac{g * \big(W(x,y)\, D'(x,y)\big)}{g * W(x,y)} $$
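The "fill in" behaviour of this smoothing step can be illustrated with a 1D Python sketch. It assumes the smoothing takes the normalized, weighted-average form suggested by Equations 2 and 3, namely the Gaussian-smoothed product of weight and Disparity divided by the Gaussian-smoothed weight. The kernel radius of three widths and the zero padding at the ends are choices made for this sketch only.

```python
import math

def gaussian_kernel(sigma):
    """Normalised, symmetric Gaussian taps out to three widths."""
    radius = int(math.ceil(3.0 * sigma))
    taps = [math.exp(-k * k / (2.0 * sigma * sigma)) for k in range(-radius, radius + 1)]
    total = sum(taps)
    return [t / total for t in taps]

def convolve(signal, kernel):
    """Same-length convolution; samples outside the signal count as zero."""
    radius = len(kernel) // 2
    out = []
    for i in range(len(signal)):
        acc = 0.0
        for k, tap in enumerate(kernel):
            j = i + k - radius
            if 0 <= j < len(signal):
                acc += tap * signal[j]
        out.append(acc)
    return out

def smooth_disparity(disparity, weight, sigma):
    """Weighted smoothing: g * (W D) / (g * W), plus the smoothed weight."""
    kernel = gaussian_kernel(sigma)
    num = convolve([w * d for w, d in zip(weight, disparity)], kernel)
    den = convolve(weight, kernel)
    smoothed = [n / w if w > 0.0 else 0.0 for n, w in zip(num, den)]
    return smoothed, den

# The rejected sample (weight 0, bogus value 99) is filled in from its neighbours.
smoothed, smoothed_weight = smooth_disparity([2.0, 2.0, 99.0, 2.0, 2.0],
                                             [1.0, 1.0, 0.0, 1.0, 1.0], sigma=1.0)
print(smoothed)
```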

The smoothed Disparity, then, is fed to the next smaller scale
calculation. This determines the displacement along the epipolars for
the sampling points (x_r, y_r) and (x_l, y_l). At the largest scale the
initial Disparity estimate D(x,y) is assumed to be zero. The weighting
function W(x,y) is not used further in this model, although it almost
certainly could be incorporated in the scale space scheme.

¹ These illusory edges are referred to as Chevreul illusions in the
psychophysical literature and are characterized by inversely sloped
disparity functions.

Figure 5: Stereo Scale Space Algorithm (panel labels: Left Eye
Displacement, Disparity)

At the smallest scale, however, the weight function is sampled where
the "cyclopean" Displacement function has zero crossings. This function,

$$ C(x,y) = \frac{d_r(x_r,y_r) + d_l(x_l,y_l)}{2}, $$

is, in effect, the fused binocular Displacement function. The
Disparities D'(x,y) associated with significant weight values W(x,y) at
zero crossings in the cyclopean Displacement C(x,y) are precisely the
sparse stereo Disparity representation desired.
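The final sampling step can be sketched in Python for one epipolar: form the cyclopean Displacement, find its zero crossings, and keep the Disparity wherever the weight is large enough. The function names, the list representation, and the `min_weight` threshold are hypothetical.

```python
def cyclopean(d_right, d_left):
    """C = (d_r + d_l) / 2, sample by sample."""
    return [(r + l) / 2.0 for r, l in zip(d_right, d_left)]

def sparse_disparities(d_right, d_left, disparity, weight, min_weight):
    """Return (index, disparity) pairs at zero crossings of C with sufficient weight."""
    c = cyclopean(d_right, d_left)
    picks = []
    for i in range(len(c) - 1):
        crosses = c[i] == 0.0 or c[i] * c[i + 1] < 0.0
        if crosses and weight[i] >= min_weight:
            picks.append((i, disparity[i]))
    return picks

d_right = [-1.5, -0.5, 0.5, 1.5]
d_left = [-1.5, -0.5, 0.5, 1.5]
disparity = [0.9, 1.0, 1.1, 1.2]
weight = [0.1, 0.8, 0.8, 0.1]

kept = sparse_disparities(d_right, d_left, disparity, weight, min_weight=0.5)
dropped = sparse_disparities(d_right, d_left, disparity, weight, min_weight=0.9)
print(kept, dropped)
```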

4  Three  Experiments 

The algorithm described above was run on many real stereo pairs. Three
are presented here as representative of the algorithm's performance.
The first, the "Jet and Decoy", demonstrates some of the capabilities of
the approach in terms of acuity resolution. The second, the "Hallway",
contains, in contrast, extremely large absolute disparities and
disparity gradients. The last, "UBC", has a large amount of detail and,
owing to its poor calibration, noise. It is a good test of the
robustness of the approach in such situations.


It is important to note that many possible filters mentioned earlier,
such as blur matching, edge misalignment, and disparity slope, have not
been incorporated into these experiments, since I felt it would be more
interesting to see when such filters might be useful by allowing their
absence to result in pathological correspondences. I shall show one such
example below.

Also, it should be noted that, because of this hands-off experimental
design, there were very few "knobs" in the algorithm. In all instances,
no parameter was found either to be critical in its selection or to vary
observably in its optimal setting between experimental runs. As such,
basically arbitrary choices were made as to the number of scales and the
scale width spacing (σ steps of 2.0 were used), the absolute disparity
measure at any scale (±3.0σ_i), and the iterations at any scale (2). The
latter was useful at the largest scales where no prior on D(x,y) is
available.

4.1  The  Jet  and  Decoy 

This image pair was taken of a model of a jet and a paper cutout using
the MIT head/eye stereo camera system. The two images shown in Figure 6
are the left greyscale image and an image with only the sparse depth
estimates highlighted. Only the highest contrast features are shown. The
toy jet, in the lower left corner, has a disparity only 1-2 pixels
larger than that of the paper decoy.

In the depth map, the Disparity readings are encoded as white being a
Disparity of three pixels or more and black being two pixels or less.
In other words, the total dynamic range indicated for this picture is
one pixel. Note that the model jet has all of its Disparity measures as
three or more pixels, with the exceptions of the shadows it casts on the
ground and on its own tail, whereas the decoy is clearly darker. It is
even possible to discern the relative Disparity within the decoy, in
spite of this one pixel total dynamic range.

4.2  The  Hallway 

In the second example (Figure 7), a corridor provides an environment
which has very large absolute disparity as well as disparity gradients
approaching the limit of 1.0. The disparity range of -20 pixels (black)
to 10 pixels (white) is substantially larger than in the Jet and Decoy
example. In fact, an example of the disparity gradient limit being
exceeded can be seen in the middle of the picture, where the only
erroneous correspondence takes place. In one image, the reflection of
the overhead lamps is in the middle of the window and in the other
(shown) they abut the central door edge. In this instance, along with
the "bloom" associated with the reflected image, there is no like
contrast edge to match at the window edge, and the matcher (incorrectly)
matched it to the edge between the doors.

A simple hack could rule this out (the gradient exceeds 1.0 between
neighboring features) but I thought it more interesting to leave out any
gradient filters to test precisely when such a filter might be useful.
It is important to note that, despite the ample opportunity for false
correspondences of this sort in any of these images, it is only when the
geometry and optics of the setup created such an extreme violation of
the gradient limit that such a failure is found.

4.3  The  Campus 

Figure 8 is the depth map made from a stereo pair taken from an aerial
shot of the University of British Columbia campus. It is a substantially
denser depth map than the previous examples. The minimum weight W(x,y)
was chosen such that about 10,000 estimates are plotted.

This example is illustrative of the substantial detail achievable with
the algorithm without significant spurious feature content. The upper
left region is all labeled with lighter depth markings due to the
depressed elevation. The buildings in the central region are
distinguished by their high (dark) elevation markings. Note the small
rectangular place in the middle of the central building. The depression
in the disparity map approaches the theoretical disparity gradient limit
of 1.0.

An artificial elevation slope in the image is caused by a slight
rotation between the two images. This could be calibrated out using the
epipole tables, if desired.

5  Conclusion 

A very simple Displacement representation was derived from the
Laplacian and gradient of a Gaussian convolved image which provides, at
any image point, a measure of feature proximity, contrast, orientation,
and blur. Simple subtraction of the Displacement function renders a
Disparity measure which is useful for higher level processing.

Stereo differs from other vision modalities in that a constraint can be
imposed both on the matching space (the epipolar constraint) and on the
maximum gradient in the disparity map (1.0). The epipolar constraint can
be easily integrated into the Displacement representation. Other
constraints can be imposed to reduce false matches.

In experiments, however, the need for further constraints arises
infrequently. With the exception of one uncorrected violation of the
disparity gradient limit created by a conspiracy of the physical setup
and the optical behavior of the imager, as well as a few
miscorrespondences due to incorrect assumptions about the alignment
(calibration) of the images, no errors were found in these or other
experimental runs. Including all sources, the overall error rate never
exceeded 0.1% of the total feature count.

In addition to the foregoing work in stereo, continuing research
indicates that the Displacement model of early vision processing may
hold considerable promise in many other areas such as motion,
recognition, and even in the modeling of early cortical processing in
primates.
Figure 6: Jet and Decoy, Image and Depth Map

Figure 7: Corridor, Depth

Figure 8: UBC, Depth
Abstract

We examine the problem of computing volumetric shape from stereo. We
argue that intermediate 2½-D dense or wire-frame descriptions may not
always be obtainable from stereo, especially when there are curved
surfaces in the scene, and that 3-D volumetric descriptions of objects
may have to be derived directly from stereo correspondences. We then
present methods to recover volumetric shape using LSHGCs
and  SHGCs  as  the  shape  models,  based  on 
some  invariant  properties  in  their  monocular 
and  stereo  projections.  Experimental  results 
on  both  synthetic  and  real  images  of  objects 
with  curved  surfaces  are  given. 

1  Introduction 

One basic goal of vision is the capability to extract descriptions of
the shapes of objects in a given scene, either for storing in memory if
the objects are new, or for comparing with descriptions of known objects
to achieve recognition. Stereo is commonly used to recover 3-D
descriptions of a scene from multiple images. Yet features such as
corners and line segments extracted from images and possibly matched in
stereo are generally too local for indexing purposes during recognition.
The alternative is to use more global descriptions such as surface and
volumetric descriptions which have more discriminative power for
indexing. The question is, how can we recover such shape descriptions
from stereo images without reference to specific object models?

Traditionally, stereo has been thought of as a process merely to build
a 2½-D dense depth map of a scene, and separate from the process of
global shape description which comes later. Yet in cases where the
objects in the scene are not densely textured, most of the features that
can be extracted and matched in stereo are merely edges along surface
boundaries. Surface interpolation from such boundary conditions alone is
generally difficult. Some might argue that dense surface descriptions
may not be necessary; even wire-frame descriptions of surfaces are very
useful in many applications, and we can infer volumetric descriptions
from them if necessary. However, as we shall see in the following, even
depth measurements along surface boundaries may not always be directly
available from stereo.

*This research was supported by the Advanced Research Projects Agency
of the Department of Defense and was mon[…]ment is authorized to
reproduce and distribute reprints for governmental purposes
notwithstanding any copyright notation hereon.

It has long been known that the apparent boundaries of a curved surface
at its limbs in an image are viewpoint-dependent; the boundary edges in
the image are projected from different contours on the surface with
different angles of view. As a result, the apparent boundaries of a
curved surface in its stereo images may not actually correspond. This is
illustrated in figure 1. In the following we call those contours on the
surface which project to image contours contour generators, and the
image contours projected from the limbs of curved surfaces limb edges.


Figure  1:  Stereo  projections  of  a  curved  surface. 

This problem is unique to the use of multiple intensity images for
measuring depth, and is basically what makes deriving shape from stereo
different from, and more difficult than, deriving it from range images.
This limb problem only gets worse as the image resolution or the stereo
angle is increased in an attempt to get higher accuracy in the depth
estimates. Worse yet, if this issue about curved surfaces is not
addressed, the surface descriptions of the scene may mistake limb edges
for real edges.

In an earlier work [3], we proposed a stereo system which uses high
level structural descriptions for stereo correspondence. Hierarchical
descriptions up to surface level are computed from each image using a
perceptual grouping technique based on symmetries [6; 5], and such
descriptions in the two images are used for stereo correspondence. The
output of such a system is
therefore  not  merely  depth  estimates  along  edges  in  the 
scene,  but  also  segmented  surfaces.  The  system  is  also 
able  to  distinguish  limb  edges  from  real  edges  by  observ¬ 
ing  the  behavior  of  the  junctions  at  the  ends  of  the  edges. 
However,  it  did  not  recover  3-D  information  along  those 
limb  edges.  This  paper  describes  how  depth  information 
along  those  limb  edges  can  be  estimated  from  stereo  as 
a  by-product  of  the  shape  description  process. 

Since dense 2½-D depth measurements are not always available from
stereo correspondences, we propose to infer volumetric descriptions not
via intermediate dense depth measurements, but rather directly from
stereo correspondences. Moreover, as there is generally an infinite
number of volumetric shapes that can project to the same sparse scene
data, we need to make use of some shape models that are common and give
results that are consistent with human perception. We propose to use
Generalized Cones (GCs), first introduced by Binford [1], as the shape
primitives for the reconstruction. This paper concentrates on how Linear
Straight Homogeneous Generalized Cones (LSHGCs) and Straight Homogeneous
Generalized Cones (SHGCs) can be reconstructed from stereo images. The
key to the solution is the use of some invariant properties of LSHGCs
and SHGCs in both their
monocular  and  stereo  projections.  Our  technique  can  be 
summarized  as  follows:  first  hypothesize  LSHGCs  and 
SHGCs  from  the  images  using  some  invariant  properties 
of  their  2-D  projections;  then  establish  correspondences 
among  the  image  contour  points  across  the  stereo  images 
using  the  known  epipolar  geometry;  finally  reconstruct 
volumetric  descriptions  from  the  correspondences. 

We  first  outline  in  section  2  some  previous  work  on 
reconstructing  curved  surfaces  and  inferring  volumetric 
descriptions  from  stereo.  We  describe  how  LSHGCs  and 
SHGCs  can  be  reconstructed  from  stereo  and  present 
some  experimental  results  in  sections  3  and  4  respec¬ 
tively.  Finally  we  give  the  conclusion  in  section  5. 

2  Previous  Work 

Rao and Nevatia [9; 8] described a technique for deriving volumetric
descriptions of LSHGCs from stereo. Their approach employs a
hypothesis-and-verification strategy, using the following two properties
of LSHGCs originally given in [10]: (1) The contour generators of an
LSHGC are two straight line segments; (2) The contour generators of an
LSHGC are coplanar. The system takes as input possibly sparse and
imperfect 2½-D depth measurements from stereo. An LSHGC is either
hypothesized from its contour generators exhibiting the above properties
and verified by the cuts, which they call terminators, at the two ends
of the LSHGC, or the reverse.

The proposed methods, however, make the assumption that 2½-D depth
measurements along the contour generators are available, which is
clearly not the case along the limbs of curved surfaces as described
earlier.

Lim and Binford [4] were the first to explicitly address the problem
caused by limb edges in the process of reconstructing curved surfaces
from stereo. They did not indicate how limb edges can be identified, but
they proposed a method to reconstruct the curved surfaces from stereo.
The object in the scene is assumed to be composed of a number of
parallel cross-sections, each of which can be described by a conic
function which involves five parameters. The curved solid is first cut
into a number of slices
such  that  each  slice  is  in  an  epipolar  plane.  Within  each 
epipolar  plane,  the  four  lines  of  sight  can  be  recovered 
from  the  stereo  images,  and  the  cross-section  is  a  conic 
which  is  tangential  to  the  four  lines  of  sight.  This  leaves 
only  one  free  parameter  for  the  conic  in  each  epipolar 
plane.  Two  constraints,  the  extremum  constraint  and 
the  terminator  constraint,  have  been  proposed  and  ei¬ 
ther  of  them  can  be  used  to  determine  the  free  parame¬ 
ter.  The  extremum  constraint  chooses  the  most  compact 
shape  among  all  possible  conics  in  each  epipolar  plane. 
The  terminator  constraint  chooses  the  conic  which  has 
the  same  eccentricity  as  that  of  the  boundary  of  the  ter¬ 
minator  surface. 

There are still problems with Lim and Binford's method: (1) Epipolar
planes are parallel to each other
only  when  the  projection  geometry  is  orthographic. 
However,  the  limb  problem  is  significant  only  when  the 
curved  surface  is  at  close  range  where  orthographic  ap¬ 
proximation  is  not  a  good  one.  This  means  epipolar 
slices,  being  non-parallel  to  one  another,  generally  do 
not  possess  similar  characteristics  and  neither  termina¬ 
tor  constraint  nor  extremum  constraint  can  be  applied 
to  them  as  a  whole.  (2)  Epipolar  planes  are  in  gen¬ 
eral  not  parallel  to  the  terminator  surface  as  shown  in 
figure  2.  The  difference  in  the  orientations  strictly  de¬ 
pends  upon  the  orientation  of  the  solid  with  respect  to 
the  cameras.  This  implies  the  epipolar  slices  have  no 
direct  relationship  with  the  terminator  surface  and  thus 
terminator  constraint  is  generally  not  applicable.  On 
the  other  hand,  applying  extremum  constraint  will  give 
a  stack  of  conics  being  most  compact  only  in  the  ori¬ 
entations  of  the  epipolar  slices.  The  shapes  recovered 
therefore  will  be  different  with  different  angles  of  view  of 
the  solid. 

3  Shape  Recovery  of  LSHGCs 

To  reconstruct  LSHGCs  from  stereo,  we  first  need  to 
look  at  a  number  of  properties  of  LSHGCs  in  the  images. 
We  present  the  properties  and  propose  a  method  for  the 
reconstruction  in  section  3.1.  Experimental  results  then 
follow  in  section  3.2. 

3.1  The  Method 

Figure 2: Lateral view of the stereo camera geometry. Epipolar planes
in general are neither parallel to one another nor parallel to the
terminator surface of a solid.

An LSHGC is a volume defined by sweeping a given cross-section function
along a straight line called the axis, such that the cross-section is
scaled linearly along the axis. Linking the points on the surface which
correspond to the same unscaled arc length s along the boundaries of the
cross-sections, we have meridians. Because of the linearity in the
sweeping function, all meridians are straight and intersect at a point
on the axis, which we call the apex point, at which the scaling is zero.
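The definition can be made concrete with a small Python construction of an LSHGC and a check that its meridians are straight lines through the apex. The elliptical cross-section, the axis taken along z with cross-sections perpendicular to it, the apex at z = 0, and the scaling rate are all assumptions chosen for illustration, and the cross-section is parameterised by an angle t rather than by arc length.

```python
import math

def cross_section(t):
    """Hypothetical planar cross-section curve (an ellipse), parameterised by t."""
    return 2.0 * math.cos(t), 1.0 * math.sin(t)

def lshgc_point(t, z, apex_z=0.0, rate=0.5):
    """Surface point: the cross-section scaled linearly with distance from the apex."""
    scale = rate * (z - apex_z)
    u, v = cross_section(t)
    return (scale * u, scale * v, z)

def is_collinear(p, q, r, tol=1e-9):
    """True when the cross product of (q - p) and (r - p) vanishes."""
    a = [q[i] - p[i] for i in range(3)]
    b = [r[i] - p[i] for i in range(3)]
    cross = (a[1] * b[2] - a[2] * b[1],
             a[2] * b[0] - a[0] * b[2],
             a[0] * b[1] - a[1] * b[0])
    return max(abs(c) for c in cross) < tol

apex = lshgc_point(0.0, 0.0)   # the scaling is zero here, so every meridian passes through it
meridians_straight = all(
    is_collinear(apex, lshgc_point(t, 1.0), lshgc_point(t, 5.0))
    for t in (0.0, 0.7, 2.0, 4.5)
)
print(apex, meridians_straight)
```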

Lemma  1  The  image  contour  of  an  LSHGC  is  the  pro¬ 
jection  of  one  of  its  meridians  under  orthographic  or  per¬ 
spective  projection. 

Proof Shafer and Kanade [10] have algebraically shown that the contour
generator of an LSHGC is a straight line under orthographic projection.
Here we give a more intuitive proof that the contour generator is in
fact one of the meridians regardless of the projection geometry. Because
of the linearity in the sweeping function, the surface normals to the
LSHGC surface at points along a meridian are all parallel and
perpendicular to the meridian. Say a line of sight from the optical
center C of a camera touches the LSHGC surface at point P as shown in
figure 3. Let m_P be the meridian passing through P. Then both CP and
m_P are perpendicular to the surface normal at point P and they define
the tangent plane π to the surface at point P. Since all surface normals
along a meridian are parallel, the tangent plane π is orthogonal to all
the surface normals along the meridian m_P, and so is any line on the
plane π. As a result any point Q on the meridian is a point on the
contour generator since the line of projection CQ lies on the plane π. □


Figure  3:  Contour  generator  of  an  LSHGC  is  a  meridian. 

Theorem 1 Given four image contours of an LSHGC
in a stereo pair of images, the points of intersection
among their extensions in the two images are projections
of the apex point of the LSHGC and fall on corresponding
epipolar lines under orthographic or perspective projection
(an example illustrating the theorem is shown in
figure 4).

Proof  Since  all  meridians  intersect  at  the  apex  point 
in  3-D,  the  projection  of  meridians  to  any  image  will  also 
intersect  at  the  projection  of  the  apex  point  to  that  im¬ 
age.  By  lemma  1  the  image  contours  in  the  stereo  images 
are  projections  of  meridians;  their  extensions  therefore 
intersect  at  the  images  of  the  same  point  in  3-D,  the 
apex  point.  As  a  result  the  points  of  intersection  fall  on 
corresponding  epipolar  lines.  □ 


Figure  4:  Stereo  correspondence  of  LSHGC  contours. 

Theorem  1  not  only  allows  us  to  hypothesize  an 
LSHGC  from  stereo  images,  but  also  to  recover  the  apex 
point  in  3-D  simply  by  matching  the  image  apex  points 
projected  by  the  image  contours.  Notice  that  a  contour 
in  3-D  on  the  surface  of  the  LSHGC  can  be  recovered  by 
matching  the  terminator  contours  in  stereo.  The  apex 
and  the  3-D  contour  therefore  uniquely  define  an  LSHGC 
which  projects  to  the  four  image  contours,  regardless  of 
how  the  cone  is  cut  at  the  two  ends.  If  the  cuts  are  im¬ 
portant,  we  can  first  recover  their  partial  descriptions  in 
3-D  by  matching  their  images,  and  the  cuts  are  where 
the  partial  descriptions  intersect  with  the  cone  in  3-D. 

Since  we  have  not  specified  any  particular  cut  on  the 
LSHGC,  the  method  works  even  for  cuts  being  non- 
planar  or  non-orthogonal  to  the  axis. 
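
As a rough illustration of how Theorem 1 can be used, the sketch below intersects the extended image contours in each view and triangulates the apex. It assumes an idealized setup that is not taken from the paper: two pinhole cameras with parallel optical axes along +z, a baseline along x, synthetic meridians generated from a known apex, and hypothetical helper names (`project`, `line_through`, `intersect`, `image_apex`). Different meridians are used for the left and right views, as happens with limbs.

```python
def project(point, cam_x, focal):
    """Perspective projection into a camera at (cam_x, 0, 0) looking along +z."""
    x, y, z = point
    return (focal * (x - cam_x) / z, focal * y / z)

def line_through(p, q):
    """Homogeneous 2-D line through image points p and q."""
    return (p[1] - q[1], q[0] - p[0], p[0] * q[1] - p[1] * q[0])

def intersect(l1, l2):
    """Intersection point of two homogeneous 2-D lines."""
    w = l1[0] * l2[1] - l1[1] * l2[0]
    return ((l1[1] * l2[2] - l1[2] * l2[1]) / w,
            (l1[2] * l2[0] - l1[0] * l2[2]) / w)

def image_apex(apex, directions, cam_x, focal):
    """Extend two projected meridians and intersect them in one image."""
    lines = []
    for d in directions:
        # Two visible points on the meridian, away from the apex itself.
        pts = [tuple(a + k * c for a, c in zip(apex, d)) for k in (0.5, 1.0)]
        lines.append(line_through(project(pts[0], cam_x, focal),
                                  project(pts[1], cam_x, focal)))
    return intersect(lines[0], lines[1])

focal, baseline = 1.0, 0.25          # assumed camera parameters
apex = (0.2, 0.5, 4.0)               # ground truth, used only to build the scene
left = image_apex(apex, [(0.3, -1.0, 0.2), (-0.4, -1.0, 0.1)], 0.0, focal)
right = image_apex(apex, [(0.35, -1.0, 0.15), (-0.35, -1.0, 0.05)], baseline, focal)

# For this camera geometry corresponding epipolar lines are image rows,
# so the two image apex points should share the same y coordinate.
z = focal * baseline / (left[0] - right[0])
apex_3d = (left[0] * z / focal, left[1] * z / focal, z)
print(left, right, apex_3d)
```

With real data the four contours would come from the labelled limb edges, and the equal-row check would become a test against the calibrated epipolar geometry.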

3.2  Experimental  Results 

Results on synthetic images of an LSHGC are shown
in figure 5. We begin with a stereo pair of intensity images
and compute the edges using Canny's edge detector
[2]. We then compute the hierarchical descriptions from
each image using a perceptual grouping technique [6;
5], and match those descriptions in stereo [3]. The output
of such processes is segmented and matched surfaces
in stereo. Junctions are also labelled as either limb-junctions
or real junctions, and from them limb edges
are also identified. Finally we determine the image apex
points of the LSHGC by extending the limb edges, and
recover the volumetric descriptions of the object in the
scene.

4  Shape  Recovery  of  SHGCs 

Ulupinar and Nevatia [11] have recently demonstrated
that the shape of an SHGC can be recovered merely
from a single line drawing. As the problem is highly
underconstrained with one view, the method has to make
use of some perceptual assumptions which are believed
to give results consistent with human perception. It is
also restricted to using orthographic projection as its
projection model.

Figure 5: Results for a synthetic scene of an LSHGC
(from top to bottom): the left and right images; the left
and right matched ribbons and identified junctions; the
left and right extracted image apex points; the recovered
volumetric descriptions projected to the left view.

In  the  following  we  show  that  if  we  have  a  second 
“eye”,  there  is  a  more  direct  method  to  compute  the 
shape  as  well  as  the  pose  of  an  SHGC  without  making 
those  perceptual  assumptions.  In  addition,  our  method 
can  also  handle  images  projected  under  perspective  pro¬ 
jection.  We  do  assume  that  the  cross-section  function 
is  visible  from  at  least  one  of  the  cuts  of  the  SHGC. 
However,  we  do  not  require  the  cross-sections  to  be  or¬ 
thogonal  to  the  axis,  i.e.,  oblique  SHGCs  are  allowed. 

4.1  Hypothesizing  SHGCs 

To reconstruct SHGCs in a scene, we first have to find evidence
for their existence in the images. Ponce et al. [7]
have shown an important theorem concerning the tangents
to the image contours of an SHGC under orthographic
projection. Ulupinar and Nevatia [11] extended
it to perspective projection and gave a more intuitive
proof of the theorem. The theorem can be stated as
follows: If two image contour points of an SHGC are from
the same cross-section, the tangents to the contours at
these points when extended intersect on the projection of
the axis. Using this property of SHGCs, we can hypothesize
the existence of an SHGC by establishing pairwise
correspondences between points on the image contours
in each image such that their tangents intersect on the
same straight line, with the straight line so derived being
the projection of its axis.

The pairwise correspondences and the projected axis
in each image can be estimated using the Hough Transform
as in [7], and confirmed by checking whether all
corresponding pairs of points follow the same order and are
continuous along the contours.

After  we  collect  enough  strong  evidence  of  the  exis¬ 
tence  of  an  SHGC,  the  next  step  is  to  reconstruct  it. 
The  idea  is  simple.  Matching  the  projections  of  the  axis 
in  the  stereo  images  will  automatically  recover  the  3-D 
position  of  the  axis.  Similarly,  matching  the  projections 
of  the  terminator  boundaries  will  recover  the  3-D  shape 
of  the  cross-section  function.  What  is  left  is  to  scale  the 
cross-section  function  in  3-D  so  that  it  touches  the  lines 
of  projections  from  the  same  cross-section  of  the  solid. 

The problem is that we first have to set up correspondences
across the stereo images so that we know
which pair of points in the left image and which pair of
points in the right image are projected from the same
cross-section. This is nontrivial because, if the contour
generators are limbs, the corresponding points in the images
are in fact projections of four different points on
the surface of the SHGC. Here let us call the corresponding
points in the left and right images projected from any given
cross-section $p_{l1}$, $p_{l2}$ and $p_{r1}$, $p_{r2}$ respectively.

4.2  Setting  Up  Correspondences  across  Stereo 
Images 

The correspondence problem is simpler when the contour
generators are creases instead of limbs, in which case $p_{l1}$
and $p_{r1}$ are projections of the same point in 3-D and
will fall on corresponding epipolar lines, and so will $p_{l2}$
and $p_{r2}$ (see figure 6). This shows the importance of
identifying the nature of the visible surface boundaries
as being creases or limbs.


Figure  6:  Corresponding  points  on  the  image  contours  of 
an  SHGC  which  belong  to  the  same  cross-section,  when 
the  contour  generators  are  creases. 

To solve the correspondence problem when the edges
are limb edges, we first examine a number of properties
of SHGCs in their projections. Let us first define the
tangent to a surface at a point P in the direction of a
line L in 3-D to be the tangent which lies on the plane
containing the point P and the line L.

Lemma 2 (Shafer and Kanade [10]) Given points
on the surface of an SHGC that belong to the same cross-section,
the tangents to the surface at these points in
the direction of the axis when extended intersect at a
common point on the axis (see figure 7).

Following  Shafer  and  Kanade,  we  call  the  common 
point  of  intersection  on  the  axis  the  apex  point,  and  the 
tangents  in  the  direction  of  the  axis  the  apex  tangents 
of  the  given  cross-section.  We  also  call  the  2-D  projec¬ 
tions  of  the  apex  point  and  the  apex  tangents  on  any 
image  plane  the  image  apex  point  and  image  apex  tan¬ 
gents.  Notice  that  different  cross-sections  of  an  SHGC 
generally  have  different  sets  of  apex  tangents  and  differ¬ 
ent  apex  points  on  the  axis. 


Figure 7: The apex point of a given cross-section of an
SHGC.

Ulupinar and Nevatia [11] have also discovered a property
of the limb edge of any curved surface:

Lemma 3 (Ulupinar and Nevatia [11]) All the tangent
lines to a surface at a point, say P, which is on a
limb edge of the surface under any given projection geometry,
project as the same line on the image plane.

This property, in combination with lemma 2, implies
that tangents to the limb edges at points which belong
to the same cross-section, say $p_{l1}$ and $p_{l2}$, are in fact
equivalent to the 2-D projections of the apex tangents
at those points. As a result their point of intersection in
the image will also be the 2-D projection of the point of
intersection of the apex tangents. This has been included
in the work of [7] and [11] when they prove that tangents
to the limb edges intersect on the projection of the axis.
We rephrase it as follows:

Lemma 4 (Ponce et al. [7], Ulupinar and Nevatia
[11]) Given two points on the limb edges of an SHGC
that belong to the same cross-section, the point of intersection
between the tangents at those points is the image
apex point of that particular cross-section.

Combining lemmas 2 and 4, we get the following theorem:

Theorem 2 If four points on the limb edges of an
SHGC in a stereo pair of images belong to the same
cross-section, the points of intersection among the tangents
at those points in the two images fall on corresponding
epipolar lines (an example illustrating the theorem
is given in figure 8).

Proof By lemma 4, the tangents at the image points
intersect at the image apex point of that cross-section in
each image. Since the four points are from the same
cross-section, and by lemma 2 the apex point is uniquely
defined for each cross-section, the two image apex points
in the stereo images are in fact projections of the same
point. As a result the two image apex points fall on
corresponding epipolar lines. □


Figure  8:  Corresponding  points  on  the  image  contours  of 
an  SHGC  which  belong  to  the  same  cross-section,  when 
the  contour  generators  are  limbs. 

Theorem  2  gives  one  way  to  establish  correspondences 
among  points  on  the  limb  edges  which  belong  to  the 
same cross-section. What remains is to scale the
cross-section  function  in  3-D  to  project  to  the  four  corre¬ 
sponding  points.  This  is  still  nontrivial  as  the  four  points 
correspond  to  four  lines  of  projection  which  in  general 
are  not  coplanar.  The  following  describes  a  non-iterative 
method  of  how  we  can  compute  the  cross-section  in  3- 
D  for  each  set  of  correspondences.  Remember  that  the 
cross-section  function  in  3-D  can  be  computed  by  match¬ 
ing  the  terminator  contours  in  stereo.  Similarly,  the  axis 
of  the  SHGC  in  3-D  can  be  recovered  by  matching  the 
image  axes. 

4.3 Fitting the Cross-Sections

We have obtained for every cross-section slice of the
SHGC two corresponding image apex points and four
image apex tangents in the stereo images. Here we treat
the recovery problem as one of recovering a virtual LSHGC
whose apex is the apex point of the cross-section, whose
meridians are the apex tangents of that cross-section,
and whose cut is the cross-section itself. Following the
scheme in section 3, we first derive the LSHGC and then
determine the proper cut that coincides with the scene
data.

By matching the image apex points we can recover
the apex A of the cone (see figure 9). We then move
the cross-section function down the axis in 3-D to some
distance, say t, from the apex. We call the new axis
point C(t). Pick one of the four image apex tangents,
say the one at point $p_{l1}$. The image apex tangent and
the optical center of the corresponding camera form a
plane of projection which we call $\pi$, which the virtual
cone should touch. For each point s along the boundary
of the cross-section function, we compute two measures:
the distance r(s) from C(t) to the boundary point s,
and the distance R(s) from C(t) through the point s
to the plane $\pi$. The fraction R(s)/r(s) is the scaling
of the cross-section such that the point s touches the
tangent plane $\pi$. The proper scaling of the cross-section
function to touch the plane $\pi$, regardless of whether
the edges are limb edges or real edges, can therefore be
computed as:

$$\mathrm{scale}(t) = \min_{s} \frac{R(s)}{r(s)}$$

Figure 9: Recovering a cross-section.

The  apex  point  A,  the  axis,  and  the  scale  function 
scale{t)  at  distance  t  along  the  axis  uniquely  define  a 
virtual  LSHGC  which  gives  rise  to  the  image  apex  tan¬ 
gents. 
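
The scaling rule can be sketched in a few lines. This is a minimal illustration under assumptions of our own: the cross-section boundary is given as a list of 3-D sample points already placed around C(t), the plane $\pi$ is given by a point and a normal, and the names `scale_to_touch`, `sub` and `dot` are hypothetical. Along the ray from C(t) through a boundary point s, the plane is reached at a multiple k of the vector from C(t) to s, so k is exactly the ratio R(s)/r(s).

```python
import math

def sub(a, b):
    return tuple(x - y for x, y in zip(a, b))

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def scale_to_touch(center, boundary, plane_point, plane_normal):
    """Smallest ratio R(s)/r(s) over the boundary samples, or None if no
    ray from the center through a boundary point reaches the plane."""
    best = None
    offset = dot(plane_normal, sub(plane_point, center))
    for s in boundary:
        ray = sub(s, center)
        denom = dot(plane_normal, ray)
        if abs(denom) < 1e-12:
            continue  # ray parallel to the plane (or s coincides with the center)
        k = offset / denom  # center + k * ray lies on the plane, so k = R(s)/r(s)
        if k > 0 and (best is None or k < best):
            best = k
    return best

# Example: unit circle around the origin in the xy-plane, plane x = 2.
circle = [(math.cos(2 * math.pi * i / 360), math.sin(2 * math.pi * i / 360), 0.0)
          for i in range(360)]
scale = scale_to_touch((0.0, 0.0, 0.0), circle, (2.0, 0.0, 0.0), (1.0, 0.0, 0.0))
print(scale)
```

In the example the circle has to grow by a factor of 2 before it touches the plane, and the boundary sample that attains the minimum is the point of contact.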

The next step is to recover the proper cut of the cone
to project to the four corresponding points. From the
above process we can easily recover the point of contact
P(t) between the projection plane $\pi$ and the scaled
cross-section of the cone at distance t from the apex.
The line AP(t) then defines the contour generator on
the cone which projects to the given image apex tangent.
Notice that the contour generator of an LSHGC
has to be a straight line. Let us say that the proper cut
is at a distance $t_1$ from the apex along the axis, and the
proper scaling of the cross-section function on that plane
is $\mathrm{scale}(t_1)$. The point $P(t_1)$ on the surface of the cone
which projects to point $p_{l1}$ is then given by the intersection
of two lines: the line of projection through point $p_{l1}$,
and the contour generator AP(t). Finally, using the property
of similar triangles we get:

$$t_1 = \frac{AP(t_1)}{AP(t)}\, t$$

$$\mathrm{scale}(t_1) = \frac{AP(t_1)}{AP(t)}\, \mathrm{scale}(t)$$
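
The cut recovery can be sketched as follows. The sketch makes one assumption beyond the text: because measured lines rarely intersect exactly, it takes the point on the contour generator closest to the line of projection as $P(t_1)$. The function and argument names (`cut_from_generator`, `contact`, `sight_dir`) are hypothetical.

```python
def sub(a, b):
    return tuple(x - y for x, y in zip(a, b))

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def cut_from_generator(apex, contact, t, scale_t, cam_center, sight_dir):
    """Return (t1, scale(t1)) for the cut seen along one line of projection.

    apex       -- apex A of the virtual cone
    contact    -- point of contact P(t) at distance t along the axis
    cam_center -- optical center of the camera
    sight_dir  -- direction of the line of projection through the image point
    """
    d1 = sub(contact, apex)   # along the contour generator A -> P(t)
    d2 = sight_dir            # along the line of projection
    w0 = sub(apex, cam_center)
    a, b, c = dot(d1, d1), dot(d1, d2), dot(d2, d2)
    d, e = dot(d1, w0), dot(d2, w0)
    # Parameter u of the point on the generator closest to the sight line;
    # u is the length ratio AP(t1)/AP(t) used in the similar-triangle step.
    u = (b * e - c * d) / (a * c - b * b)
    return u * t, u * scale_t

t1, scale_t1 = cut_from_generator(
    apex=(0.0, 0.0, 4.0), contact=(0.5, -1.0, 4.0), t=1.0, scale_t=0.5,
    cam_center=(0.0, 0.0, 0.0), sight_dir=(1.25, -2.5, 4.0))
print(t1, scale_t1)
```

In this synthetic example the line of projection meets the generator at 2.5 times the distance of P(t) from the apex, so both t and scale(t) are multiplied by 2.5.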

The above cone and cut recovery processes can be applied
to any one of the four image apex tangents. In
principle they should all return the same cross-section,
unless the object is in fact not an SHGC but merely
looks like an SHGC, or the cut being used as the cross-section
function is actually not along one of the cross-sections.
This serves as an additional verification from
the reconstruction process. If the object is indeed
an SHGC, our implementation computes $t_1$
and $\mathrm{scale}(t_1)$ separately from each of the image apex
tangents, and averages them.

4.4  Experimental  Results 

Results on a stereo pair of real images of a typical desk
lamp are shown in figure 10. The cameras were configured
such that the optical axes were parallel, with a
baseline approximately 25cm long. The lamp was
about 75cm away from the cameras. Both cameras
have a spatial resolution of 512 by 480 with 8 bits of
grey scale. Here we show more details about the intermediate
steps used in the hierarchical stereo matching
system proposed in [3]. Edges are detected from each
image using Canny's edge-detector [2], and are linked
into edge-contours based on eight-neighbor connectivity.
Edge-contours are segmented into curves at curvature
extrema so that every curve is smooth in itself, and
curves are grouped into contours based on continuity.
Symmetries are then detected from each pair of approximately
symmetrical contours, and they form ribbons if
they have proper closures at both ends of the symmetries.
The closure at the end of a symmetry can be composed
of a curve, a set of multiple curves, or the ends
of other symmetries. Very small symmetries are ignored
to save computation time. Still, a large number of symmetries
are left and they form many conflicting ribbons.
The ribbons then go through a selection process based
on a number of constraints among the ribbons. The
selected ribbons and the hierarchies of descriptions in
the two images are then used for stereo correspondence.
Junctions are extracted from the matched ribbons and
labelled as limb-junctions or real junctions from their
behavior across the stereo images. Limb edges are also
identified during this step. The details of such perceptual
grouping and stereo matching processes are given in [3].

We then group neighboring ribbons which share
smooth boundaries into objects. Notice that the lamp
object is somewhat complex as it consists of two neighboring
sections of curved surfaces. Using the Hough
Transform method mentioned in section 4.1, we are able
to derive the SHGC axes of both curved sections and
identify that they share the same axis. We then
treat the two curved sections as one single SHGC and
derive the volumetric descriptions as in the previous example.
In figure 10 we overlay the recovered descriptions
on the left image to illustrate the performance of our
method. Notice that the perspective distortion in the
images is significant; the eccentricities of the projected
ellipses change gradually along the axis from one end to
the other.

5  Conclusion 

In this paper we examined the problem of deriving volumetric
shape from stereo images. We emphasize that
intermediate 2½-D dense or wire-frame descriptions may
not always be available from stereo, which is basically
why  shape  from  stereo  cannot  be  treated  as  merely  an  in¬ 
tegration  of  two  modules:  depth  from  stereo,  and  shape 
from  range  data.  As  a  result,  the  volumetric  reconstruc¬ 
tion  may  have  to  be  computed  directly  from  stereo  cor¬ 
respondences.  We  have  described  how  volumetric  shape 
can  be  reconstructed  from  stereo  by  using  some  shape 
primitives  such  as  LSHGCs  and  SHGCs.  The  meth¬ 
ods  are  based  on  some  invariant  properties  of  the  shape 
models  in  their  2-D  projections.  Such  properties  are 
not  all  monocular;  we  have  proposed  some  properties  in 
stereo which further help in confirming and reconstructing
LSHGCs and SHGCs from stereo images.

Our  technique  allows  dense  surface  descriptions  to  be 
recovered  even  for  objects  without  any  texture  at  all 
(texture  helps  provide  dense  data  in  stereo),  and  it  is 
not  restricted  to  narrow  stereo  angles  or  low  resolution 
images.  Our  technique  can  also  handle  objects  at  close 
range,  which  is  in  fact  where  stereo  is  most  effective, 
without  being  affected  by  any  possible  perspective  dis¬ 
tortion  in  the  projected  images.  We  have  shown  results 
for  objects  with  circular  cross-sections,  but  our  method  is 
not  limited  to  that.  In  addition,  our  method  can  even  al¬ 
low  LSHGC  objects  with  arbitrary  cuts  across  the  cones, 
as well as SHGC objects with oblique cross-sections.

Figure 10: Results of hierarchical stereo matching and
volumetric  shape  recovery  for  a  scene  of  a  lamp  (from 
top  to  bottom):  the  left  and  right  images;  the  left  and 
right  edges;  the  left  and  right  symmetries;  the  left  and 
right  matched  ribbons  and  identified  junctions;  the  left 
and  right  extracted  image  axes  and  the  correspondences 
across  stereo  views;  the  recovered  volumetric  descrip¬ 
tions  projected  to  and  overlaid  with  the  left  view. 

Part I: Theory

M.  Okutomi  and  T.  Kanade 

Abstract 

This  paper  presents  a  stereo  matching  method  which  uses  mul¬ 
tiple  stereo  pairs  with  various  baselines  to  obtain  precise  distance 
estimates  without  suffering  from  ambiguity. 

In stereo processing, a short baseline means that the estimated
distance will be less precise due to narrow triangulation. For
more precise distance estimation, a longer baseline is desired.
With a longer baseline, however, a larger disparity range must be
searched to find a match. As a result, matching is more difficult
and there is a greater possibility of a false match. So there is a
trade-off between precision and accuracy in matching.

The stereo matching method presented in this paper uses multiple
stereo pairs with different baselines generated by a lateral
displacement of a camera. Matching is performed simply by computing
the sum of squared-difference (SSD) values. The SSD functions
for individual stereo pairs are represented with respect to
the inverse distance (rather than the disparity, as is usually done),
and then are simply added to produce the sum of SSDs. This resulting
function is called the SSSD-in-inverse-distance. We show
that the SSSD-in-inverse-distance function exhibits a unique and
clear minimum at the correct matching position even when the
underlying intensity patterns of the scene include ambiguities or
repetitive patterns. An advantage of this method is that we can
eliminate false matches and increase precision without any search
or sequential filtering.

This paper first defines a stereo algorithm based on the
SSSD-in-inverse-distance and presents a mathematical analysis to show
how the algorithm can remove ambiguity and increase precision.
Then, a few experimental results with real stereo images are
presented to demonstrate the effectiveness of the algorithm.

1 Introduction

Stereo is a useful technique for obtaining 3-D information from
2-D images in computer vision. In stereo matching, we measure
the disparity d, which is the difference between the corresponding
points of left and right images. The disparity d is related to the
distance z by

$$d = BF\,\frac{1}{z} \qquad (1)$$

where B and F are baseline and focal length, respectively.
This equation indicates that for the same distance the disparity
is proportional to the baseline, or that the baseline length B acts
as a magnification factor in measuring d in order to obtain z. That
is, the estimated distance is more precise if we set the two cameras
farther apart from each other, which means a longer baseline.
A longer baseline, however, poses its own problem. Because a
longer disparity range must be searched, matching is more difficult
and thus there is a greater possibility of a false match. So
there is a trade-off between precision and accuracy (correctness)
in matching.
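
A short numerical sketch of this trade-off, under assumed values that do not come from the paper (focal length in pixels, distance and baselines in metres, and a fixed half-pixel disparity error): the same disparity error produces a smaller distance error as the baseline grows, while the disparity that has to be searched grows in proportion.

```python
def disparity(baseline, focal, z):
    """Equation (1): d = B * F / z."""
    return baseline * focal / z

def distance(baseline, focal, d):
    """Equation (1) solved for z."""
    return baseline * focal / d

focal = 500.0      # assumed focal length in pixels
z_true = 10.0      # assumed true distance
d_error = 0.5      # assumed disparity measurement error in pixels

disparities, errors = [], []
for baseline in (0.1, 0.2, 0.4):
    d = disparity(baseline, focal, z_true)
    z_est = distance(baseline, focal, d + d_error)
    disparities.append(d)
    errors.append(abs(z_est - z_true))
    print(baseline, d, errors[-1])
```

Doubling the baseline roughly halves the distance error here, but it also doubles the disparity range that a matcher would need to cover.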

One of the most common methods to deal with the problem is a
coarse-to-fine control strategy [1]-[5]. Matching is done at a low
resolution to reduce false matches and then the result is used to limit
the search range of matching at a higher resolution, where more
precise disparity measurements are calculated. Using a coarse
resolution, however, does not always remove false matches. This
is especially true when there is inherent ambiguity in matching,
such as a repeated pattern over a large part of the scene (e.g., a scene
of a picket fence). Another approach to remove false matches
and to increase precision is to use multiple images, especially a
sequence of densely sampled images along a camera path [6]-[9].
A short baseline between a pair of consecutive images makes the
matching or tracking of features easy, while the structure imposed
by the camera motion allows integration of the possibly noisy
individual measurements into a precise estimate. The integration
has been performed either by exploiting constraints on the EPI
[6][7] or by a sequential Kalman filtering technique [8][9].

The stereo matching method presented in this paper belongs to
the second approach: use of multiple images with different baselines
obtained by a lateral displacement of a camera. The matching
technique, however, is based on the idea that global mismatches
can be reduced by adding the sum of squared-difference (SSD)
values from multiple stereo pairs. That is, the SSD values are
computed first for each pair of stereo images. We represent the
SSD values with respect to the inverse distance 1/z (rather than
the disparity d, as is usually done). The resulting SSD functions
from all stereo pairs are added together to produce the sum of
SSDs, which we call SSSD-in-inverse-distance. We show that
the SSSD-in-inverse-distance function exhibits a unique and clear
minimum at the correct matching position even when the underlying
intensity patterns of the scene include ambiguities or repetitive
patterns.

There have been stereo techniques that use multiple image pairs
taken by cameras which are arranged along a line [10][11][12],
in the form of a triangle [13][14][15] (called trinocular stereo),
or in another formation [16]. However, all of these techniques,
except [10] and [16], decide candidate points for correspondence
in each image pair and then search for the correct combinations of
correspondences among them using the geometrical consistencies
that they must satisfy. Since the intermediate decisions on
correspondences are inherently noisy, ambiguous and multiple, finding
the correct combinations requires sophisticated consistency checks
and search or filtering. In contrast, our method does not make any
decisions about the correspondences in each stereo image pair.
Instead, it simply accumulates the measures of matching (SSDs) from
all the stereo pairs into a single evaluation function, i.e.,
SSSD-in-inverse-distance, and then obtains one corresponding point from it.
In other words, our method integrates evidence for a final decision,
rather than filtering intermediate decisions. In this sense, Tsai [16]
employed a strategy very similar to ours: he used multiple images
to sharpen the peaks of his overall similarity measures, which he
called JMM and WVM. However, the relationship between the
improvement of the similarity measures and the camera baseline
arrangement was not analyzed, nor was the method tested with real
imagery. In this paper we show both mathematical analysis and
experimental results with real indoor and outdoor images, which
demonstrate how the SSSD-in-inverse-distance function based on
multiple image pairs from different baselines can greatly reduce
false matches, while improving precision.

In the next section we present the method mathematically and
show how ambiguity can be removed and precision increased by
the method. Section 3 provides a few experimental results with real
stereo images to demonstrate the effectiveness of the algorithm.
Section 4 presents conclusions.

2 Mathematical Analysis

The essence of stereo matching is, given a point in one image, to
find in another image the corresponding point, such that the two
points are the projections of the same physical point in space. This
task usually requires some criterion to measure similarity between
images. The sum of squared differences (SSD) of the intensity
values (or values of preprocessed images, such as bandpass filtered
images) over a window is the simplest and most effective criterion.
In this section, we define the sum of SSD with respect to the inverse
distance (SSSD-in-inverse-distance) for multiple-baseline stereo,
and mathematically show its advantage in removing ambiguity
and increasing precision. For this analysis, we use 1-D stereo
intensity signals, but the extension to two dimensional images is
straightforward.

2.1 SSD Function

Suppose that we have cameras at positions $P_0, P_1, \ldots, P_n$ along a
line with their optical axes perpendicular to the line and a resulting
set of stereo pairs with baselines $B_1, B_2, \ldots, B_n$ as shown in
figure 1. Let $f_0(x)$ and $f_i(x)$ be the image pair at the camera
positions $P_0$ and $P_i$, respectively. Imagine a scene point Z whose
distance is z. Its disparity $d_{r(i)}$ for the image pair taken from $P_0$
and $P_i$ is

$$d_{r(i)} = \frac{B_i F}{z}. \qquad (2)$$

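We model the image intensity functions $f_0(x)$ and $f_i(x)$ near the
matching positions for Z as

$$f_0(x) = f(x) + n_0(x), \qquad f_i(x) = f(x - d_{r(i)}) + n_i(x), \qquad (3)$$

assuming constant distance near Z and independent Gaussian
white noise such that

$$n_0(x),\, n_i(x) \sim N(0, \sigma_n^2). \qquad (4)$$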

The SSD value $e_{d(i)}$ over a window $W$ at a pixel position $x$ of
image $f_0(x)$ for the candidate disparity $d_{(i)}$ is defined as

$$e_{d(i)}(x, d_{(i)}) = \sum_{j \in W} \bigl(f_0(x+j) - f_i(x + d_{(i)} + j)\bigr)^2, \tag{5}$$

where $\sum_{j \in W}$ means summation over the window. The $d_{(i)}$
that gives a minimum of $e_{d(i)}(x, d_{(i)})$ is determined as the estimate
of the disparity at $x$. Since the SSD measurement $e_{d(i)}(x, d_{(i)})$ is
a random variable, we will compute its expected value in order to
analyze its behavior:

$$\begin{aligned}
E[e_{d(i)}(x, d_{(i)})]
&= E\Bigl[\sum_{j \in W} \bigl(f(x+j) - f(x + d_{(i)} - d_{r(i)} + j)
   + n_0(x+j) - n_i(x + d_{(i)} + j)\bigr)^2\Bigr] \\
&= \sum_{j \in W} \bigl(f(x+j) - f(x + d_{(i)} - d_{r(i)} + j)\bigr)^2 + 2 N_w \sigma_n^2,
\end{aligned} \tag{6}$$

where $N_w$ is the number of the points within the window. For
the rest of the paper, $E[\,\cdot\,]$ denotes the expected value of a random
variable. In deriving the above equation, we have assumed that
$d_{r(i)}$ is constant over the window. Equation (6) says that, naturally,
the SSD function $e_{d(i)}(x, d_{(i)})$ is expected to take a minimum
when $d_{(i)} = d_{r(i)}$, i.e., at the right disparity.

Let us examine how the SSD function $e_{d(i)}(x, d_{(i)})$ behaves
when there is ambiguity in the underlying intensity function. Suppose
that the intensity signal $f(x)$ has the same pattern around
pixel positions $x$ and $x + a$,

$$f(x + j) = f(x + a + j), \quad j \in W, \tag{7}$$

where $a \neq 0$ is a constant. Then, from equation (6),

$$E[e_{d(i)}(x, d_{r(i)})] = E[e_{d(i)}(x, d_{r(i)} + a)] = 2 N_w \sigma_n^2. \tag{8}$$

This means that ambiguity is expected in matching in terms of
positions of minimum SSD values. Moreover, the false match at
$d_{r(i)} + a$ appears in exactly the same way for all $i$; it is separated
from the correct match by $a$ for all the stereo pairs. Using multiple
baselines does not help to disambiguate.


2.2  SSD with respect to Inverse Distance

Now, let us introduce the inverse distance $\zeta$ such that

$$\zeta = \frac{1}{z}. \tag{9}$$

From equations (2) and (9),

$$d_{r(i)} = B_i F \zeta_r \tag{10}$$

$$d_{(i)} = B_i F \zeta \tag{11}$$

where $\zeta_r$ and $\zeta$ are the real and the candidate inverse distance,
respectively. Substituting equation (11) into (5), we have the SSD
with respect to the inverse distance,

$$e_{\zeta(i)}(x, \zeta) = \sum_{j \in W} \bigl(f_0(x+j) - f_i(x + B_i F \zeta + j)\bigr)^2, \tag{12}$$

at position $x$ for a candidate inverse distance $\zeta$. Its expected value
is

$$E[e_{\zeta(i)}(x, \zeta)] = \sum_{j \in W} \bigl(f(x+j) - f(x + B_i F(\zeta - \zeta_r) + j)\bigr)^2 + 2 N_w \sigma_n^2. \tag{13}$$

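Finally, we define a new evaluation function $e_{\zeta(12\cdots n)}(x, \zeta)$, the
sum of SSD functions with respect to the inverse distance
(SSSD-in-inverse-distance) for multiple stereo pairs. It is obtained by
adding the SSD functions $e_{\zeta(i)}(x, \zeta)$ for individual stereo pairs:

$$e_{\zeta(12\cdots n)}(x, \zeta) = \sum_{i=1}^{n} e_{\zeta(i)}(x, \zeta). \tag{14}$$

Its expected value is

$$\begin{aligned}
E[e_{\zeta(12\cdots n)}(x, \zeta)]
&= \sum_{i=1}^{n} E[e_{\zeta(i)}(x, \zeta)] \\
&= \sum_{i=1}^{n} \sum_{j \in W} \bigl(f(x+j) - f(x + B_i F(\zeta - \zeta_r) + j)\bigr)^2 + 2 n N_w \sigma_n^2.
\end{aligned} \tag{15}$$

In the next three subsections, we will analyze the characteristics of
these evaluation functions to see how ambiguity is removed and
precision is improved.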

2.3  Elimination of Ambiguity (1)

As before, suppose the underlying intensity pattern $f(x)$ has the
same pattern around $x$ and $x + a$ (equation (7)). Then, according
to equation (13), we have

$$E[e_{\zeta(i)}(x, \zeta_r)] = E\Bigl[e_{\zeta(i)}\Bigl(x, \zeta_r + \frac{a}{B_i F}\Bigr)\Bigr] = 2 N_w \sigma_n^2. \tag{16}$$

We still have an ambiguity; a minimum is expected at a false
inverse distance $\zeta_f = \zeta_r + a/(B_i F)$. However, an important point
to be observed here is that this minimum for the false inverse
distance $\zeta_f$ changes its position as the baseline $B_i$ changes, while
the minimum for the correct inverse distance $\zeta_r$ does not. This
is the property that the new evaluation function, the
SSSD-in-inverse-distance (14), exploits to eliminate the ambiguity. For
example, suppose we use two baselines $B_1$ and $B_2$ ($B_1 \neq B_2$).
From equation (15),

$$\begin{aligned}
E[e_{\zeta(12)}(x, \zeta)]
&= \sum_{j \in W} \bigl(f(x+j) - f(x + B_1 F(\zeta - \zeta_r) + j)\bigr)^2 \\
&\quad + \sum_{j \in W} \bigl(f(x+j) - f(x + B_2 F(\zeta - \zeta_r) + j)\bigr)^2 + 4 N_w \sigma_n^2.
\end{aligned} \tag{17}$$

We can prove that

$$E[e_{\zeta(12)}(x, \zeta)] > 4 N_w \sigma_n^2 = E[e_{\zeta(12)}(x, \zeta_r)] \quad \text{for } \zeta \neq \zeta_r \tag{18}$$

(refer to appendix A). In words, $e_{\zeta(12)}(x, \zeta)$ is expected to have the
smallest value at the correct $\zeta_r$. That is, the ambiguity is likely
to be eliminated by use of the new evaluation function with two
different baselines.


Figure 2: Expected values of evaluation functions: (a) Underlying
function; (b) $E[e_{d(1)}]$; (c) $E[e_{d(2)}]$; (d) $E[e_{\zeta(1)}]$; (e) $E[e_{\zeta(2)}]$;
(f) $E[e_{\zeta(12)}]$


We can illustrate this using synthesized data. Suppose the point
whose distance we want to determine is at $x = 0$ and the underlying
function $f(x)$ is given by

$$f(x) = \begin{cases}
\cos\bigl(\tfrac{\pi}{4} x\bigr) + 2 & \text{if } -4 < x < 12 \\
1 & \text{if } x \le -4 \text{ or } 12 \le x.
\end{cases} \tag{19}$$

Figure 2 (a) shows a plot of $f(x)$. Assuming that $d_{r(1)} = 5$,
$\sigma_n^2 = 0.2$, and the window size is 5, the expected values of the
SSD function $e_{d(1)}(x, d_{(1)})$ are as shown in figure 2 (b). We see
that there is an ambiguity: the minima occur at the correct match
$d_{(1)} = 5$ and at the false match $d_{(1)} = 13$. Which match will
be selected will depend on the noise, search range, and search
strategy. Now suppose we have a longer baseline $B_2$ such that
$B_2 / B_1 = 1.5$. From equations (6) and (10), we obtain $E[e_{d(2)}]$ as
shown in figure 2 (c). Again we encounter an ambiguity, and the
separation of the two minima is the same.

Now let us evaluate the SSD values with respect to the inverse
distance $\zeta$ rather than the disparity $d$ by using equations (12)
through (15). The expected values of the SSD measurements
$E[e_{\zeta(1)}]$ and $E[e_{\zeta(2)}]$ with baselines $B_1$ and $B_2$ are shown in
figures 2 (d) and (e), respectively (the plot is normalized such
that $B_1 F = 1$). Note that the minimum at the correct inverse
distance ($\zeta = 5$) does not move, while the minimum for the false
match changes its position as the baseline changes. When the two
functions are added to produce the SSSD-in-inverse-distance, its
expected values $E[e_{\zeta(12)}]$ are as shown in figure 2 (f). We can see
that the ambiguity has been reduced because the
SSSD-in-inverse-distance has a smaller value at the correct match position
than at the false match.

Figure 3: "Town" data set: (a) Image0; (b) Image9

Figure 4: "Town" data set image sequence (image0 through image9,
with baselines b, 2b, ..., 9b)
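
The synthetic example can be checked numerically. The following Python sketch evaluates only the noise-free part of the expected SSD in inverse distance (the sum in equation (13), leaving out the constant noise term) for a cosine pattern with two repetitions, using $B_1 F = 1$, $B_2 F = 1.5$, $\zeta_r = 5$, and a 5-point window. The function names, the constant value 1 outside the cosine interval, and the search grid are assumptions made for illustration, not part of the original experiment.

```python
import math

def f(x):
    """Underlying intensity: two periods of a cosine with period 8.
    The value 1 outside (-4, 12) is an assumption of this sketch."""
    return math.cos(math.pi * x / 4.0) + 2.0 if -4.0 < x < 12.0 else 1.0

WINDOW = range(-2, 3)  # 5-point window W centred on x

def expected_ssd(x, zeta, zeta_r, bf):
    """Noise-free part of E[e_zeta(i)(x, zeta)] for one pair with product bf = B_i * F."""
    shift = bf * (zeta - zeta_r)
    return sum((f(x + j) - f(x + shift + j)) ** 2 for j in WINDOW)

def expected_sssd(x, zeta, zeta_r, bfs):
    """Noise-free part of the expected SSSD-in-inverse-distance (sum over pairs)."""
    return sum(expected_ssd(x, zeta, zeta_r, bf) for bf in bfs)

ZETA_R = 5.0
grid = [0.25 * k for k in range(81)]  # candidate inverse distances 0 .. 20

# Single baseline B_1: the false match at zeta = 13 is (numerically) as good as zeta = 5.
single_true = expected_ssd(0, 5.0, ZETA_R, 1.0)
single_false = expected_ssd(0, 13.0, ZETA_R, 1.0)

# Two baselines: only the true inverse distance remains a zero of the summed function.
best = min(grid, key=lambda z: expected_sssd(0, z, ZETA_R, [1.0, 1.5]))
print(single_true, single_false, best)
```

With a single baseline the values at $\zeta = 5$ and $\zeta = 13$ are both (numerically) zero, whereas the summed function has its minimum over the grid only at $\zeta = 5$.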

2.4  Elimination of Ambiguity (2)

An extreme case of ambiguity occurs when the underlying function
$f(x)$ is a periodic function, like a scene of a picket fence. We can
show that this ambiguity can also be eliminated.

Let $f(x)$ be a periodic function with period $T$. Then, each
$e_{\zeta(i)}(x, \zeta)$ is expected to be a periodic function of $\zeta$ with the
period $T/(B_i F)$. This means that there will be multiple minima of
$e_{\zeta(i)}(x, \zeta)$ (i.e., ambiguity in matching) at intervals of $T/(B_i F)$ in
$\zeta$. When we use two baselines and add their SSD values, the resulting
$e_{\zeta(12)}(x, \zeta)$ will still be a periodic function of $\zeta$, but its period
$T_{12}$ is increased to

$$T_{12} = \mathrm{LCM}\Bigl(\frac{T}{B_1 F}, \frac{T}{B_2 F}\Bigr), \tag{20}$$

where LCM() denotes Least Common Multiple. That is, the period
of the expected value of the new evaluation function can be
made longer than that of the individual stereo pairs. Furthermore,
it can be controlled by choosing the baselines $B_1$ and $B_2$
appropriately so that the expected value of the evaluation function has
only one minimum within the search range. This means that using
multiple-baseline stereo pairs simultaneously can eliminate
ambiguity, although each individual baseline stereo may suffer from
ambiguity.
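
As a small worked illustration of the period rule, the sketch below computes the least common multiple of the individual periods $T/(B_i F)$ with exact rational arithmetic. It assumes the periods are rational numbers (otherwise no common multiple exists); the helper names and the example values of $T$ and $B_i F$ are hypothetical.

```python
from fractions import Fraction
from math import gcd

def lcm_int(a, b):
    """Least common multiple of two positive integers."""
    return a * b // gcd(a, b)

def lcm_fraction(p, q):
    """Smallest positive common multiple of two positive rationals.
    For fractions in lowest terms: lcm(numerators) / gcd(denominators)."""
    p, q = Fraction(p), Fraction(q)
    return Fraction(lcm_int(p.numerator, q.numerator),
                    gcd(p.denominator, q.denominator))

def combined_period(T, bfs):
    """Period in inverse distance of the summed SSD for pattern period T
    and a list of products B_i * F (all given as rationals)."""
    period = Fraction(T) / Fraction(bfs[0])
    for bf in bfs[1:]:
        period = lcm_fraction(period, Fraction(T) / Fraction(bf))
    return period

# Hypothetical pattern period T = 1, baselines 5b and 8b, normalised so that 8bF = 1.
print(combined_period(1, [Fraction(5, 8), Fraction(1)]))   # individual periods 8/5 and 1
# Baselines 4b and 8b: the periods 2 and 1 share the common multiple 2.
print(combined_period(1, [Fraction(1, 2), Fraction(1)]))
```

In this example the pair of baselines 5b and 8b stretches the period to 8, while 4b and 8b only reach 2, the period of the shorter baseline alone; baselines in simple integer ratios gain less.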

We illustrate this by using real stereo images. Figure 3(a) shows
an image of a sample scene. At the top of the scene there is a grid
board whose intensity function is nearly periodic. We took ten
images of this scene by shifting the camera vertically as in figure 4.
The actual distance between consecutive camera positions is 0.05
inches. Let this distance be $b$. Figure 3 shows the first and the last
images of the sequence. We selected a point within the repetitive
grid board area in image9. The SSD values $e_{\zeta(i)}(x, \zeta)$ over 5-by-5
pixel windows are plotted for various baseline stereo pairs in
figure 5. The horizontal axis of all the plots is the inverse distance,
normalized such that $8bF = 1$. Figure 5 illustrates the trade-off
between precision and ambiguity in terms of baselines. That is,
for a shorter baseline, there are fewer minima (i.e., less ambiguity),
but the SSD curve is flatter (i.e., less precise localization). On the
other hand, for a longer baseline, there are more minima (i.e., more
ambiguity), but the curve near the minimum is sharper; that is, the
estimated distance is more precise if we can find the correct one.

Now, let us take two stereo image pairs: one with $B = 5b$ and
the other with $B = 8b$. In figure 6, the dashed curve and the dotted
curve show the SSD for $B = 5b$ and $B = 8b$, respectively. Let
us suppose the search range goes from 0 to 20 in the horizontal
axis, which in this case corresponds to 12 to $\infty$ inches in distance.
Though the SSD values take a minimum at the correct answer near
$\zeta = 5$, there are also other minima for both cases. The solid curve
shows the evaluation function for the multiple-baseline stereo,
which is the sum of the dashed curve and the dotted curve. The
solid curve shows only one clear minimum; that is, the ambiguity
is resolved.

So far, we have considered using only two stereo pairs. We can
easily extend the idea to multiple-baseline stereo which uses more
than two stereo pairs. Corresponding to equation (20), the period
of $E[e_{\zeta(12\cdots n)}(x, \zeta)]$ becomes

$$T_{12\cdots n} = \mathrm{LCM}\Bigl(\frac{T}{B_1 F}, \frac{T}{B_2 F}, \ldots, \frac{T}{B_n F}\Bigr), \tag{21}$$

where $B_1, B_2, \ldots, B_n$ are baselines for each stereo pair.

We will demonstrate how the ambiguity can be further reduced
by increasing the number of stereo pairs. From the data of
figure 4, we first choose image1 and image9 as a long baseline
stereo pair, i.e. (1) $B = 8b$. Then, we increase the number of
stereo pairs by dividing the baseline between image1 and image9,
i.e. (2) $B = 4b$ and $8b$, (3) $B = 2b, 4b, 6b$ and $8b$, (4) $B = b, 2b,
3b, 4b, 5b, 6b, 7b$ and $8b$. Figure 7 demonstrates that the
SSSD-in-inverse-distance shows the minimum at the correct position more
clearly as more stereo pairs are used.

2.5  Precision

We  have  shown  that  ambiguities  can  be  resolved  by  using  the 
SSSD-in-inverse-distance  computed  from  multiple  baseline  stereo 
pairs.  The  technique  also  increases  precision  in  estimating  the  true 
inverse  distance.  We  can  show  this  by  analyzing  the  statistical 
characteristics  of  the  evaluation  functions  near  the  correct  match. 

From equations (3), (10), and (12), we have

$$e_{\zeta(i)}(x, \zeta) = \sum_{j \in W} \bigl(f(x+j) - f(x + B_i F(\zeta - \zeta_r) + j)
 + n_0(x+j) - n_i(x + B_i F \zeta + j)\bigr)^2. \tag{22}$$

By taking the Taylor expansion about $\zeta = \zeta_r$ up to the linear
terms, we obtain

$$f(x + B_i F(\zeta - \zeta_r) + j) \approx f(x+j) + B_i F(\zeta - \zeta_r)\, f'(x+j). \tag{23}$$

Substituting this into equation (22), we can approximate $e_{\zeta(i)}(x, \zeta)$
near $\zeta_r$ by a quadratic form of $\zeta$:

$$\begin{aligned}
e_{\zeta(i)}(x, \zeta)
&\approx \sum_{j \in W} \bigl(-B_i F(\zeta - \zeta_r)\, f'(x+j)
   + n_0(x+j) - n_i(x + B_i F \zeta + j)\bigr)^2 \\
&= B_i^2 F^2 a(x) (\zeta - \zeta_r)^2 + 2 B_i F\, b_i(x) (\zeta - \zeta_r) + c_i(x),
\end{aligned} \tag{24}$$

where

$$a(x) = \sum_{j \in W} \bigl(f'(x+j)\bigr)^2 \tag{25}$$

$$b_i(x) = \sum_{j \in W} f'(x+j) \bigl(n_i(x + B_i F \zeta + j) - n_0(x+j)\bigr) \tag{26}$$

$$c_i(x) = \sum_{j \in W} \bigl(n_i(x + B_i F \zeta + j) - n_0(x+j)\bigr)^2. \tag{27}$$

Figure 5: SSD values vs. inverse depth: (a) B = b; (b) B = 2b;
(c) B = 3b; (d) B = 4b; (e) B = 5b; (f) B = 6b; (g) B = 7b; (h)
B = 8b. The horizontal axis is normalized such that 8bF = 1.

Figure 6: Combining two stereo pairs with different baselines

Figure 7: Combining multiple baseline stereo pairs


The estimated inverse distance $\hat{\zeta}_{r(i)}$ is the value of $\zeta$ that makes
equation (24) minimum:

$$\hat{\zeta}_{r(i)} = \zeta_r - \frac{b_i(x)}{B_i F\, a(x)}. \tag{28}$$

Since $E[b_i(x)] = 0$, the expected value of the estimate $\hat{\zeta}_{r(i)}$ is the
correct value $\zeta_r$, but it varies due to the noise. The variance of this
estimate is:

$$\mathrm{Var}(\hat{\zeta}_{r(i)}) = \frac{\mathrm{Var}(b_i(x))}{B_i^2 F^2 (a(x))^2}
 = \frac{2 \sigma_n^2}{B_i^2 F^2 a(x)}. \tag{29}$$

Basically, this equation states that for the same amount of image
noise $\sigma_n^2$, the variance is smaller (the estimate is more precise) as
the baseline $B_i$ is longer, or as the variation of the intensity signal,
$a(x)$, is larger.


We can follow the same analysis for $e_{\zeta(12\cdots n)}(x, \zeta)$ of (14), the
new evaluation function with multiple baselines. Near $\zeta_r$, it is

$$e_{\zeta(12\cdots n)}(x, \zeta) \approx \Bigl(\sum_{i=1}^{n} B_i^2\Bigr) F^2 a(x) (\zeta - \zeta_r)^2
 + 2 F \Bigl(\sum_{i=1}^{n} B_i b_i(x)\Bigr) (\zeta - \zeta_r) + \sum_{i=1}^{n} c_i(x). \tag{30}$$

The variance of the estimated inverse distance $\hat{\zeta}_{r(12\cdots n)}$ that
minimizes this function is

$$\mathrm{Var}(\hat{\zeta}_{r(12\cdots n)}) = \frac{2 \sigma_n^2}{\bigl(\sum_{i=1}^{n} B_i^2\bigr) F^2 a(x)}. \tag{31}$$

From equations (29) and (31), we see that

$$\frac{1}{\mathrm{Var}(\hat{\zeta}_{r(12\cdots n)})} = \sum_{i=1}^{n} \frac{1}{\mathrm{Var}(\hat{\zeta}_{r(i)})}. \tag{32}$$

The inverse of the variance represents the precision of the estimate.
Therefore, equation (32) means that by using the
SSSD-in-inverse-distance with multiple baseline stereo pairs, the estimate
becomes more precise. We can confirm this characteristic in figures 6 and
7 by observing that the curve around the correct inverse distance
becomes sharper as more baselines are used.
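
The additivity of precision can be verified with a few lines of arithmetic. The sketch below assumes the variance expressions have the forms $2\sigma_n^2 / (B_i^2 F^2 a(x))$ for a single pair and $2\sigma_n^2 / ((\sum_i B_i^2) F^2 a(x))$ for the summed function; the function names and the numerical values of the baselines, $F$, $\sigma_n^2$, and $a(x)$ are made up for illustration.

```python
import math

def var_single(b, focal, sigma2, a):
    """Variance of the inverse-distance estimate from one pair with baseline b."""
    return 2.0 * sigma2 / (b * b * focal * focal * a)

def var_multi(baselines, focal, sigma2, a):
    """Variance of the estimate from the summed SSD over all baselines."""
    return 2.0 * sigma2 / (sum(b * b for b in baselines) * focal * focal * a)

# Hypothetical values: three baselines, unit focal length, noise variance 0.2, a(x) = 4.
baselines = [1.0, 2.0, 3.0]
focal, sigma2, a = 1.0, 0.2, 4.0

precision_sum = sum(1.0 / var_single(b, focal, sigma2, a) for b in baselines)
precision_multi = 1.0 / var_multi(baselines, focal, sigma2, a)
print(precision_sum, precision_multi)  # the two precisions agree
```

Under these assumptions the combined variance is also smaller than that of the longest single baseline, since every additional pair contributes a positive term to the precision.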

3  Experimental  Results 

This  section  presents  experimental  results  of  the  multiple-baseline 
stereo  based  on  SSSD-in-inverse-distance  with  real  2D  images.  A 
complete  description  of  the  algorithm  is  included  in  Appendix  B. 

The first result is for the "Town" data set that we showed in
figure 3. Figures 8 (a) and (b) are the distance map and its isometric
plot with a short baseline, B = 3b. The result with a single long
baseline, B = 9b, is shown in figure 9. Comparing these two
results, we observe that the distance map computed by using the
long baseline is smoother on flat surfaces, i.e., more precise, but
has gross errors in matching at the top of the scene because of
the repeated pattern. These results illustrate the trade-off between
ambiguity and precision. Figure 10, on the other hand, shows the
distance map and its isometric plot obtained by the new algorithm
using three different baselines, 3b, 6b, and 9b. For comparison,
the corresponding oblique view of the scene is shown in figure 11.
We can note that the computed distance map is less ambiguous
and more precise than those of the single-baseline stereo.

Figure 12 shows another data set used for our experiment.
Figures 13 and 14 compare the distance maps computed from the short
baseline stereo and the long baseline stereo: the longer baseline is
five times longer than the short one. For comparison, the actual
oblique view roughly corresponding to the isometric plot is shown
in figure 15. Though no repetitive patterns are apparent in the
images, we can still observe gross errors in the distance map obtained
with the long baseline due to false matching. In contrast, the result
from the multiple-baseline stereo shown in figure 16 demonstrates
both the advantage of unambiguous matching with a short baseline
and that of precise matching with a long baseline.

4  Conclusions

In this paper, we have presented a new stereo matching method
which uses multiple baseline stereo pairs. This method can
overcome the trade-off between precision and accuracy (avoidance of
false matches) in stereo. The method is rather straightforward: we
represent the SSD values for individual stereo pairs as a function
of the inverse distance, and add those functions. The resulting
function, the SSSD-in-inverse-distance, exhibits an unambiguous
and sharper minimum at the correct matching position. As a result
there is no need for search or sequential estimation procedures.

The key idea of the method is to relate SSD values to the inverse
distance rather than the disparity. In hindsight, this idea is
natural. Whereas disparity is a function of the baseline, there is
only one true (inverse) distance for each pixel position for all of
the stereo pairs. Therefore there must be a single minimum for
the SSD values when they are summed and plotted with respect
to the inverse distance. We have shown the advantage of the
proposed method in removing ambiguity and improving precision
by analytical and experimental results.
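A  SSSD-in-inverse-distance for Ambiguous Pattern

Proposition: Suppose that there are two and only two repetitions
of the same pattern around positions $x$ and $x + a$ where $a \neq 0$ is
a constant. That is, for $j \in W$,

$$f(x + j) = f(\xi + j), \quad \text{if and only if } \xi = x \text{ or } \xi = x + a. \tag{33}$$

Then, if $B_1 \neq B_2$, for $\forall \zeta$, $\zeta \neq \zeta_r$,

$$\begin{aligned}
E[e_{\zeta(12)}(x, \zeta)]
&= \sum_{j \in W} \bigl(f(x+j) - f(x + B_1 F(\zeta - \zeta_r) + j)\bigr)^2 \\
&\quad + \sum_{j \in W} \bigl(f(x+j) - f(x + B_2 F(\zeta - \zeta_r) + j)\bigr)^2 + 4 N_w \sigma_n^2 \\
&> 4 N_w \sigma_n^2 = E[e_{\zeta(12)}(x, \zeta_r)].
\end{aligned} \tag{34}$$

Proof: Tentatively suppose that for $\exists \zeta_f$, $\zeta_f \neq \zeta_r$,

$$\sum_{j \in W} \bigl(f(x+j) - f(x + B_1 F(\zeta_f - \zeta_r) + j)\bigr)^2
 + \sum_{j \in W} \bigl(f(x+j) - f(x + B_2 F(\zeta_f - \zeta_r) + j)\bigr)^2 = 0. \tag{35}$$

Then, it must be the case that

$$\begin{aligned}
f(x + j) &= f(x + a_1 + j) \\
f(x + j) &= f(x + a_2 + j),
\end{aligned} \tag{36}$$

for $j \in W$, where

$$a_1 = B_1 F(\zeta_f - \zeta_r), \qquad a_2 = B_2 F(\zeta_f - \zeta_r).$$

Since $B_1 \neq B_2$ and $\zeta_r \neq \zeta_f$,

$$a_1 \neq a_2. \tag{37}$$

So, we have

$$f(x + j) = f(\xi + j), \quad \text{for } \xi = x,\; x + a_1, \text{ or } x + a_2. \tag{38}$$

Since this contradicts assumption (33), equation (35) does not hold.
Its left hand side must be positive. Hence (34) holds.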


Figure 9: Result with a long baseline, B = 9b: (a) Distance map; (b) Isometric plot. The matching is less noisy when it is correct.
However, there are many gross mistakes, especially in the top of the image where, due to a repetitive pattern, the matching is completely
wrong.

Figure 12: "Coal mine" data set, long-baseline pair

Figure 13: Result with a short baseline: (a) Distance map; (b) Isometric plot of the distance map viewed from the lower left

Figure 14: Result with a long baseline

Figure 16: Multiple baselines: (a) Distance map; (b) Isometric plot

B  Multiple-Baseline Stereo Algorithm

We present a complete description of the stereo algorithm using
multiple-baseline stereo pairs. The task is, given $n$ stereo pairs,
find the $\zeta$ that minimizes the SSSD-in-inverse-distance function,

$$\mathrm{SSSD}(x, \zeta) = \sum_{i=1}^{n} \sum_{j \in W} \bigl(f_0(x+j) - f_i(x + B_i F \zeta + j)\bigr)^2. \tag{39}$$

We will perform this task in two steps: one at pixel resolution
by minimum detection and the other at sub-pixel resolution by
iterative estimation.

Minimum of SSSD at Pixel Resolution

For convenience, instead of using the inverse distance, we
normalize the disparity values of individual stereo pairs with different
baselines to the corresponding values for the largest baseline.
Suppose $B_1 < B_2 < \cdots < B_n$. We define the baseline ratio $R_i$ such
that

$$R_i = \frac{B_i}{B_n}. \tag{40}$$

Then,

$$B_i F \zeta = R_i B_n F \zeta = R_i d_{(n)}, \tag{41}$$

where $d_{(n)}$ is the disparity for the stereo pair with baseline $B_n$.
Substituting this into equation (39),

$$\mathrm{SSSD}(x, d_{(n)}) = \sum_{i=1}^{n} \sum_{j \in W} \bigl(f_0(x+j) - f_i(x + R_i d_{(n)} + j)\bigr)^2. \tag{42}$$

We compute the SSSD function for a range of disparity values
at the pixel resolution, and identify the disparity that gives the
minimum. Note that pixel resolution for the image pair with the
longest baseline ($B_n$) requires calculation of SSD values at
sub-pixel resolution for other shorter baseline stereo pairs.
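
The pixel-resolution step can be sketched for 1-D signals as follows. This is a minimal illustration, not the authors' implementation: it assumes linear interpolation for the sub-pixel samples needed by the shorter baselines, and the function names, the integer test pattern, and the search range are all hypothetical.

```python
import math

def sample(signal, pos):
    """Linearly interpolate a 1-D signal at a real-valued position (clamped at the ends)."""
    pos = min(max(pos, 0.0), len(signal) - 1.0)
    lo = int(math.floor(pos))
    hi = min(lo + 1, len(signal) - 1)
    t = pos - lo
    return (1.0 - t) * signal[lo] + t * signal[hi]

def sssd(f0, images, ratios, x, d_n, half_window=2):
    """Sum over stereo pairs and window of squared differences, with each pair's
    disparity scaled by its baseline ratio R_i = B_i / B_n."""
    total = 0.0
    for f_i, r_i in zip(images, ratios):
        for j in range(-half_window, half_window + 1):
            total += (f0[x + j] - sample(f_i, x + r_i * d_n + j)) ** 2
    return total

def best_disparity(f0, images, baselines, x, d_max, half_window=2):
    """Integer disparity (for the longest baseline) that minimises the SSSD."""
    b_n = max(baselines)
    ratios = [b / b_n for b in baselines]
    return min(range(d_max + 1),
               key=lambda d: sssd(f0, images, ratios, x, d, half_window))

# Hypothetical periodic scene (period 4) with true disparity 6 for the longest baseline.
pattern = [0, 3, 1, 2]
scene = lambda u: pattern[u % 4]
f0 = [scene(k) for k in range(40)]
f_short = [scene(k - 3) for k in range(40)]  # baseline 1 -> disparity 3
f_long = [scene(k - 6) for k in range(40)]   # baseline 2 -> disparity 6

print(best_disparity(f0, [f_long], [2.0], 15, 12))                  # long baseline alone
print(best_disparity(f0, [f_short, f_long], [1.0, 2.0], 15, 12))    # both baselines
```

In this toy scene the long baseline alone has equal minima at disparities 2, 6, and 10, so the search returns the first of them, a false match; adding the short baseline leaves only the true disparity 6.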


Iterative Estimation at Sub-pixel Resolution

Once we obtain disparity at pixel resolution for the longest baseline
stereo, we improve the disparity estimate to sub-pixel resolution
by an iterative algorithm presented in [12][17]. For this iterative
estimation, we use only the image pair $f_0(x)$ and $f_n(x)$ with the
longest baseline. This is due to a few reasons. First, since the
pixel-level estimate was obtained by using the
SSSD-in-inverse-distance, the ambiguity has been eliminated and only improvement
of precision is intended at this stage. Second, using only the
longest-baseline image pair reduces the computational requirement
for SSD calculation by a factor of $n$, and yet does not degrade
precision too significantly.

In the experiments shown in section 3, we used the following
algorithm for sub-pixel estimation: Let $\hat{d}_{0(n)}$ be the initial
disparity estimate obtained at pixel resolution. Then, a more precise
estimate is computed by calculating the following two quantities:

$$\Delta\hat{d}_{(n)} = \frac{\sum_{j \in W} \bigl(f_0(x+j) - f_n(x + \hat{d}_{0(n)} + j)\bigr)\, f_n'(x + \hat{d}_{0(n)} + j)}
 {\sum_{j \in W} \bigl(f_n'(x + \hat{d}_{0(n)} + j)\bigr)^2} \tag{43}$$

$$\sigma_{\Delta d}^2 = \frac{2 \sigma_n^2}{\sum_{j \in W} \bigl(f_n'(x + \hat{d}_{0(n)} + j)\bigr)^2}. \tag{44}$$

The value $\Delta\hat{d}_{(n)}$ is the estimate of the correction of the disparity
to further minimize the SSD, and $\sigma_{\Delta d}^2$ is its variance. We iterate
this procedure by replacing $\hat{d}_{0(n)}$ by

$$\hat{d}_{0(n)} \leftarrow \hat{d}_{0(n)} + \Delta\hat{d}_{(n)} \tag{45}$$

until the estimate converges or up to a certain maximum number
of iterations.
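
The iteration can be sketched in 1-D as below. This is an illustrative approximation rather than the algorithm of [12][17] itself: it assumes the correction is the ratio of the sum of residual times slope to the sum of squared slopes over the window, uses linear interpolation and a central difference for $f_n$ and $f_n'$ at non-integer positions, and all names, the convergence tolerance, and the test signal are hypothetical.

```python
import math

def sample(signal, pos):
    """Linearly interpolate a 1-D signal at a real-valued position (clamped at the ends)."""
    pos = min(max(pos, 0.0), len(signal) - 1.0)
    lo = int(math.floor(pos))
    hi = min(lo + 1, len(signal) - 1)
    t = pos - lo
    return (1.0 - t) * signal[lo] + t * signal[hi]

def derivative(signal, pos):
    """Central-difference estimate of the signal slope at a real-valued position."""
    return 0.5 * (sample(signal, pos + 1.0) - sample(signal, pos - 1.0))

def refine_disparity(f0, fn, x, d0, half_window=5, noise_var=0.0,
                     max_iter=20, tol=1e-4):
    """Iteratively correct the disparity d0 between f0 and fn at pixel x.
    Returns the refined disparity and the variance of the last correction."""
    d = float(d0)
    variance = float("inf")
    for _ in range(max_iter):
        num = 0.0
        den = 0.0
        for j in range(-half_window, half_window + 1):
            pos = x + d + j
            slope = derivative(fn, pos)
            num += (f0[x + j] - sample(fn, pos)) * slope
            den += slope * slope
        if den == 0.0:  # no intensity variation in the window: nothing to refine
            break
        delta = num / den
        variance = 2.0 * noise_var / den
        d += delta
        if abs(delta) < tol:
            break
    return d, variance

# Hypothetical smooth signal with a true disparity of 6.4 pixels; start from 6.
g = lambda u: math.sin(0.3 * u)
f0 = [g(k) for k in range(60)]
fn = [g(k - 6.4) for k in range(60)]
d_hat, var_hat = refine_disparity(f0, fn, x=21, d0=6)
print(d_hat, var_hat)
```

Starting from the pixel-level value 6, the estimate moves close to the true disparity 6.4; the small remaining offset comes from the interpolation used in the sketch.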
Part II: Experiments on Outdoor Scenes
T.  Nakahara  and  T.  Kanade 

In  “Part  I:  Theory,"  we  explained  how  multiple  stereo  pairs 
with  various  baselines  were  used  to  obtain  precise  depth  esti¬ 
mates  without  suffering  from  ambiguity.  The  algorithm  was 
tested  with  indoor  images  which  were  taken  under  well  con¬ 
trolled  conditions  in  the  Calibrated  Imaging  Laboratory. 

In this part, the algorithm is applied to outdoor scenes with variable
lighting conditions and a large depth range. While Okutomi and
Kanade used stereo pairs acquired by moving a camera
horizontally, we use stereo pairs taken by moving a camera in both
horizontal and vertical orientations. Taking stereo images with two
orthogonal baseline orientations removes ambiguity and
increases precision without suffering from the orientation of the
features in a scene. We also show that the shapes of the sum
of squared-difference (SSD) values near the estimate may
indicate the reliability of the match, and suggest a method to classify
matches into various types, such as good matches and
mismatches with occlusion or sparse features.

1.  Horizontal baselines Experiment

The experimental setup for acquiring stereo pairs is illustrated
in fig. 1. The images are acquired by moving a camera
horizontally. The distance between adjacent camera positions is constant.
Table 1 describes the image acquisition parameters. Typically,
the distance from the camera to the nearest object is 19 m and the
baseline length ranges from 19.05 mm for the closest camera pair
to 114.1 mm for the farthest.

As illustrated in fig. 2, first the input images are preprocessed
with a Laplacian of Gaussian (LOG) filter to reduce photometric
distortion. A 5x5 window is used for the Gaussian and a 3x3 window
is used for the Laplacian. Then the multiple-baseline stereo is used to
compute the inverse depth with a 9x9 window for SSD
computation. Typically, the number of the stereo pairs is 6, the image size
is 240x256, and the total disparity range is 9 pixels, as
summarized in table 2.

1.1.  Results 

We experiment with three data sets, “Shrubbery,” “Parking
meters,” and “Sand” for the horizontal experiments.

Fig. 1 Setup for horizontal baselines

1.1.1.  Shrubbery 

Fig. 4 shows the “Shrubbery” data set which consists of six
stereo pairs. The maximum disparity between the adjacent
images is around two pixels. Fig. 5 is a LOG preprocessing result
of one of the images. Fig. 6 is the isometric plot of the resultant
depth map. We observe that the shrubberies at the left and in the
center are well separated, and the depth jump around the sign
board and the top of the signpost are clearly distinct from the
wall. We can see a round shrubbery at the right and some pebbles
on the road. Some mismatches are observed at the curb, because
the features in this area are almost parallel to the epipolar line.

1.1.2.  Parking  meters 

This data set includes seven stereo pairs. Fig. 8 is the isometric
plot of the depth map. The following portions in the scene are
well estimated: the three parking meters in front of the
shrubberies, the side view of the sign board which is between the second
and the third parking meters, and the large depth gap between the
front and the back building. There are some mismatches at the
back door of the car because of sparse features in this part.

1.1.3.  Sand

This scene contains natural rough surfaces like sand and a rock
as shown in fig. 9. Five stereo pairs are used for this data set. Fig.
10 is the isometric plot of the depth map. We observe that the two
rocks and the sand are well estimated. Many mismatches,
however, occur at the border between the black wall and the white
curtain. The features in this portion are parallel to the epipolar
line and are low in density.

1.2.  Shapes  of  SSD  and  SSSD  Curves 

In  this  section  we  show  that  the  shapes  of  the  SSDs-in-inverse- 
depth  may  indicate  the  reliability  of  a  match  and  suggest  the 
cause  of  a  mismatch.  For  this  purpose,  we  examine  the  shapes  of 
the  SSD  and  the  sum  of  the  SSD  (SSSD)  in  three  typical  cases:  a 
good  match,  a  mismatch  with  occlusion,  and  a  mismatch  with 
sparse  features. 

First, we examine the shapes of the SSD and the SSSD for a
point whose depth is precisely and accurately estimated, such as
a point ® on the sand in fig. 9. Fig. 11 plots 12 curves of
individual SSDs and the resultant SSSD for this point. We observe that
the minimum of the SSD of each baseline takes place at the same
position and the curvature of the SSD near the minimum of the
SSSD becomes sharper as the baseline becomes longer. The
SSSD exhibits a unique and clear minimum at the correct
matching position.

Let us approximate the individual SSD curves by a quadratic
equation near the minimum position. From equations (22) - (29)
in “Part I: Theory,” we expect the following:

•  The  inverse  depth  at  which  the  SSD  values  take  the  mini¬ 
mum  is  expected  to  be  the  same  over  the  various  base¬ 
lines. 

•  The  curvature  is  proportional  to  the  square  of  the  baseline 
length. 

•  Variance  of  differences  between  the  inverse  depth  at  the 
minimum  position  of  each  SSD  and  the  final  estimated 
inverse  depth  is  inversely  proportional  to  the  square  of 
the  baseline  length. 

Fig.  12  (a),  (b),  and  (c)  show  the  above-mentioned  theoreti¬ 
cally  expected  values  and  experimental  measurements  for  the 
case  of  a  good  match  shown  in  fig.  11.  The  measurements  coin¬ 
cide  well  with  the  theoretical  values. 
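
The quadratic approximation can be made concrete with a three-point parabola fit around the sampled SSD minimum. The sketch below is only an illustration of the expected behaviour, under the assumption that the SSD near its minimum has the form $B^2 F^2 a (\zeta - \zeta_r)^2 + c$; the function name and all numerical values are hypothetical.

```python
import math

def fit_parabola(z_mid, step, e_minus, e_mid, e_plus):
    """Fit e = A (z - z0)^2 + e0 through three equally spaced samples taken at
    z_mid - step, z_mid, z_mid + step. Returns (A, z0)."""
    second_diff = e_minus - 2.0 * e_mid + e_plus
    curvature = second_diff / (2.0 * step * step)
    z0 = z_mid - step * (e_plus - e_minus) / (2.0 * second_diff)
    return curvature, z0

def model_ssd(z, bf, a, z_r, c=0.0):
    """Hypothetical noise-free SSD near its minimum for one pair with bf = B * F."""
    return bf * bf * a * (z - z_r) ** 2 + c

# Two baselines whose lengths differ by a factor of two, same underlying minimum.
a, z_r, step, z_mid = 3.0, 5.2, 0.5, 5.0
fits = []
for bf in (1.0, 2.0):
    samples = [model_ssd(z_mid + k * step, bf, a, z_r) for k in (-1, 0, 1)]
    fits.append(fit_parabola(z_mid, step, *samples))

(curv_1, min_1), (curv_2, min_2) = fits
print(curv_1, curv_2, min_1, min_2)
```

For this model the fitted minimum is the same for both baselines, and doubling the baseline multiplies the fitted curvature by four, matching the first two expectations listed above.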

Second, we look into the occlusion case, such as a point ® at
the right of the first parking meter head in fig. 7. The
corresponding points exist for the shorter baselines. As the baseline becomes
longer, however, occlusion occurs and matching is not possible.
The SSD and the SSSD for the point ® are shown in fig. 13 (a).
The inverse depth at the minimum of the SSD of each baseline
gradually shifts from the true position to a false position. The
SSSD does not show a clear minimum. As shown in fig. 13 (b),
(c), and (d), the theoretically expected values and the
measurements coincide where the baselines are short but differ greatly
where the baselines are long.

The third case is a point with sparse features like a point d) at
the black wall in fig. 9. As shown in fig. 13 (e), the SSD curve of
each baseline is almost flat over the inverse depth range with no
obvious minimum. Consequently the SSSD does not have a clear
minimum.


Another observation about the parts of a depth map with mismatches
or noisy measurements concerns the orientation of the
features in the scene. We cannot obtain good depth estimates near
the curb portion in “Shrubbery” or at the border between the black
wall and the curtain in “Sand,” because the image contains only
horizontal features there. The solution is to use additional stereo image
pairs taken by cameras aligned in the vertical direction. Combining
the information of the vertical baseline with the information of
the horizontal baseline is straightforward in the multiple-baseline
algorithm, because the algorithm simply adds SSD values expressed
in inverse depth rather than in disparity.

In the next section, we demonstrate the effectiveness of using both
horizontal and vertical baselines.

Fig. 2 Procedure

Fig. 3 Setup for horizontal and vertical baselines

2.  Horizontal and vertical baselines Experiment

Fig. 3 illustrates the experimental setup. The procedure is the
same as the one in the horizontal baselines experiment, except that
images are taken by moving a camera horizontally and vertically.
The acquisition parameters are shown in the last three rows
(“Shrubbery2,” “Corner,” and “Guide”) of table 1. The baseline
length ranges from 20 mm for the closest camera pair to 60 mm
for the farthest, which is somewhat shorter than in the horizontal
baselines case.

As shown in the last three rows in table 2, the number of the
stereo pairs is 3 for each baseline orientation and the total disparity
range is 6 pixels.

2.1.  Results 

2.1.1. “Corner” data set

Fig.  14  shows  the  data  set,  together  with  an  illustration  of  the 
arrangement  of  the  camera  and  the  objects. 

Fig. 16 is the isometric plot of the depth map using three stereo
pairs for each baseline orientation (six pairs in total). We observe
that the building wall, especially the slanting part of the wall at
the right, is well estimated. The curb is separated from the
shrubberies in the back and the road in the front, and the
distances between the curb and the shrubberies are visible.

For comparison, a depth map was computed using six stereo
pairs in the horizontal orientation only. Fig. 15 shows the result.
Many mismatches are observed at the wall and the curb, because the
main features of these portions are horizontal.

2.1.2.  SSD  and  SSSD  in  inverse  depth 

We examine the SSD and the SSSD of a point, such as a point
on the wall or at the curb, whose depth estimate is correct
using both horizontal and vertical baselines but incorrect
using only horizontal baselines. Fig. 17 (a) and (b)
show the SSD and the SSSD of the two points marked in fig. 16,
respectively. Although the SSDs of the horizontal baselines do not
show a clear minimum, the SSDs of the vertical baselines have
a unique minimum at the correct position and therefore determine
the correct minimum position of the total SSSD.
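
The effect described above can be illustrated with a small numerical sketch. The SSD values below are made-up numbers, not measurements from the experiment, and the function name `sssd_minimum` is hypothetical. The sketch only assumes that every stereo pair's SSD has been sampled on a common inverse-depth grid, so that the curves can be added sample by sample.

```python
def sssd_minimum(ssd_curves):
    """Sum SSD curves sampled on a common inverse-depth grid.

    Returns (index of the SSSD minimum, SSSD curve).
    """
    n = len(ssd_curves[0])
    if any(len(curve) != n for curve in ssd_curves):
        raise ValueError("all SSD curves must share the same inverse-depth grid")
    sssd = [sum(curve[i] for curve in ssd_curves) for i in range(n)]
    best = min(range(n), key=lambda i: sssd[i])
    return best, sssd


# Illustrative values only. The horizontal pairs look at horizontal
# features, so their SSD is almost flat; the vertical pairs have a
# clear minimum at grid index 3.
horizontal = [
    [5.0, 5.1, 4.9, 5.0, 5.1, 5.0],
    [5.2, 4.9, 5.0, 5.1, 5.0, 5.1],
]
vertical = [
    [9.0, 6.0, 3.0, 1.0, 4.0, 8.0],
    [8.0, 7.0, 4.0, 0.5, 3.0, 7.0],
]

h_only, _ = sssd_minimum(horizontal)            # decided by noise
both, _ = sssd_minimum(horizontal + vertical)   # decided by the vertical pairs
```

With these numbers the horizontal-only sum picks an arbitrary sample of a flat curve, while the combined sum has its minimum where the vertical pairs agree.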

3.  Comments 

Parallelism 

The computation of this algorithm is simple and local, which
makes it well suited to implementation on a massively parallel
machine. We are implementing the algorithm on a MasPar, a Single
Instruction Multiple Data (SIMD) machine with 4096 processors. At
present, the processing time on the MasPar is 0.9 second for
producing a depth map with a 240x256 image size and a disparity
range of 10 pixels, while the same computation takes 51 seconds
on a SUN 4/40 (16 MIPS).

It is possible to implement this algorithm in dedicated hardware
or even on a chip for a real-time depth sensor.

Classification  of  depth  measurements 

We showed that the shape of the SSD-in-inverse-depth near
the minimum of the SSSD may be useful for estimating the cause of
mismatches, as in the occlusion and sparse features cases. We
expect that mismatches caused by a terminal edge or a highlight
can be analyzed similarly.
Table 2 Image processing

Name            Number of      Image size    Disparity range
                stereo pairs
Town            3              256x240       4-14
Coal            5              256x240       30-40
Shrubbery       6              [illegible]   4-13
Parking meters  7              240x256       1-15
Sand            5              240x256       1-5
Shrubbery2      H:3 V:3        240x256       1-7
Corner          H:3 V:3        240x256       1-7
Guide           H:3 V:3        240x256       0-8

Table 1 Image acquisition

Name            Distance                  Baseline length      Camera        Focal
                nearest      farthest     unit      longest                  length
Town            0.51m        1.02m        1.27mm    11.43mm
Coal            [illegible]  [illegible]  7.62mm    38.10mm
Shrubbery       19m          [illegible]  19.05mm   114.10mm   SONY XC57     [illegible]
Parking meters  12m          34m          10.16mm   71.12mm    SONY XC57     50mm
Sand            6m           10m          254mm     1270mm     SONY SSC-D7   50mm
Shrubbery2      19m          28m          20.0mm    60.00mm    SONY XC57     50mm
Corner          19m          28m          20.0mm    60.00mm    SONY XC57     50mm
Guide           16m          90m          20.0mm    60.00mm    SONY XC57     50mm
Fig. 4 “Shrubbery” data: (a) 1st (leftmost), (b) 2nd, (e) 7th (rightmost)

Fig. 6 Isometric plot of depth (“Shrubbery”)

Fig. 7 “Parking meters” data

Fig. 8 Isometric plot of depth (“Parking meters”)

Fig. 9 “Sand” data

Fig. 10 Isometric plot of depth (“Sand”)

Points marked in the figures: point of occlusion, point of good match, point of sparse features
Fig. 11 SSD and SSSD values vs. inverse depth

Fig. 12 Inverse depth, variance of differences between the minimum
position of each SSD and the final estimate, and curvature of SSD
from individual stereo pair near the minimum of SSSD vs. baseline
(case of a good match): (a) Inverse depth, (b) Curvature of SSD,
(c) Variance of differences

Fig. 13 SSD and SSSD values vs. inverse depth, inverse depth, variance of differences between the minimum
position of each SSD and the final estimate, and curvature of SSD from individual stereo pair near the
minimum of SSSD vs. baseline (panel label: Occlusion)
Fig. 14 “Corner” data: (a) Uppermost, (b) Leftmost, (c) Rightmost
(diagram labels: TV camera, shrubbery)

Fig. 15 Isometric plot of depth resulting from
horizontal baselines

Fig. 16 Isometric plot of depth resulting from
horizontal and vertical baselines

Fig. 17 SSD and SSSD values vs. inverse depth
Abstract

In this paper we design a feature-based stereo matching
system. We propose a hierarchical grouping process
that groups line segments into more complex structures
that are easier to match. The hierarchy consists
of lines, vertices, edges and surfaces. Matching starts at
the highest level of the hierarchy (surfaces) and proceeds
to the lowest (lines). Higher level features are easier to
match, because they are fewer in number and more distinct
in form. These matches then constrain the matches
at lower levels. Perceptual and structural relations are
used to group matches into islands of certainty. A Truth
Maintenance System (TMS) is used to enforce grouping
constraints and eliminate inconsistent match groupings.
The TMS is also used for reasoning in the presence of
uncertainty and to carry out belief revisions necessitated by
additions, deletions and confirmations of hypotheses.

1  Introduction 

Stereo matching is the process of fusing two images taken
from different viewpoints to recover depth information
in the scene. The process involves identifying corresponding
features in two views and using their relative
displacements together with camera geometry to estimate
their depth. Line segments have gained popularity
as features to be matched across scenes. The epipolar
constraint [Barnard and Fischler, 1982] is used to restrict
the number of line matches: lines that can be
matched must span the same set of epipolar lines. However,
much ambiguity remains, as a line in one scene
can match several lines in the other scene in spite of
the epipolar constraint. It is therefore necessary to consider
a more global context to disambiguate line matches.
Medioni and Nevatia [1985] use a minimum-differential-disparity
criterion for disambiguation. Line matches
that are adjacent and have similar disparity support each
other in a relaxation framework. Ayache and Faverjon
[1987] cluster adjacent matches with similar disparity.
Both [Medioni and Nevatia, 1985] and [Ayache and
Faverjon, 1987] are based on the idea that adjacent line
segments should have similar disparity. This is, however,
a weak constraint because adjacent lines in the
scene need not have similar depth. This constraint favors
frontoparallel surfaces [Hoff and Ahuja, 1989]. In
this paper, we attempt to derive stronger constraints
based on the topology of the scene to augment this weak
constraint. Horaud and Skordas [1989] reduce matching
ambiguity by also considering feature relations between
line segments (collinear-with, same-junction-as, left-of,
etc.) in the matching process. They construct an association
graph [Ballard and Brown, 1982], whose nodes
correspond to line matches and whose arcs are compatibility
relations. Maximal cliques are extracted from this graph
using the algorithm of Bolles and Cain [1982] and evaluated
with a goodness measure to choose the best one.
This procedure, however, is of exponential complexity.
In this paper, we describe a polynomial time algorithm
that forms local cliques instead of global ones. Local
cliques are sufficient for disambiguation because of the
hierarchical grouping process that we use.

The basic idea in our approach is to extract as much
structure as possible from each scene before matching.
This includes both topological and perceptual structure.
We use a hierarchical grouping process to group line segments
into complex structures. Relations between the
structures are also computed. The hierarchy consists
of lines, vertices, surface edges (edges, for short) and
surfaces. The relations include parallel, colinear, left-of
and right-of. The objective is to match the hierarchical
relational graphs extracted in both scenes. Matching
starts at the highest level (surfaces) and proceeds to the
lowest (lines). At each level matches are grouped into
strong clusters based on structural and perceptual relations.
These groupings are then confirmed. Lim and
Binford [Los Angeles CA 1987, Cambridge MA 1988]
also match a hierarchy of features in their stereo system.
Their hierarchy consists of edgels, curves, junctions,
surfaces and bodies. But only those features that
are grouped into bodies are matched, and no ambiguity
is assumed during matching. However, not all features
can be grouped into higher level structures. For
example, roads are characterized by parallel lines which
cannot be joined at a junction. Also, our experience
with several images shows that ambiguity is a serious
problem, in spite of using a feature hierarchy. To deal
with ambiguity as to the existence of both the features
and their matches, we treat them as hypotheses whose
beliefs can be later revised (to true or false) as more evidence
accumulates. When these hypotheses are grouped,
they form contexts. Constraints are specified by the user
as nogoods. These constraints identify certain contexts
as contradictory. A Truth Maintenance System (TMS)
manages this search space made up of hypotheses, their
groupings (contexts) and constraints. The TMS ensures
that the final solution contains no contradictions. Chung
and Nevatia [1991] also match a similar hierarchy. A constraint
satisfaction network [Mohan and Nevatia, 1989]
is used to deal with ambiguities. However, this relaxation
approach is only an approximation technique and there
is a danger of getting stuck in local minima.

Both numerical and symbolic approaches are known
in the literature to deal with uncertainty. For a review,
see [Bhatnagar and Kanal, 1988]. Symbolic approaches
include TMSs [Doyle, 1979, de Kleer, 1986,
Martins, 1991]. In TMSs, uncertainty is dealt with by
making assumptions (hypotheses) whose beliefs can be
later revised as new evidence accumulates. Unlike numerical
methods, symbolic methods cannot represent
degrees of conflict (but see [Laskey and Lehner, 1989
90], which integrates Dempster-Shafer formalism into a
TMS), but they permit explicit reasoning about the assumptions
that actually led to the conflict (conflict resolution).
Since various constraints, like the uniqueness
constraint, ordering constraint, etc., are strictly (and not
to a degree) enforced in most stereo algorithms (including
the one presented in this paper), there is more of a
need for an approach that can reason about the conflicts
produced by these constraints than one that can deal
with degrees of conflict. So we choose to use a TMS to
deal with uncertainty.

Constraint satisfaction techniques like backtracking
[Grimson, 1990b, Grimson, 1990a], discrete relaxation
[Mackworth, 1977, Mackworth and Freuder,
1985, Medioni and Nevatia, 1984] or continuous relaxation
[Mohan and Nevatia, 1989, Price, 1985] can
also be used to sift through the space of competing
and supporting matches. The advantage in using a
TMS for the image matching problem is that binary,
ternary or higher order constraints are easily incorporated.
Also, the problem solver has more control of the
path to solution and there is a possibility of avoiding
local minima. Further, no backtracking is involved in
the assumption-based TMS (ATMS) [Inf, 1987] we use
in our system. Instead we do a simultaneous search in
a context lattice. The maladies associated with backtracking
have been well documented [de Kleer, 1986,
Mackworth, 1977]. In particular, inconsistencies are repeatedly
rediscovered, as there is no concept of remembering
mistakes.

2  A  Feature  Hierarchy 

The proposed hierarchy, shown in Figure 1, consists of
lines (straight line segments), vertices (junctions of line
segments), edges (collections of colinear lines with vertex
terminations) and surfaces (contiguous sets of edges, i.e.
edge-rings). Surfaces can be open or closed. Currently
objects are not included in our hierarchy, mainly because
in most of the images that we encountered, surfaces were
the most complex structures visible. But the extension
is straightforward. The domain of applicability is any
scene whose contours can be approximated with a series
of straight line segments (they need not lie on a plane).
This includes urban, indoor and factory scenes.


First, edgels (edge pixels) in the image are extracted
using the Canny edge operator [Canny, 1986]. A modified
connected-components algorithm [Venkateswar and
Chellappa, 1990] is used to link the edgels into straight
line segments. Perceptual relations between the straight
line segments (lines, for short) are then determined. For
any line, only those lines within a local neighborhood
(typically 5% of the width of the picture) are tested for
perceptual relations. These include parallel, colinear and
proximate relationships. For each line, only the closest
parallel, colinear and proximate line relationships (one
on each side) are retained. Bucketing techniques [Ayache
and Faverjon, 1987, Knuth, 1973] are used to reduce the
number of comparisons necessary to compute these relations
between lines. Two lines whose end points are close
to each other are possibly connected in the scene. A vertex
is hypothesized to join the two lines in an L-junction.
The position of the vertex is the intersection that would be
obtained if the lines were extended. Edges are sets of
colinear line segments with vertex terminations. They
represent surface boundaries in the scene that are fragmented
into lines (because of poor contrast, etc.). To
extract edges, if two vertices in the scene have lines
that are colinear and face each other, an edge is hypothesized
to join the vertices. This procedure generates several
spurious edges, because of accidental alignment of
vertices. To eliminate such edges, all lines that overlap
each edge are accumulated. Then the percentage
overlap of each edge with these lines is calculated. This
gives the line segment support for each edge. All the
edges with insufficient line segment support (less than
90%) are deleted. A few spurious edges may still remain.
However, it is unlikely that such edges will have
a corresponding match in the other scene. A surface
(edge-ring) is an ordered set of connected edges (at least
two) that cannot be extended any further.
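
The line segment support test described above can be sketched as follows. This is a minimal illustration, not the authors' implementation: the function names are hypothetical, the supporting lines are assumed to have already been found colinear with the edge, and support is taken to be the fraction of the edge length covered by the union of the projected line segments.

```python
import math


def edge_support(p, q, lines):
    """Fraction of the edge p->q covered by the projections of `lines`.

    Each line is ((x1, y1), (x2, y2)). Only the 1-D overlap along the
    edge is computed; colinearity is assumed to be checked elsewhere.
    """
    ex, ey = q[0] - p[0], q[1] - p[1]
    length = math.hypot(ex, ey)
    if length == 0:
        return 0.0
    intervals = []
    for a, b in lines:
        # Project both end points onto the edge; t runs from 0 to length.
        ta = ((a[0] - p[0]) * ex + (a[1] - p[1]) * ey) / length
        tb = ((b[0] - p[0]) * ex + (b[1] - p[1]) * ey) / length
        lo, hi = max(0.0, min(ta, tb)), min(length, max(ta, tb))
        if hi > lo:
            intervals.append((lo, hi))
    # Merge overlapping intervals and add up the covered length.
    covered, end = 0.0, 0.0
    for lo, hi in sorted(intervals):
        lo = max(lo, end)
        if hi > lo:
            covered += hi - lo
            end = hi
    return covered / length


def has_sufficient_support(p, q, lines, threshold=0.9):
    """Keep an edge hypothesis only if at least 90% of it is covered."""
    return edge_support(p, q, lines) >= threshold
```

For an edge from (0, 0) to (10, 0), two fragments spanning 0 to 4 and 3 to 9.5 give a support of 0.95 and the edge is kept; fragments spanning 0 to 4 and 7 to 10 give 0.7 and the edge is deleted.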

The initial line data given to the system is noisy. The
end points are not well localized and some lines are fragmented
into segments. This is a direct result of the noisy
nature of real images and the fact that low level techniques,
including edge pixel detection and line extraction,
are imperfect. This leads to uncertainties when
grouping lines into higher level features. To deal with
these uncertainties, features in the hierarchy (other than
lines) are conjectured as hypotheses. These features may
be confirmed when more evidence becomes available,
including evidence of a matching feature in the other image.
The hypotheses and their groupings define a context
graph (a directed acyclic graph). This is illustrated
in Figure 2, which shows the context graph for a simple
stereo pair. The context graph is the search space of the
problem. Nodes in the graph correspond to collections
of hypotheses. The objective is to incrementally build
this graph and identify those nodes that correspond to
partial solutions.

3  Matches  and  Match  Groupings 

Figure 1: A feature hierarchy

Figure 2: Hypotheses and their groupings form a context
graph. (a) A stereo pair. (b) Context graph. v_A, v_B, v_C,
v_D, v_P, v_Q, v_R and v_S are vertex hypotheses.
e_AB, e_CD, e_PQ and e_RS are edge hypotheses.
(e_AB, e_PQ) and (e_AB, e_RS) are edge-match hypotheses.

The feature hierarchy together with feature relationships
forms a relational graph (or a semantic network). The
objective is to match the graphs derived from a stereo
pair. We proceed by finding matches for each feature and
grouping these matches into local cliques. The matching
is hierarchical: we start by matching at the surface level
and proceed to the line level. For two features to match,
their component features must match, recursively. For
example, two surfaces match only if their edges match (in
the same order), their edges match only if their vertices
match, and so on. At each level those features that were
not matched at previous levels (as component features)
participate in the matching process. Below we detail
the unary constraints used before hypothesizing a match
between features.

epipolar constraint: For each feature in one image,
we search for a matching feature in the other image,
within the epipolar lines spanning the feature.
To account for uncertainty in the positions of the
features, we tolerate a deviation of about 4 pixels
perpendicular to the epipolar lines. Lines are sometimes
fragmented, so a small overlap (25%) of epipolar
extents is considered sufficient for matching.

area and length constraints: We expect the areas of
matching surfaces to be within 75% of each other.
The lengths of matching edges must be within 75%
of each other and those of lines must be within 25%
(to account for their fragmentation).

orientation constraint: Since the two views in a
stereo pair are not very far apart we expect the orientations
of features to be similar [Arnold and Binford,
1980]. So we bound the orientation difference
between matching features at 30°. Similar bounds
have been used in other work [Medioni and Nevatia,
1985, Horaud and Skordas, 1989]. For two vertices
that satisfy the epipolar constraint, the orientation
constraint applies to their matching lines. Further,
the matched lines must be in the same relative order
(clockwise or anti-clockwise).

contrast constraint: For two linear features (lines or
edges) to match, their contrast must be similar.
Each line and edge marks a boundary between a
dark and a bright region. If the bright region is on
the same side of both linear features (to the left or
right) then they can be matched.
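
The unary constraints for lines can be sketched as a single filter. The sketch below rests on several assumptions that are not stated in the text: the images are taken to be rectified so that epipolar lines are image rows, "within 25% of each other" is read as a shorter-to-longer length ratio of at least 0.25, the epipolar overlap is measured relative to the shorter row extent, and each line carries a hypothetical `bright_side` field. All names are illustrative.

```python
import math
from dataclasses import dataclass


@dataclass
class Line:
    x1: float
    y1: float
    x2: float
    y2: float
    bright_side: str  # "left" or "right" (hypothetical encoding)

    @property
    def length(self):
        return math.hypot(self.x2 - self.x1, self.y2 - self.y1)

    @property
    def angle(self):
        # Undirected orientation in [0, 180) degrees.
        return math.degrees(math.atan2(self.y2 - self.y1, self.x2 - self.x1)) % 180.0


def epipolar_ok(a, b, tol=4.0, min_overlap=0.25):
    """Row extents must overlap; `a` is widened by the pixel tolerance."""
    a_lo, a_hi = min(a.y1, a.y2) - tol, max(a.y1, a.y2) + tol
    b_lo, b_hi = min(b.y1, b.y2), max(b.y1, b.y2)
    shared = min(a_hi, b_hi) - max(a_lo, b_lo)
    if shared < 0:
        return False
    shorter = min(a_hi - a_lo, b_hi - b_lo)
    return shorter == 0 or shared / shorter >= min_overlap


def length_ok(a, b, min_ratio=0.25):
    longer = max(a.length, b.length)
    return longer > 0 and min(a.length, b.length) / longer >= min_ratio


def orientation_ok(a, b, max_diff=30.0):
    d = abs(a.angle - b.angle)
    return min(d, 180.0 - d) <= max_diff


def may_match(a, b):
    """True if a line match hypothesis (a, b) passes all unary constraints."""
    return (epipolar_ok(a, b) and length_ok(a, b)
            and orientation_ok(a, b) and a.bright_side == b.bright_side)
```

A pair that passes `may_match` is only a match hypothesis; the grouping constraints described later decide whether it survives.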

For each feature in one image there may be several
match hypotheses in spite of the unary constraints detailed
above. A more global context is needed to disambiguate
between competing matches. Ambiguity is least
at the surface level, where features have the most structure,
and greatest at the line level, where typically
a line has several possible matches. So we start by
matching at the surface level. To reduce ambiguity even
further, we group compatible matches into local cliques.
Compatibility is based on consistent structural and perceptual
relations across the scenes (Figure 3). Structural
groupings are based on unmatched features at the previous
level of the hierarchy. These unmatched features are
used as foci-of-attention when matching. For example,
consider the groupings when matching at the line level. Figure
3 shows a typical case. Vertex v_AB has no match in
the other image. But its lines l_A and l_B match l_P and l_Q
respectively. Further, these lines when extended intersect
on the epipolar line of v_AB. This is consistent with
the presence of a vertex at this location. This vertex was
not detected earlier during the feature grouping process
because the lines l_P and l_Q were not closer than a threshold.
Unmatched vertices serve as foci-of-attention when
matching at the line level. The system focuses on these
vertices and creates structural groupings. The matches
(l_A, l_P) and (l_B, l_Q) together with vertex v_AB form a
structural grouping. Perceptual groupings are based on
consistent perceptual relations [Lowe, 1987] across the
scenes (in the case of surfaces and vertices, perceptual relations
are based on relations of their constituent edges
and lines respectively). These include parallel, colinear
and proximate relations. In Figure 3, l_C and l_D as well
as their matches l_R and l_S are connected through the
parallel relation. So the match pairs (l_C, l_R) and (l_D, l_S)
form a parallel perceptual grouping.


Those groupings that share a match are formed into
super-groups. Groupings are currently restricted to
a size of three consistent matches. These groupings
correspond to partial solutions to the matching problem.
So our search graph is only three levels deep.
This reduces matching complexity. We have found
that these groupings (together with the hierarchical feature
grouping process) are usually sufficient to completely
disambiguate between competing matches, i.e.
it is hard to find consistent groupings for an erroneous
match hypothesis. At each level structural groupings
are confirmed first, followed by perceptual groupings.
Among perceptual groupings, proximate groupings
are confirmed last. Proximate groupings are those
matches that have features that are proximate in both
scenes and have a similar disparity. However, the
constraint that adjacent features in the scene should
have similar disparity is a weak constraint, so these
groupings get the least priority. Other researchers
have chosen to use the similar disparity criterion as
their primary constraint [Medioni and Nevatia, 1985,
Ayache and Faverjon, 1987]. Larger groupings are confirmed
before smaller groupings (groups of three first,
followed by groups of two and then ungrouped matches)
because they have a more global context. Within each
size the confirmation is completely random.
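
The confirmation order described above can be sketched as a sort. The text does not say how the kind of a grouping and its size interact, so the sketch assumes that kind takes precedence and size breaks ties. The dictionary layout and the names `KIND_PRIORITY` and `confirmation_order` are hypothetical.

```python
import random

# Lower number = confirmed earlier. Structural first, proximate last
# among the perceptual groupings, ungrouped matches at the very end.
KIND_PRIORITY = {
    "structural": 0,
    "parallel": 1,
    "colinear": 1,
    "proximate": 2,
    "ungrouped": 3,
}


def confirmation_order(groupings, rng=None):
    """Return groupings in the order in which they would be confirmed.

    Each grouping is a dict with a "kind" and a list of "matches".
    Groupings with equal priority and size end up in random order,
    because the list is shuffled before a stable sort.
    """
    rng = rng or random.Random()
    shuffled = list(groupings)
    rng.shuffle(shuffled)
    return sorted(
        shuffled,
        key=lambda g: (KIND_PRIORITY[g["kind"]], -len(g["matches"])),
    )
```

Swapping the two elements of the sort key would give the other plausible reading, in which all groups of three are confirmed before any group of two.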

When a grouping of match hypotheses is confirmed the
beliefs of several hypotheses are affected and must be
revised. The process of belief revision is handled by the
TMS. First, all underlying hypotheses (both feature and
match hypotheses) of the confirmed grouping must be
converted to truths. This is done by collapsing the portion
of the context graph that supports the grouping into
the root context. These root hypotheses then are made
visible in all contexts where they are grouped with the
other hypotheses. This may lead to some contradictory
groupings. The contexts corresponding to such groupings
are eliminated by the TMS. The strategy of confirming
partial solutions and eliminating contradictory
contexts has the effect of preventing unbridled growth
in the size of the context graph. For example, consider
the context graph shown in Figure 2. Suppose the
match hypothesis (single grouping) (e_AB, e_PQ) is confirmed.
Then the following actions take place.

1. The portion of the context graph supporting this
grouping is collapsed into the root. The result is
shown in Figure 4(a). The root hypotheses v_A, v_B,
v_P, v_Q, e_AB, e_PQ and (e_AB, e_PQ) are now visible
in all contexts by inheritance (but for clarity are
not shown explicitly in the non-root contexts of the
figure).

2. This creates two contradictory contexts. The context
with match hypothesis (e_AB, e_RS) is invalid because
it violates the match (e_AB, e_PQ) (uniqueness
constraint, see Section 4). Similarly, the context
with edge hypothesis e_CD is invalid because this
edge intersects the true edge e_AB. Both contexts are
eliminated, resulting in the context graph shown in
Figure 4(b).

The  effect  is  that  when  a  grouping  is  confirmed  those 
hypotheses  that  violate  its  elements  (either  feature  or 
match)  are  eliminated.  In  the  final  set  of  matches  in 
the  root  context  there  will  be  no  contradictions.  These 
matches  correspond  to  a  solution  to  the  stereo  matching 
problem. 
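
The confirm-and-eliminate step walked through above can be sketched with plain sets. This is a deliberate simplification and not the ATMS used in the system: contexts are flat sets of hypothesis labels, the context lattice is reduced to a list, and contradictions are given as forbidden pairs. The function name `confirm` and the label strings are hypothetical.

```python
def confirm(root, contexts, grouping, conflicts):
    """Assert `grouping` into the root and drop contradicted contexts.

    root      -- set of hypotheses already believed true
    contexts  -- list of frozensets of still-uncertain hypotheses
    grouping  -- frozenset of hypotheses being confirmed
    conflicts -- set of frozenset pairs that may not hold together
    """
    new_root = set(root) | grouping
    survivors = []
    for ctx in contexts:
        remaining = ctx - new_root      # confirmed hypotheses are inherited
        if not remaining:
            continue                    # nothing uncertain is left here
        visible = new_root | remaining  # root truths are visible everywhere
        if any(pair <= visible for pair in conflicts):
            continue                    # contradictory context is eliminated
        survivors.append(frozenset(remaining))
    return new_root, survivors


# The example of Figure 2, with labels chosen for this sketch.
match_pq = "m(e_AB,e_PQ)"
match_rs = "m(e_AB,e_RS)"
grouping = frozenset({"v_A", "v_B", "v_P", "v_Q", "e_AB", "e_PQ", match_pq})
contexts = [
    grouping,
    frozenset({"v_C", "v_D", "e_CD"}),
    frozenset({"v_R", "v_S", "e_RS"}),
    frozenset({"v_A", "v_B", "v_R", "v_S", "e_AB", "e_RS", match_rs}),
]
conflicts = {
    frozenset({match_pq, match_rs}),  # uniqueness constraint
    frozenset({"e_AB", "e_CD"}),      # intersecting edges
}

root, remaining_contexts = confirm(set(), contexts, grouping, conflicts)
```

After the call, the root holds the confirmed hypotheses, the contexts containing e_CD and (e_AB, e_RS) are gone, and only the context with the still-unconfirmed edge e_RS remains.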
Figure 4: Confirming a context.


As described earlier, matching is defined recursively,
so that when two features at a level are matched, so are
their constituent features at the lower levels. Matching
at one level of the hierarchy affects matching at
lower levels. For example, consider the effect of surface
matches on line matches. When two surfaces are
matched, the line segments constituting their boundaries
are also matched. When the surface match is confirmed
so are the corresponding line matches. This affects the
matching at the line level in three ways. First, there
are fewer lines left to be matched at the line level. Second,
any new line match that violates confirmed line
matches is eliminated by the TMS. Third, the confirmed
line matches serve as seeds for match groupings at the
line level.

4  Grouping  Constraints 

Two classes of groupings result during the grouping and
confirming procedure described in the previous section.
Direct groupings are those that result when hypotheses
are grouped directly based on structural and perceptual
relations. Indirect groupings are those that result when
contexts are confirmed. The underlying hypotheses of
confirmed contexts are asserted into the root and are
inherited down the context hierarchy into all the contexts.
These hypotheses are then grouped with the other
hypotheses in these contexts.

Certain groupings of features or matches are contradictory.
Our objective is to eliminate all contradictory
groupings (whether direct or indirect). As an example of
a contradictory grouping, consider the intersecting edge
hypotheses e_AB and e_CD shown in Figure 2. Physical
edges in the scene cannot intersect, so these edge hypotheses
are mutually contradictory. Of course, this is
not to say that two edges in a scene cannot cross each
other when viewed from some viewpoint. But if they did,
part of one edge would be occluded by the surface formed
by the other and this edge would not be detected by the
feature grouping process. Any context with a grouping of
intersecting edge hypotheses is therefore a nogood. Such
nogood contexts are specified to the system explicitly using
nogood pattern combinations (nogoods, for short).
These are sets of patterns that identify contradictory
groupings of hypotheses. This is an elegant way of specifying
constraints. Any context with propositions that
match all the patterns of a nogood is eliminated by the
TMS. The effect is that in the final solution there are no
contradictions. Below we illustrate some of the grouping
constraints used in the system.

uniqueness constraint: Any feature can match at
most one feature. This follows from the fact that a
point in the scene projects onto a single
point in each view. The uniqueness constraint for edge
matches is illustrated in Figure 5. This is specified as
the following nogood.


(nogood
  (edge-match ?ex ?ey)   ;; if ?ex matches ?ey
  (edge-match ?ex ?ez))  ;; and also matches ?ez


This  constraint  needs  to  be  modified  for  lines,  which  tend 
to  be  fragmented.  In  this  case,  if  a  line  matches  more 
than  one  line,  we  expect  those  matches  to  be  colinear. 
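
A check equivalent in spirit to the uniqueness nogood can be sketched as follows. This is an illustration with hypothetical names, not the pattern matcher of the TMS. It tests the constraint in both directions, so a feature in either image may take part in at most one match, and it does not include the relaxation for fragmented lines mentioned above.

```python
from itertools import combinations


def violates_uniqueness(context):
    """True if some feature is matched to two different features.

    `context` is an iterable of (left_feature, right_feature) match
    hypotheses that are believed together.
    """
    for (a1, b1), (a2, b2) in combinations(set(context), 2):
        # Exactly one side shared means one feature has two partners.
        if (a1 == a2) != (b1 == b2):
            return True
    return False
```

For example, a context holding both (e_A, e_P) and (e_A, e_R) is rejected, while one holding (e_A, e_P) and (e_B, e_Q) is accepted.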


Figure 5: Uniqueness and ordering constraints for edges:
The edge match hypotheses (e_A, e_P) and (e_A, e_R) cannot
be grouped because e_A can match only one edge (uniqueness
constraint). (e_A, e_R) and (e_B, e_Q) are inconsistent
because e_A is to the left-of e_B, but e_R is to the right-of
e_Q. This violates the ordering constraint.

ordering  constraint:  This  constraint  was  originally 
used  by  Baker  and  Binford  [Baker  and  Binford,  Palo 
Alto  CA  1982]  and  precludes  any  order  reversal  when 
matching  along  an  epipolar  line  (Figure  5).  It  is  satisfied 
in  most  natural  scenes.  Wires  and  overhanging  surfaces 
may, however, create problems [Baker and Binford, Palo
Alto CA 1982]. Most stereo systems enforce this constraint
strictly [Baker and Binford, Palo Alto CA 1982,
Hsieh  et  al.,  1990,  Horaud  and  Skordas,  1989].  The 
advantage  is  that  the  constraint  reduces  the  search 
space  drastically  at  the  expense  of  missing  a  few  correct 
matches. 
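A minimal sketch of the ordering test, assuming each match hypothesis is summarised by the positions of its two features along a common epipolar line (a hypothetical representation chosen for this example). Two matches violate the constraint when their left-to-right order differs between the two images.

```python
def violates_ordering(match_1, match_2):
    """Return True if two matches reverse order along an epipolar line.

    Each match is (position_in_left_image, position_in_right_image);
    these scalar positions are an assumption for the illustration.
    """
    (left_1, right_1), (left_2, right_2) = match_1, match_2
    # Opposite signs mean the order in one image is reversed in the other.
    return (left_1 - left_2) * (right_1 - right_2) < 0


same_order = violates_ordering((10.0, 12.0), (20.0, 15.0))
reversed_order = violates_ordering((10.0, 30.0), (20.0, 15.0))
```

In the first call both features keep their relative order, so the pair is allowed; in the second the order is reversed and the pair would be rejected.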

topological constraints: The topology of the feature
hierarchy  is  viewpoint  insensitive  and  we  do  not  expect 
it  to  change  between  scenes.  Topological  constraints 
rule  out  several  contexts.  As  one  example,  consider  the 
matches  {ex,  eg)  and  (eg,  eg)  in  Figure  6.  The  matches 
are  connected  in  one  scene  and  disconnected  in  the  other. 

So  their  grouping  is  inconsistent.  Two  more  examples 
are  shown  in  the  same  figure.  The  last  two  examples 
show  ternary  constraints  involving  three  hypotheses. 

Figure 6: Topological constraints: The edge match
hypotheses (evi,ep) and (eg, eg) are incompatible because
they are connected in one image and not in the
other. Similarly, the edge match hypothesis (ec,eg) is
incompatible with the edge hypotheses eg and es because
these edge hypotheses do not match and we expect
topology to be preserved across scenes. The vertex
hypothesis vxb and the match hypotheses {lx,^p) and
(^B,^(?) are incompatible because Ip and Ig do not
intersect on the epipolar line of vxb when extended.

The constraints described in this section are binary (in
some cases ternary) constraints [Grimson, 1990a] that
prohibit certain groupings of matches. The epipolar,
orientation, length, area and contrast constraints described
in Section 3 are unary constraints that help in reducing
the number of matches that are hypothesized and later
subjected to the grouping process.

4.1  Matching  Complexity 

In  this  section  we  do  an  approximate  analysis  of  the 
matching  complexity.  We  start  by  analyzing  the  com¬ 
plexity  at  one  level  of  the  hierarchy  and  later  analyze 
the multi-level case. Suppose there are $N$ features at a
particular level of the hierarchy and $E_l$ epipolar lines in
each image. The number of features per epipolar line
is $N/E_l$. Features are matched across epipolar lines. So
the number of feature pairs to be tested for matching
across each epipolar line is $N^2/E_l^2$. Since there are $E_l$ of
these epipolar lines, the number of feature pairs to be
tested for the entire image is

$$\frac{N^2}{E_l} \qquad (1)$$

Suppose a fraction $f$ of these satisfy the orientation and
contrast constraints. Then the final number of matches
will be $M = f N^2 / E_l$.

We now group these match hypotheses based on structural
and perceptual relations. For simplicity of analysis,
assume that there are $M/3$ groupings of size 3. Binary
and ternary constraints are evaluated for each grouping.
The complexity of these constraint evaluations is
proportional to

$$\frac{M}{3} = \frac{f N^2}{3 E_l} \qquad (2)$$

These  groupings  are  confirmed  at  random.  When  a 
grouping  is  confirmed  all  matches  comprising  the  group¬ 
ing  are  converted  to  truths  and  asserted  into  the  root 
context.  These  matches  are  then  inherited  down  the 
context  tree  and  they  must  be  compared  with  the  other 
matches outside the root context to check for violations
of grouping constraints. The first confirmation leaves
$(M - 3)$ match hypotheses outside the root context, the
second $(M - 6)$ and so on, till all groupings are confirmed
(assume for simplicity that no grouping is eliminated
because of constraint evaluations). The complexity of the
confirmation process is proportional to

$$(M-3) + (M-6) + (M-9) + \cdots = \frac{M^2}{6} - \frac{M}{2} \qquad (3)$$

From  (1),  (2)  and  (3)  it  can  be  seen  that  the  time  for 
matching  is  essentially  decided  by  (3),  the  confirmation 
process.  This  was  also  empirically  observed. 
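The counts in this analysis can be checked with a few lines of arithmetic. The sketch below evaluates the pair count of expression (1) and sums the confirmation series of expression (3) term by term; for $M$ a multiple of 3 the series equals $M^2/6 - M/2$, which is why the confirmation step dominates. The numerical values of `N`, `E_l` and `f` are hypothetical and chosen only to make the numbers round.

```python
def feature_pairs(N, E_l):
    """Candidate feature pairs over the whole image, as in expression (1)."""
    return N * N / E_l


def confirmation_cost(M):
    """Sum (M-3) + (M-6) + ... down to zero, for M a multiple of 3."""
    return sum(M - 3 * k for k in range(1, M // 3 + 1))


# Hypothetical values, not taken from the experiments in the paper.
N, E_l, f = 600, 100, 0.5
pairs = feature_pairs(N, E_l)        # candidate pairs
M = int(f * pairs)                   # match hypotheses after unary tests
groupings = M // 3                   # groupings of size 3, expression (2)
cost = confirmation_cost(M)          # comparisons during confirmation
closed_form = M * M // 6 - M // 2    # M^2/6 - M/2
```

With these values the confirmation cost is several hundred times larger than the number of groupings, in line with the observation that matching time is decided by the confirmation process.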

The above analysis shows that the complexity of
matching at each level is $O(N^4)$, where $N$ is the number
of features at that level (the confirmation cost grows as
$M^2$ and $M$ is proportional to $N^2$). So if only lines were matched
(i.e. there was no hierarchy) the complexity is $O(l^4)$,
where $l$ is the number of lines in each frame. We now
examine how the hierarchical matching procedure reduces
the proportionality constant of this complexity (complexity
will still be proportional to $l^4$). Let the reduction
in number of feature hypotheses between consecutive levels
be $k_1$. Let the fraction of reduction in the number of
features (not hypotheses) between consecutive levels be
$k_2$. $k_1$ is smaller than $k_2$ ($k_1 < k_2$). The inequality is
because not all feature hypotheses may participate in the
feature grouping process. The table in Figure 7 assists in
computing the complexity reduction. The first column
shows the names of features. The second column shows
the number of features at each level, assuming there are
$l$ lines at the line level and a fractional reduction of $k_1$.
The third column shows the number of features that are
actually considered for matching. For example, consider
the $S_m$ surfaces matched at the surface level. These
correspond to edge matches, vertex matches and line matches.
These matched features will not be considered for matching
again. This leads to a reduction in the number of features
participating in matching at lower levels. The third column
shows the original number of features decremented by the features
already matched at previous levels. So the hierarchical
matching strategy reduces complexity by a factor of
$(L_m^4 + V_m^4 + E_m^4 + S_m^4)/l^4$, where the numerator sums the
computations performed at different levels of the hierarchy.
As a typical example, if $f = 0.6$ (60% of features at each
level matched), $k_1 = 0.4$ (each level has 0.4 times the
number of hypotheses at its lower level) and $k_2 = 0.5$
(fractional reduction in number of features across two
levels in the hierarchy) the complexity reduction factor
is 0.01. So the hierarchical matching procedure significantly
reduces matching complexity. Further, it also ensures
that the resultant matches are more likely to be correct. Both
factors are important.

complexity analysis

feature     no.         no. to be matched
lines       l           L_m = l - …
vertices    k_1 l       V_m = k_1 l - …
edges       k_1^2 l     E_m = k_1^2 l - …
surfaces    k_1^3 l     S_m = k_1^3 l

Figure 7: Table for computing matching complexity

5  Examples 

The  stereo  algorithm  is  illustrated  first  for  the  building 
scene  shown  in  Figure  8.  The  lines  and  vertices  (dark 
circles)  extracted  from  these  images  are  shown  in  Fig¬ 
ure  9.  The  vertices  are  grouped  into  edges  (shown  in 
Figure  10  as  dark  lines).  No  surfaces  derived  from  the 
images  match  exactly.  However  the  unmatched  surfaces 
are  useful  as  they  are  used  as  foci  of  attention  when 
matching.  Results  after  matching  edges  are  shown  in 
Figure  11.  The  vertices  and  lines  of  matched  edges  are 
also  matched.  These  matches  constrain  the  matches 
at  lower  levels.  For  example,  any  line  match  hypoth¬ 
esis  that  violates  the  ordering  constraint  with  respect  to 
these  confirmed  line  matches  will  be  negated.  Figure  12 
shows  the  feature  matches  after  matching  at  the  ver¬ 
tex  level.  To  demonstrate  the  use  of  focus  of  attention, 
we  show  the  results  after  matching  lines  by  focusing  on 
unmatched  vertices  (Figure  13).  The  final  result  after 
matching  lines  is  shown  in  Figure  14.  Figures  15  and  16 
show  labelled  vertex  and  line  matches.  Matching  vertices 
and  lines  in  these  figures  are  given  the  same  label. 

Next,  we  illustrate  the  matching  process  for  a  stereo 
pair  of  images  of  a  portion  of  LAX  airport  (Figure  17). 
Figures  18  and  19  show  the  feature  hierarchy.  Results 
of  matching  surfaces  are  shown  in  Figure  20.  Final  edge 
matches  are  shown  in  Figure  21.  Vertex  matches  are 
shown  in  Figure  22.  Figure  23  shows  the  final  matches 
at  the  line  level.  Figures  24  and  25  show  labelled  vertex 
and line matches.

Figure 8: Right view and left view of a building image.

Figure 9: Lines and vertices.

Figure 13: After matching lines by using unmatched vertex
features as foci of attention.

Figure 14: Final matches.

Figure 15: Vertex and line matches with vertex labels.
(a) Right image, (b) Left image.

Figure 16: Vertex and line matches with line labels.
(a) Right image, (b) Left image.

Figure 17: Right view and left view of a portion of LAX
airport.

Figure 18: Lines and vertices.

Figure 19: Edges (dark lines).

Figure 20: After matching at surface level.

Figure 21: Final matches at edge level.

Figure 22: After matching at vertex level.

Figure 24: Vertex and line matches with vertex labels.
(a) Right image, (b) Left image.

Abstract

Diffuse reflection caused by subsurface multiple scattering
from inhomogeneous dielectric materials is a very
important  phenomenon  in  computer  vision.  Such  mate¬ 
rials  are  quite  prevalent  and  usually  most  of  the  sensed 
reflecting  area  from  inhomogeneous  dielectric  materials 
results  from  diffuse  reflection.  Analysis  and  experimen¬ 
tation  involving  diffuse  reflection  in  computer  vision  has 
typically  assumed  that  it  is  Lambertian  in  nature.  Non- 
Lambertian  reflectance  has  previously  implied  the  pres¬ 
ence  of  specular  reflection  from  a  smooth  or  rough  sur¬ 
face  interface,  and  this  has  been  recently  studied  in  some 
detail  even  though  the  diffuse  reflection  component  has 
still  been  assumed  to  be  Lambertian.  In  fact  we  show 
that  diffuse  reflection  from  many  common  types  of  in¬ 
homogeneous  dielectrics  not  only  can  significantly  devi¬ 
ate  from  Lambertian  behavior  but  that  the  shape  of  the 
angular  distribution  for  diffuse  reflection  can  be  quite 
diverse,  dependent  upon  material  and  surface  parame¬ 
ters.  In  this  paper  we  give  serious  consideration  to  the 
physical analysis of multiple subsurface scattering and
propose  diffuse  reflection  models  based  upon  both  radia¬ 
tive  transfer  theory  and  geometric  optics. 

As  the  behavior  of  the  reflectance  map  for  diffuse  re¬ 
flection  with  respect  to  surface  orientation  can  be  quite 
variable  for  different  surface  materials  and  roughnesses 
this  can  present  a  problem  for  determining  accurate  3-D 
shape  from  current  shape-from-shading  and  photometric 
stereo methodologies which require accurate knowledge
of  the  reflectance  map.  We  propose  a  novel  technique 
for  determination  of  3-D  shape  from  multiple  point  light 
source  illumination  without  requiring  knowledge  of  the 
diffuse  reflectance  map.  This  technique  is  based  upon  a 
photometric partitioning scheme which assumes that diffuse
reflectance from any surface point is monotonic with
respect  to  angle  of  incidence  between  the  light  source 
vector  and  the  surface  normal.  The  monotonicity  as¬ 
sumption  for  diffuse  reflection  allows  determination  of 
the relative size of angles of incidence from multiple
known  incident  source  orientations  from  comparison  of 
the  relative  magnitude  of  respective  diffuse  reflected  val¬ 
ues,  regardless  of  how  the  unknown  diffuse  reflectance 
map  may  vary  from  point  to  point.  Depending  upon 
how  many  light  sources  are  used  the  surface  orientation 
for  a  normal  can  be  constrained  to  lie  within  a  partition 
of orientation space with decreasing size as the number
of light sources gets larger. We suggest a method for
how  this  scheme  can  be  extended  to  simultaneously  de¬ 
termine  shape  and  the  diffuse  reflectance  map. 

1  INTRODUCTION 

In the past decade the computer vision community has
become increasingly aware of the importance of accurate
modeling of the reflectance properties of materials for
extraction of visual features. The works by Torrance and
Sparrow [Torrance and Sparrow, 1967], and Beckmann and
Spizzichino [Beckmann and Spizzichino, 1963] have been amongst
the most popular in providing vision researchers with an
accurate modeling of the specular component of reflection
from rough materials [Healey, 1987], [Wolff, 1987],
[Tagare and deFigueiredo, 1989], [Nayar et al., 1990].
While elaborate specular reflection models have been
incorporated into computer vision methods, the diffuse
component of reflection is still largely considered to be
Lambertian. An exception is the paper by Tagare and
deFigueiredo [Tagare and deFigueiredo, 1989] who propose,
as part of their m-lobed reflectance model for machine
vision, a functional approximation to a generalized
Lambertian diffuse intensity distribution. Apart
from analysis of reflected intensity distributions for diffuse
reflection, Shafer [Shafer, 1985] proposed a color
reflectance model, and Wolff [Wolff, 1991b] proposed a
polarization reflectance model involving the diffuse component
for inhomogeneous dielectrics.

The  study  of  the  precise  nature  of  diffuse  reflection 
from  inhomogeneous  dielectrics  is  very  important  for 
computer  vision  problems.  For  one,  inhomogeneous  di¬ 
electrics  such  as  plastics,  paints,  glasses,  ceramics,  rub¬ 
ber,  etc.  are  quite  prevalent.  Given  non-extended  light 
source  illumination,  diffuse  reflection  from  these  materi¬ 
als  is  far  more  abundant  compared  to  the  visible  image 
area  over  which  specular  reflection  occurs.  From  the 
analysis  given  in  this  paper,  computer  vision  methods 
that  rely  on  the  assumption  that  the  intensity  distribu¬ 
tion  for  diffuse  reflection  is  Lambertian  are  likely  to  not 
be  applicable  to  many  of  these  common  materials. 

Our analysis of diffuse reflection relies on the theory
of radiative transfer developed by Chandrasekhar
[Chandrasekhar, 1960] for multiple scattering of incident light
upon stellar and planetary atmospheres. The important
problem of diffuse reflection and transmission from
plane parallel atmospheres in astrophysics has a number
of similarities with diffuse reflection from inhomogeneous
dielectric materials. Incident light strikes gaseous
molecules within an atmosphere whereupon some of the
light is absorbed, and some of the light is scattered with
an assumed intensity distribution respective to each
individual molecule. Similarly, particles forming discontinuities
in refractive index within a dielectric absorb
and scatter light that has penetrated the surface boundary,
and this process can be quantified using radiative
transfer theory. In this paper, we propose diffuse reflection
models for dielectric materials combining radiative
transfer theory for multiple scattering with air-dielectric
boundary surface effects. Radiative transfer
theory has already been applied to analyzing diffuse
reflection and transmission from colloidal suspensions
[Reichman, 1973], [Orchard, 1969].

Subsequent to the analysis of diffuse reflection, we
concentrate on the general application of photometric
stereo to diffuse reflecting surfaces. Photometric stereo
was originally pioneered by Robert Woodham [Woodham,
1978]. As originally proposed, photometric stereo
obtains local surface orientation using multiple light
source incident orientations, assuming accurate knowledge
of the reflectance map for the surface. Since then
there has been considerable study and experimentation
with this method, primarily for Lambertian diffusing
surfaces [Silver, 1978], [Bruckstein and Kelley, 1983],
[Onn and Bruckstein, 1990], [Koteta, 1991]. There has
also been study of photometric stereo applied to specular
surfaces [Ikeuchi, 1981], and surfaces that possess
both a diffuse and specular component [Coleman and
Jain, 1982], [Nayar et al., 1988], but again, the diffuse
component of reflection is assumed to be Lambertian
in these cases. Tagare and deFigueiredo [Tagare and
deFigueiredo, 1989] propose a theory of photometric stereo
applied to a general class of reflectance maps with respect
to uniqueness of solution for surface orientation
and completeness of shape reconstruction. A functional
approximation to the diffuse component is used based
upon the final result of [Chandrasekhar, 1960]. This result
however does not take into account the variety of
material surface boundary effects that can strongly influence
the nature of the diffuse reflection component.

Experimentation with photometric stereo on diffuse
reflecting surfaces has been primarily performed on surfaces
coated with non-glossy bright white paint [Silver,
1978]. No doubt white paint composed of titanium oxide
or barium sulfate is the closest to being a perfect
diffuser. In general it is recommended in [Silver, 1978]
to build up a table relating surface orientation to quantities
constructed from multiple reflected intensity values,
each intensity value respective to a light source incident
on a calibration object with similar reflectance properties.
For 3 light sources, the following ratios would be
constructed from reflected intensity values:

$$\frac{I_1}{I_1 + I_2 + I_3}, \quad \frac{I_2}{I_1 + I_2 + I_3}, \quad \frac{I_3}{I_1 + I_2 + I_3},$$

in order to cancel out diffuse "albedo" which represents
a scaling factor for maximum reflected diffuse intensity
(i.e., apparent grayness). The physical analysis in this
paper shows in fact that the reflected intensity distributions
of diffusers with varying "albedo" are not always
scalar multiples of one another. The term "albedo" as
commonly used in computer vision to mean scalar multiples
of reflection component functions actually turns
out to be simplistic in the case of diffuse reflection. The
physical parameter that controls how much light is absorbed
by a diffusing inhomogeneous dielectric surface is
called the single scattering albedo, which is appropriately
defined to be the proportion of light energy that is singly
scattered from an individual particle inhomogeneity, to
the amount of light energy that was originally incident
on the particle. A scattering albedo of 1.0 implies conservative
scattering where all light is scattered and no
energy is absorbed by the particle. We will see that the
case of diffuse reflection produced from isotropic conservative
multiple scattering is very close to being Lambertian.
However, the shape of the reflected intensity
distribution for diffuse reflection from nonconservative
isotropic multiple scattering changes significantly with
scattering albedos varying between 0.0 and 1.0. The
point being made is that the shape of the reflected intensity
distribution (i.e., reflectance property) for diffuse
reflecting surfaces is frequently not uniform from point
to point, and this is unfortunately the case for a surface
painted with diffuse paints of varying grayness. The
reflectance distribution at grayer points is not simply a
fractional multiple of the nearly Lambertian distribution
in the conservative scattering case. This poses a potential
practical problem for current photometric stereo
methodologies on diffuse reflecting surfaces that require
accurate knowledge or make assumptions about the diffuse
reflectance map. How can it be determined what
reflectance map to use at each individual point?
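The role of the intensity ratios can be seen in a small numerical sketch. Under the assumption of a purely Lambertian surface, where intensity is a scalar albedo times the cosine of the incidence angle, the three ratios are identical for any albedo, so a lookup table indexed by ratios does not need to know the albedo. The source directions, the helper names and the Lambertian model itself are assumptions of this example; the argument of this paper is precisely that real diffuse surfaces need not scale this way.

```python
import numpy as np

# Three hypothetical unit-length light source directions (rows).
SOURCES = np.array([
    [0.5, 0.0, 0.866],
    [-0.5, 0.0, 0.866],
    [0.0, 0.5, 0.866],
])


def lambertian_intensities(albedo, normal, sources=SOURCES):
    """Idealised Lambertian intensities: albedo times cos(incidence)."""
    normal = np.asarray(normal, dtype=float)
    normal = normal / np.linalg.norm(normal)
    return albedo * np.clip(sources @ normal, 0.0, None)


def intensity_ratios(intensities):
    """Ratios I_k / (I_1 + I_2 + I_3) used to index the lookup table."""
    intensities = np.asarray(intensities, dtype=float)
    return intensities / intensities.sum()


normal = [0.2, 0.1, 1.0]
dark = intensity_ratios(lambertian_intensities(0.3, normal))
bright = intensity_ratios(lambertian_intensities(0.9, normal))
```

Both `dark` and `bright` hold the same three ratios. If the shape of the reflected intensity distribution changed with grayness, as argued above, this cancellation would no longer hold.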

Presented in this paper is a new methodology for
photometric stereo that we call photometric partitioning
which obviates any need for precise knowledge about
the diffuse reflectance map from point to point. This
methodology only relies on the following 2 assumptions
about the behavior of reflectance at a surface point: (i)
diffuse reflectance is monotonic with respect to the angle
of incidence formed by light source incident orientation
and the surface normal; (ii) diffuse reflection is isotropic
with respect to azimuth about the surface normal. Other
than these 2 conditions for diffuse reflection, our method
computes surface orientation equally well regardless of
the values of the diffuse reflectance map and how this
reflectance map may vary from point to point. We show
experimentation with reconstructing the shape of part
of a smooth ceramic vase with 3 light source photometric
partitioning. We then show how the photometric
partitioning method can be used to obtain simultaneous
recovery of shape and the diffuse reflectance map.
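The partitioning idea can be sketched as a set of pairwise tests. Assuming reflectance decreases monotonically as the angle of incidence grows, a brighter reading from source $i$ than from source $j$ means the normal makes a smaller angle with source $i$, that is, $n \cdot s_i > n \cdot s_j$. Each pair of sources therefore places the normal on one side of the plane $n \cdot (s_i - s_j) = 0$. The function name, the source directions and the test reflectance below are hypothetical; this is an illustration of the constraint, not the reconstruction procedure used in the experiments.

```python
from itertools import combinations

import numpy as np


def in_partition(normal, sources, intensities):
    """Check a candidate normal against the intensity ordering.

    Assumes reflectance decreases monotonically with incidence angle,
    so a larger intensity implies a larger dot product with the source.
    """
    normal = np.asarray(normal, dtype=float)
    for i, j in combinations(range(len(sources)), 2):
        side = normal @ (sources[i] - sources[j])
        if (intensities[i] - intensities[j]) * side < 0:
            return False
    return True


# Hypothetical unit-length source directions.
sources = np.array([
    [0.5, 0.0, 0.866],
    [-0.5, 0.0, 0.866],
    [0.0, 0.5, 0.866],
])
true_normal = np.array([0.2, 0.1, 1.0])
true_normal /= np.linalg.norm(true_normal)

# An arbitrary monotonic, non-Lambertian reflectance for the demonstration.
intensities = 0.7 * (sources @ true_normal) ** 1.7

keeps_true = in_partition(true_normal, sources, intensities)
rejects_other = not in_partition([-0.2, 0.1, 1.0], sources, intensities)
```

The true normal always satisfies every pairwise test, whatever monotonic reflectance produced the intensities, which is why no reflectance map is needed. Adding light sources adds planes and shrinks the region of admissible normals.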

2  THE  PHYSICS  OF  DIFFUSE 
SCATTERING 

The diffuse scattering of light as a natural phenomenon
is quite ubiquitous. The color of a blue sky, an orange-red
sunset, a blue-green sea, and the color of inhomogeneous
dielectric materials ranging from precious gems
to ordinary clay can all be explained using the physics
of diffuse scattering. Of interest to us is the predicted
intensity distribution of diffusely scattered light from
inhomogeneous dielectrics. All of this proceeds axiomatically
from the equation of transfer for multiple scattering
[Chandrasekhar, 1960].

The fundamentals of single scattering theory were established
by Lord Rayleigh for particles smaller than the
wavelength of incident light and by Mie for spherical
particles of arbitrary size [Hulst, 1957], [Kerker, 1969].
We assume that the scattered light produced by a single
small particle has an axially symmetric radiance distribution
about the incident direction of light. The angular
distribution of the scattered radiation is described by a
phase function $P(\cos\theta)$ where $\theta$ is the scattering angle of
deflection away from the orientation of the incident light.
See Figure 1. To be consistent with vision radiometric
nomenclature [Horn and Sjoberg, 1979], the phase function
describes the proportion of incident radiance that
is scattered into a given direction. The phase function
is defined so that if $L$ is the radiance of light incident
on a particle then $L \times P(\cos\theta)/4\pi$ is the radiance scattered
into a given direction forming an angle $\theta$ with the
incident direction. The total irradiance of light that is
scattered is given by:

$$\int_{\text{unit sphere}} L \times P(\cos\theta) \, \frac{d\omega}{4\pi} .$$

If the phase function is constant for all angles $\theta$ then
the scattering is said to be isotropic, otherwise it is
anisotropic. In order for scattering to be conservative
we must have:

$$\int_{\text{unit sphere}} P(\cos\theta) \, \frac{d\omega}{4\pi} = 1 .$$

In general

$$\int_{\text{unit sphere}} P(\cos\theta) \, \frac{d\omega}{4\pi} = \sigma \le 1 ,$$

and $\sigma$ is referred to as the albedo for single scattering or
simply single scattering albedo. In nonconservative cases
when $\sigma < 1$ energy is absorbed by the particle. The
factor of $4\pi$ is convenient so that for isotropic scattering
with scattering albedo $\sigma$, $P(\cos\theta) = \sigma$.
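The normalisation of the phase function can be checked numerically. For an axially symmetric phase function the integral over the unit sphere of $P(\cos\theta)\,d\omega/4\pi$ reduces to $\frac{1}{2}\int_{-1}^{1} P(c)\,dc$ with $c = \cos\theta$, which the sketch below evaluates by Gauss-Legendre quadrature. The function name is hypothetical, and the Rayleigh-shaped phase function is included only as a familiar anisotropic example; it is not used elsewhere in this paper.

```python
import numpy as np


def single_scattering_albedo(phase, order=32):
    """Integrate P(cos theta) d(omega) / (4 pi) over the unit sphere.

    `phase` maps an array of cos(theta) values to phase function values.
    Axial symmetry turns the integral into 0.5 * int_{-1}^{1} P(c) dc.
    """
    c, w = np.polynomial.legendre.leggauss(order)
    return 0.5 * np.sum(w * phase(c))


sigma = 0.8

# Isotropic scattering: P(cos theta) = sigma for every angle.
isotropic = single_scattering_albedo(lambda c: sigma * np.ones_like(c))

# Rayleigh-shaped anisotropic scattering with the same albedo.
rayleigh = single_scattering_albedo(lambda c: sigma * 0.75 * (1.0 + c ** 2))
```

Both integrals return the albedo `sigma`, illustrating that the albedo fixes how much light is scattered while the shape of the phase function fixes where it goes.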

The scattering albedo is commonly dependent on the
incident wavelength of light. The fact that the sky is
blue is evidence that the scattering albedo for blue light
in our atmosphere is significantly larger than for other
wavelengths. The bright colors of rubies and emeralds
and the colors of ceramics are also due to the strong
dependence of scattering albedo on wavelength. Reflected
color that is sensed from objects illuminated with white
light attests to the fact that nonconservative absorption
is at work. Perfectly conservative materials should appear
completely white when illuminated with this type
of light.

Figure 2

Figure  2  depicts  light  scattering  amongst  multiple  par¬ 
ticles.  In  this  case  the  radiance  distribution  of  light  that 
is  incident  upon  a  particle  depends  upon  the  past  history 
of  how  this  light  was  created  from  the  previous  scattering 
by  other  particles,  which  in  turn  depend  upon  the  scat¬ 
tering  of  light  by  other  particles  before  that,  and  so  on. 
Figure  2  shows  just  how  complicated  the  situation  gets 
for  only  3  particles,  nothing  compared  to  the  actual  scat¬ 
tering  of  light  that  occurs  amongst  thousands  or  millions 
of particles within an atmosphere or inhomogeneous materials.
We assume here that the inhomogeneous medium
being discussed, whether it be a gaseous atmosphere or a
material, is plane parallel, meaning that the medium can
be subdivided into a set of parallel planes such that one
of these planes represents the boundary of the medium,
and along each individual plane the optical properties
(e.g., particle density) are uniform. In this section surface
effects at the boundary plane, such as refraction, are not
considered. They are discussed in the next section.

Ultimately we want to derive an equation involving the
radiance $I(\tau, \mu, \mu_0)$ as a function of optical depth from
the boundary, $\tau$, and emergent and incident directional
cosines $\mu$ and $\mu_0$ of light radiation with respect to the
boundary normal. We can then evaluate this function
at the boundary plane, namely, $I(\tau = 0, \mu, \mu_0)$, to get
the proper law of diffuse reflection from multiple scattering.
Because there is no dependence of $I$ upon an
azimuth parameter we are assuming an axially symmetric
solution for the radiance about the boundary normal.
The equation of transfer admits such a solution only if
the single scattering phase function is isotropic. The
fact that a number of common types of inhomogeneous
dielectric materials appear equally bright regardless of
rotation about the surface normal supports the physical
isotropy for single scattering in such materials. The
equation of transfer equates the rate of change of the radiance
function $I(\tau, \mu, \mu_0)$ with respect to distance into
the medium, to the sum of light radiation from 3 physical
processes:

•  1.  Light that is intercepted by (i.e., incident on)
particles in the medium.

•  2.  Light  that  is  scattered  (i.e.,  reradiated)  as  a  re¬ 
sult  of  physical  process  1. 

•  3.  Light  from  the  incident  beam  that  passes  in- 
between  particles  and  is  never  scattered. 

We  introduce  a  moss  ahtorption  coefficient  ky  depen¬ 
dent  upon  frequency  defined  by  the  equation 


where  the  last  term  is  obtained  by  differentiating  expres¬ 
sion  2.  Iy{s,fi)  expresses  the  radiance  of  light  radiated 
in  a  direction  with  directional  cosine  fi  with  the  bound¬ 
ary  normal,  as  a  result  of  light  incidence  and  multiple 
scattering. 

We  can  eliminate  needing  to  know  the  parameters  ky 
and  p  by  transforming  the  above  equation  expressed  in 
terms  of  s,  into  an  equation  in  terms  of  r,  the  optical 
depth: 


dly  =  —kyplyde  ,  (1) 

where  p  is  the  mass  density  per  unit  volume,  and  d»  is 
the  differential  distance  that  the  light  traverses  in  the 
medium.  The  term  ky  is  proportional  to  the  number 
of  scatterers  that  intercept  light  per  unit  mass  of  the 
medium  at  incident  frequency  t/.  Equation  1  describes 
the  attenuation  of  light  energy  as  a  result  of  being  in¬ 
tercepted  by  particles  while  traveling  distance  ds  in  the 
medinm.  The  solution  to  equation  1  for  ly  ia  a,  simple 
calculus  problem: 


ly  =  ,  (2) 

which  represents  the  proportion  of  the  original  incident 
radiance  L  that  passes  in-between  all  particles  after 
traversing  a  distance  a  in  the  medium. 

The light energy that is intercepted by particles is either
scattered or absorbed. From equation 1 the radiance incident
on particles per differential unit length, $ds$, is

$$ \kappa_\nu \rho I_\nu . $$

The radiance of light scattered at $s$ into a vector direction
having directional cosine $\mu$ with respect to the
boundary normal and azimuth $\phi$ is given by:

$$ \kappa_\nu \rho \frac{1}{4\pi} \int\!\!\int P(\mu,\phi;\mu',\phi')\, I_\nu(s,\mu',\phi')\, d\mu'\, d\phi' . $$

The integration is performed over all incident directions,
$\mu'$, $\phi'$. Again, the phase function, $P$, in general is only
dependent upon the angle between the incident direction
$\mu'$, $\phi'$ and the scattering direction, $\mu$, $\phi$. Assuming
isotropic scattering with single scattering albedo, $\sigma$, the
phase function is

$$ P = \sigma . $$

Therefore the light energy reradiated by scattering is

$$ \kappa_\nu \rho \frac{1}{2} \sigma \int_{-1}^{1} I_\nu(s,\mu')\, d\mu' . \tag{3} $$

The equation of transfer for isotropic scattering is a
result of combining expressions 1, 3 and 2 as follows:

$$ \frac{dI_\nu(s,\mu)}{ds} = -\kappa_\nu \rho I_\nu(s,\mu) + \kappa_\nu \rho \frac{1}{2} \sigma \int_{-1}^{1} I_\nu(s,\mu')\, d\mu' - \kappa_\nu \rho L e^{-\kappa_\nu \rho s} , \tag{4} $$

$$ \mu \frac{dI(\tau,\mu,\mu_0)}{d\tau} = -I(\tau,\mu,\mu_0) + \frac{1}{2} \sigma \int_{-1}^{1} I(\tau,\mu',\mu_0)\, d\mu' - L e^{-\tau/\mu_0} , \tag{5} $$

where we have suppressed the frequency dependence, $\nu$,
and the last term of this equation expresses the rate of
change of the attenuation of incident light at radiance
$L$ with directional cosine $\mu_0$ relative to the boundary
normal. A more detailed explanation of this is given in
[Wolff, 1991a].

Equation 5 is the equation of transfer. Chandrasekhar
[Chandrasekhar, 1960] solves this integro-differential
equation with over 100 pages of theoretical development.
A summary emphasizing the salient parts of this analysis
is given in [Wolff, 1991a]. The solution for $I(\tau,\mu,\mu_0)$
at $\tau = 0$ can be expressed in terms of the Chandrasekhar
H-function. The reflected radiance according to the law
of diffuse reflection is:

(Law of Diffuse Reflection)

$$ I(0,\mu,\mu_0) = \frac{1}{4\pi}\, \sigma L\, \frac{\mu_0}{\mu + \mu_0}\, H(\mu) H(\mu_0) , \tag{6} $$

where, again, $\mu_0$ and $\mu$ are the directional cosines of
incident and reflected light with the boundary normal,
respectively, $\sigma$ is the single scattering albedo, and $L$ is
the incident radiance. The Chandrasekhar H-functions
are also dependent upon the single scattering albedo, $\sigma$.
An nth order approximation to the Chandrasekhar
H-function can be expressed:

$$ H(\mu) = \frac{1}{\mu_1 \cdots \mu_n}\, \frac{\prod_{i=1}^{n} (\mu + \mu_i)}{\prod_{\alpha=1}^{n} (1 + \kappa_\alpha \mu)} , $$

defined in terms of the positive zeros, $\mu_i$, of the 2nth
Legendre polynomial, $P_{2n}(\mu)$, and the positive nonvanishing
roots, $\kappa_\alpha$, of the associated characteristic equation:

$$ 1 = \sum_{j=1}^{n} \frac{a_j \sigma}{1 - \kappa^2 \mu_j^2} . $$

[Wolff, 1991a] describes more details of the numerical
evaluation of the Chandrasekhar H-function.
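The nth order approximation to the H-function and the Law of Diffuse Reflection (equation 6) can be sketched numerically as follows. This is an illustrative sketch, not the evaluation procedure of [Wolff, 1991a]. It assumes that the a_j are the Gaussian quadrature weights belonging to the positive zeros of the 2nth Legendre polynomial (normalized to sum to 1), that 0 < σ < 1 so that all n roots of the characteristic equation are nonvanishing, and that simple Newton and bisection iterations are adequate. All function names are hypothetical.

```python
import math

def gauss_legendre_positive(n):
    """Positive zeros mu_j of P_2n and their Gaussian weights a_j (sum to 1)."""
    m = 2 * n
    nodes, weights = [], []
    for i in range(1, n + 1):
        x = math.cos(math.pi * (i - 0.25) / (m + 0.5))  # usual starting guess
        for _ in range(100):  # Newton iteration on P_m(x)
            p_prev, p = 1.0, x
            for k in range(2, m + 1):  # three-term recurrence up to P_m
                p_prev, p = p, ((2 * k - 1) * x * p - (k - 1) * p_prev) / k
            dp = m * (x * p - p_prev) / (x * x - 1.0)  # derivative of P_m
            step = p / dp
            x -= step
            if abs(step) < 1e-15:
                break
        nodes.append(x)
        weights.append(2.0 / ((1.0 - x * x) * dp * dp))
    return nodes, weights

def characteristic_roots(sigma, nodes, weights):
    """Positive roots kappa of 1 = sum_j a_j*sigma / (1 - kappa^2 mu_j^2)."""
    def f(x):  # x stands for kappa squared
        return 1.0 - sigma * sum(a / (1.0 - x * mu * mu)
                                 for mu, a in zip(nodes, weights))
    poles = sorted(1.0 / (mu * mu) for mu in nodes)
    edges = [0.0] + poles
    roots = []
    for left, right in zip(edges[:-1], edges[1:]):
        # f is positive just right of `left` and negative just left of `right`.
        lo = left + 1e-12 * right if left > 0.0 else 0.0
        hi = right * (1.0 - 1e-12)
        for _ in range(200):  # plain bisection
            mid = 0.5 * (lo + hi)
            if f(mid) > 0.0:
                lo = mid
            else:
                hi = mid
        roots.append(math.sqrt(0.5 * (lo + hi)))
    return roots

def h_function(mu, sigma, n=4):
    """nth order approximation to the Chandrasekhar H-function."""
    if not 0.0 < sigma < 1.0:
        raise ValueError("this sketch only covers 0 < sigma < 1")
    nodes, weights = gauss_legendre_positive(n)
    kappas = characteristic_roots(sigma, nodes, weights)
    value = 1.0
    for mu_i, kappa in zip(nodes, kappas):
        value *= (mu + mu_i) / (mu_i * (1.0 + kappa * mu))
    return value

def diffuse_radiance(mu, mu0, sigma, L=1.0, n=4):
    """Reflected radiance according to equation 6."""
    return (sigma / (4.0 * math.pi)) * L * mu0 / (mu + mu0) \
        * h_function(mu, sigma, n) * h_function(mu0, sigma, n)
```

The conservative case σ = 1 is deliberately excluded: one root of the characteristic equation then vanishes, and the product over roots has to be treated separately.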


3 DIFFUSE REFLECTION IN
COMPUTER VISION

We analyze the Law of Diffuse Reflection stated by equation 6
in the last section with respect to its application
to phenomena important to computer vision. Under the
assumption that particle inhomogeneities within a dielectric
isotropically scatter incident visible light, equation 6
describes the radiance of diffuse subscattering just
beneath the surface boundary with air. As discussed in
the last section, for many diffuse reflecting materials this
is strongly supported by their isotropic reflectance properties
with respect to azimuth about the surface normal.
The equation of transfer only admits axially symmetric
solutions for an isotropic scattering phase function,
$P(\cos\theta)$ [Chandrasekhar, 1960]. In inhomogeneous dielectrics
there might be different particle type inhomogeneities
with various single scattering albedos. We assume
that we can approximate the behavior of diffuse
reflection from such materials by using the Law of Diffuse
Reflection for a single scattering albedo represented
by the optical averaging over many particles. Also, the
single scattering albedo can be wavelength dependent, in
which case the Law of Diffuse Reflection is summed over
all relevant wavelengths.

Equation 6 can be directly applied to rough diffusing
inhomogeneous dielectric surfaces where surface detail
smaller than the wavelength of incident light isotropically
scatters this light to reasonable approximation.
Many diffuse paints create this type of surface roughness
upon application. Figure 3a depicts a polar scattering
diagram for normal incident light with unit radiance
for various single scattering albedos ranging from
$\sigma = 1.0$ to $\sigma = 0.3$ according to the Law of Diffuse
Reflection. For each point on the polar plot, the distance
from the origin represents the radiance of reflected
radiation scattered in the angular direction relative to
the surface normal depicted by its angle relative to the
vertical axis. The dashed plot depicts ideal conservative
Lambertian reflectance. For conservative isotropic
scattering the predicted Law of Diffuse Reflection from
multiple scattering is indeed very close to being Lambertian.
The radiance scattered parallel to normal is slightly
larger than radiance scattered at oblique angles. Because
scattering is conservative, the total reflected irradiance
for the Law of Diffuse Reflection for $\sigma = 1.0$ is equal to
the total incident irradiance, just as for the case of
conservative ideal Lambertian reflectance. Notice the vast
reduction in reflected radiance for all scattering angles
for single scattering albedo decreasing from $\sigma = 1.0$ to
$\sigma = 0.9$. At normal scattering the reflected radiance for
$\sigma = 0.9$ is about a third of that for $\sigma = 1.0$.
Because of multiple scattering it should be
clear that the reflected radiance should degrade nonlinearly
with respect to decreasing $\sigma$. The geometric progression
of the proportion of light that is absorbed for
a given single scattering albedo with respect to multiple
scattering is extremely complicated to analyze combinatorially.
The complete description lies in the equation of
transfer, equation 5.

Looking more closely, note on the polar plot for $\sigma = 0.9$
that the radiance scattered normal to the surface is
actually slightly smaller than the radiance scattered
at most of the oblique angles. For decreasing $\sigma$ going
towards $\sigma = 0$, radiance scattered near grazing to the
surface becomes more than twice as much as the radiance
scattered normal to the surface. The shapes of the
scattering polar plots are significantly different over the
range of single scattering albedos. Figure 3b shows the
graph of reflected radiance parallel to the surface normal
versus angular orientation (in degrees) of incident
light of unit radiance, for the same range of single
scattering albedos as in Figure 3a. The dashed plot shows
the ideal conservative Lambertian Law. Again,
for $\sigma = 1.0$, the Law of Diffuse Reflection is very nearly
Lambertian. As $\sigma$ decreases the darkening law for
reflected light deviates more and more from Lambertian
behavior, falling off more slowly as a function of angle of
incidence.

Figures 3a and 3b for the Law of Diffuse Reflection imply
that, for diffuse reflecting materials with diffuse surface
scattering, points with different levels of "grayness"
can have significantly different reflectance distribution
properties. Defining diffuse "albedo" in the conventional
sense as a scaling factor for a surface-wide uniform
reflectance property is not physically precise. For
instance, while the reflected radiance has decreased by a
factor of 22 going from $\sigma = 1.0$ to $\sigma = 0.3$ for reflection
normal to the surface with light at normal incidence, for
reflection grazing the surface with light at normal
incidence the reflected radiance has only decreased by a
factor of 8. The single scattering albedo, $\sigma$, is the true
physical parameter defining diffuse reflection, and variation
of this physical parameter can produce significantly
different reflectance properties from point to point on
the same surface.
GEOMETRIC REFRACTION OF INCIDENT AND SCATTERED LIGHT BY A
PLANAR SURFACE BOUNDARY

Figure  4 

There are a number of diffuse reflecting surfaces which
have an optically smooth surface boundary. Glossy
paints possess diffuse reflecting properties on top of
which there is a glossy sheen produced by specular
reflection from an optically smooth surface layer. Many
plastics and ceramics have this same characteristic. The
effect of geometric refraction and Fresnel attenuation
produced by optically smooth surface boundaries alters
the Law of Diffuse Reflection for diffusing materials
having this type of surface layer. Figure 4 shows the
geometric refraction of light incident on a smooth surface
boundary, passing through the boundary and then
subject to subsurface multiple scattering by particle
inhomogeneities. Light that was originally incident at
directional cosine $\mu_0$ with respect to the surface normal
is now incident for subsurface diffuse scattering at
directional cosine $\mu_0'$, where according to Snell's Law [Siegal
and Howell, 1981]:

$$ \frac{\sin(\cos^{-1}\mu_0)}{\sin(\cos^{-1}\mu_0')} = n , \tag{7} $$

where $n$ is the simple index of refraction of the dielectric
boundary. Accordingly,

$$ \mu_0' = \cos\left[\sin^{-1}\left(\frac{\sin(\cos^{-1}\mu_0)}{n}\right)\right] . $$

As depicted in Figure 4, the diffusely reflected light from
multiple scattering is then geometrically refracted once
again upon transmission back into air. The light that
gets refracted into air at directional cosine, $\mu$, was,
according to Snell's Law, originally diffusely reflected below
the boundary surface at directional cosine $\mu'$, where:

$$ \mu = \cos\left[\sin^{-1}\left(\sin(\cos^{-1}\mu') \times n\right)\right] . $$
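The two refraction relations can be checked with a short sketch. The function names are hypothetical, and returning `None` when Snell's Law has no transmitted solution (total internal reflection) is a design choice of this sketch rather than something stated in the text.

```python
import math

def internal_cosine(mu_air, n):
    """Directional cosine inside the dielectric for light whose
    directional cosine in air is mu_air (Snell's Law, index n)."""
    sin_air = math.sqrt(max(0.0, 1.0 - mu_air * mu_air))
    return math.cos(math.asin(sin_air / n))

def external_cosine(mu_inside, n):
    """Directional cosine in air for light leaving the dielectric at
    directional cosine mu_inside; None if it is totally internally reflected."""
    sin_air = n * math.sqrt(max(0.0, 1.0 - mu_inside * mu_inside))
    if sin_air > 1.0:
        return None
    return math.cos(math.asin(sin_air))

def critical_angle_deg(n):
    """Internal angle of incidence beyond which no light is transmitted."""
    return math.degrees(math.asin(1.0 / n))

if __name__ == "__main__":
    n = 1.7
    mu0_inside = internal_cosine(0.5, n)           # light incident at 60 degrees
    print(mu0_inside, external_cosine(mu0_inside, n))
    print(critical_angle_deg(n))
```

Refracting into the dielectric and back out again returns the original directional cosine, which is a convenient consistency check on the two formulas.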

Upon reflection at a planar surface boundary light
becomes attenuated according to the Fresnel reflection
coefficients, $F(n, \mu)$, for different components of
polarization, as a function of index of refraction, $n$, and
directional cosine, $\mu$ [Siegal and Howell, 1981]. We assume
here that incident light is unpolarized. For incident light
with radiance $L$, light that is specularly reflected on the
opposite side of the normal at the same angle has radiance
$F(n, \mu) \times L$. The energy transmitted into the
material is therefore proportionally $1 - F(n, \mu)$. For
transmission of light from the dielectric into air, since light
is going from a medium with index of refraction, $n$, to
index of refraction, 1.0, the attenuation for specular
reflection is now $F(1/n, \mu')$ for incident directional cosine,
$\mu'$. The proportion of energy transmitted into air is
therefore $1 - F(1/n, \mu')$.* To first order approximation the
Law of Diffuse Reflection for inhomogeneous dielectrics
with optically smooth surface boundaries is:

(Optically Smooth Surfaces)

$$ [1 - F(n,\mu_0)] \times \frac{1}{4\pi}\, \sigma L\, \frac{\mu_0'}{\mu' + \mu_0'}\, H(\mu') H(\mu_0') \times [1 - F(1/n,\mu')] . \tag{8} $$
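The two Fresnel transmission factors that bracket equation 8 can be sketched as below. The text does not spell out the form of F; this sketch assumes the standard Fresnel equations for a dielectric interface, averaged over the two polarization components for unpolarized light. The function names and the convention that `n_rel` is the index of the transmitting medium divided by that of the incident medium are assumptions.

```python
import math

def fresnel_unpolarized(n_rel, mu):
    """Fresnel reflection coefficient for unpolarized light.

    n_rel: index of the medium being entered divided by the index of the
           medium the light comes from (n for air-to-dielectric, 1/n for
           dielectric-to-air).
    mu:    cosine of the angle of incidence.
    """
    sin_t = math.sqrt(max(0.0, 1.0 - mu * mu)) / n_rel
    if sin_t >= 1.0:
        return 1.0  # total internal reflection
    mu_t = math.sqrt(1.0 - sin_t * sin_t)
    r_perp = (mu - n_rel * mu_t) / (mu + n_rel * mu_t)
    r_par = (n_rel * mu - mu_t) / (n_rel * mu + mu_t)
    return 0.5 * (r_perp ** 2 + r_par ** 2)

def boundary_transmission(n, mu0, mu_inside):
    """Product [1 - F(n, mu0)] * [1 - F(1/n, mu')] appearing in equation 8."""
    return (1.0 - fresnel_unpolarized(n, mu0)) \
        * (1.0 - fresnel_unpolarized(1.0 / n, mu_inside))

if __name__ == "__main__":
    n = 1.7
    print(fresnel_unpolarized(n, 1.0))          # normal incidence from air
    print(boundary_transmission(n, 1.0, 1.0))   # in and out along the normal
```

Multiplying `boundary_transmission` by the subsurface Law of Diffuse Reflection evaluated at the internal directional cosines gives the first order approximation.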

The second order approximation accounts for light that
is reflected from the surface-air boundary back into the
dielectric, and then diffusely scattered again. As there
is total internal reflection for scattering angles greater
than the critical angle,† $\sin^{-1}(1/n)$ (about 36° for n =
1.7 for most plastics and ceramics), the amount of this
light reflected back into the dielectric is significant. The
third order approximation, which accounts for light that
is again reflected back into the dielectric from a second
diffuse scattering, is not significantly different from the
second order approximation.

Figures 5a and 5b plot the second order approximation
to the Law of Diffuse Reflection for diffusers with
optically smooth boundary surfaces for n = 1.7, typical
of a number of common dielectrics such as ceramic
and plastic. The plots are defined in the same way as
for Figures 3a and 3b using the same single scattering
albedos $\sigma$. For $\sigma = 1.0$, the total diffuse reflected irradiance
is less than the total incident irradiance since part
of the reflected light energy is contributed to a specular
component. Interestingly, the reflectance radiance
distributions for different scattering albedos, $\sigma$, for an
optically smooth surface boundary are more similar than
they were for a diffusing surface boundary. However, the
plots in Figures 5a and 5b are more deviant from ideal
Lambertian behavior. Figure 5a shows that almost no
light is scattered in directions close to grazing the surface.
Light that transmits through the boundary surface
layer after subsurface diffuse scattering only results from
diffusely scattered light less than or equal to the critical
angle. Near the critical angle of internal incidence
transmitted light nearly grazes the surface at 90° to the
boundary normal. Also, near this angle almost all light
is internally reflected and very little light is transmitted,
explaining why the polar plots in Figure 5a depict little
or no diffuse reflection almost orthogonal to the boundary
normal.

* Due to a light wave impedance correction between media
with different indices of refraction, the proportion of
radiance, $L$, measured in air, that is transmitted from a medium
of index of refraction, 1.0, to a dielectric medium of index of
refraction, $n$, is $n \times (1 - F(n, \mu)) \times L$ as measured in the
dielectric. Similarly, the proportion of radiance, $L$, measured in the
dielectric with index of refraction, $n$, that is transmitted into
air with index of refraction, 1.0, is $(1/n) \times (1 - F(1/n, \mu')) \times L$,
as measured in air. However, since we are measuring incident
light radiance in air and reflected radiance also in air, the
impedance correction terms cancel out.

† This critical angle is determined by Snell's Law, equation 7,
where $\mu_0'$ now represents the cosine of the internal
angle of incidence within the dielectric and $\mu_0$ represents the
cosine of the transmitted angle. Solving for the critical angle
when the transmitted angle is 90° (i.e., grazing)
gives $\sin^{-1}(1/n)$.

Figure 5b shows the strong deviation of diffuse reflection
from the Lambertian $\cos\theta$ Law for all scattering
albedos, $\sigma$. The falling off of reflected radiance as a
function of angle of incidence, starting from normal light
incidence, is much less than predicted for Lambertian
behavior. Near grazing incidence the reflected radiance for
a diffuser with optically smooth boundary starts falling
off rapidly, going to zero at 90°. Figure 6 shows the
Law of Diffuse Reflection for a rough surface boundary
composed of a Gaussian distribution of optically planar
microfacets with respect to orientation, with mean
orientation the surface normal and a standard deviation of
10°. Not only does the reflected radiance as a function
of angle of incidence get even flatter, but at grazing
incidence the reflected radiance is non-zero. This is easy to
explain since grazing incidence produces non-zero
transmission into the surface through planar microfacets
oriented oblique to the surface normal. Such transmission
is then diffusely scattered and transmitted back out into
air. Not surprisingly, the larger the standard deviation of
the orientation roughness distribution for the planar
microfacets, the flatter is the law of darkening as a function
of angle of incidence and the larger the diffuse reflected
radiance at grazing incidence.


4 THE METHOD OF
PHOTOMETRIC PARTITIONING

4.1 SHAPE RECOVERY AND THE
MONOTONICITY ASSUMPTION FOR
DIFFUSE REFLECTION

The discussion in section 3 about the nature of diffuse
reflection from typical surfaces encountered in computer
vision illustrates not only how diffuse reflection can
deviate from Lambertian behavior, but how diverse this
behavior can be according to the optical properties of
the surface boundary and the single scattering albedo.
Furthermore, these optical properties commonly vary
between surface points, producing in turn variable diffuse
reflectance properties over the same surface. Unless
these optical characteristics are precisely known from
point to point, there can be a significant uncertainty
in knowledge about the diffuse reflectance properties at
each point. Methodologies requiring precise knowledge
of the diffuse reflectance map for shape recovery may
produce significantly inaccurate results for a number of
commonly occurring surface materials. These include
such methodologies as shape-from-shading [Horn, 1975]
and photometric stereo [Woodham, 1978].

We investigate a novel methodology for the recovery
of shape that utilizes multiple point light source
illumination. In this sense our methodology is a variation
of  photometric  stereo,  but  the  main  conceptual  differ¬ 
ence  is  that  we  do  not  require  knowledge  of  the  diffuse 
reflectance  map.  Such  a  methodology  is  generally  appli¬ 
cable  to  the  shape  recovery  of  inhomogeneous  dielectric 
surfaces  regardless  of  the  possible  variety  of  diffuse  re¬ 
flectance  discussed  above.  We  will  also  see  that,  theoret¬ 
ically  speaking,  by  adding  more  and  more  light  sources 
our  methodology  can  produce  almost  arbitrary  accuracy 
in  surface  orientation  measurement.  This  is  certainly  not 
the  case  for  conventional  photometric  stereo  without  any 
knowledge  of  the  reflectance  map. 

Our methodology relies upon an important assumption
about diffuse reflection stated as follows:

MONOTONICITY ASSUMPTION FOR DIFFUSE
REFLECTION: Diffuse reflected radiance from inhomogeneous
dielectrics resulting from subsurface scattering is
monotonic with respect to the angle of incidence between
the incident orientation of light and the surface normal.
This is true for any emittance direction.

This assumption is robust for inhomogeneous dielectrics
possessing isotropic scattering inhomogeneities and having
a surface boundary that either scatters isotropically
or according to Snell's Law and the Fresnel reflection
coefficients, as supported by Figures 3b, 5b, and 6. It
should be clear that monotonicity should consistently
imply monotonic decreasing with respect to angle of
incidence. The monotonicity assumption for diffuse
reflection is also intuitively appealing. For light incident
on a surface at radiance, $L_i$, and angle, $\theta_i$, respective
to the surface normal, the total incident irradiance is
$L_i \cos\theta_i$ [Horn and Sjoberg, 1979] and this is monotonic
decreasing with respect to $\theta_i$. Hence the total irradiance
of light incident on a surface which is then subject to
diffuse scattering decreases as a function of angle of
incidence. This does not preclude some possibly pathological
case of surface roughness where for some reason most
of the planar microfacet surface normals are oriented
extremely oblique to the surface normal and none or very
few are actually oriented parallel to the surface normal.
It is theoretically possible in this case for monotonicity of
diffuse reflection to break down in some ranges of angle
of incidence, but in such a case it is fair to call into
question exactly what a surface normal physically means at a
point. Nonetheless, as seen in section 3, the monotonicity
assumption for diffuse reflection is by far more robust
than the assumption about Lambertian reflectance.

Because of its directional nature, the specular component
of reflection violates monotonicity for reflected radiance
with respect to angle of incidence for both smooth
and rough surfaces. Regions on inhomogeneous dielectric
surfaces from which there is observed significant specular
reflection must be identified in order to know where
monotonicity is violated. This is actually very easy for
dielectrics since, as presented in [Wolff, 1989], specular
reflecting regions on smooth and rough surfaces can be
passively identified by placing a polarizing filter in front
of a video camera. For photometric stereo, once specular
regions are identified they automatically provide surface
orientation information about these regions because the
incident orientation of the light source from which the
specular reflection is produced is known (i.e., the surface
normal bisects the light source and viewing directions).
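The bisector rule for specular regions can be written down directly. The sketch below assumes both directions are given as 3-D vectors pointing away from the surface point, toward the light source and toward the camera respectively; the function names are hypothetical.

```python
import math

def normalize(v):
    """Scale a 3-D vector to unit length."""
    length = math.sqrt(sum(c * c for c in v))
    return tuple(c / length for c in v)

def specular_normal(light_dir, view_dir):
    """Unit surface normal that bisects the direction toward the light
    source and the direction toward the viewer."""
    l = normalize(light_dir)
    v = normalize(view_dir)
    # The sum of two unit vectors points along their angular bisector.
    # (Undefined if the two directions are exactly opposite.)
    return normalize(tuple(a + b for a, b in zip(l, v)))

if __name__ == "__main__":
    print(specular_normal((1.0, 0.0, 0.0), (0.0, 0.0, 1.0)))
```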

PHOTOMETRIC PARTITIONING OF MUTUALLY ILLUMINATED AREA
(3 LIGHT SOURCES)

Figure 8

Consider the mutual illumination area of 3 light
sources (i.e., surface orientations that do not lie in
shadow for any of the light sources individually turned
on) as depicted in Figure 7 and assume that the specular
regions produced by each one of the light sources
have been identified. Surface points that are mutually
illuminated and that are not in specular regions diffusely
reflect light from each one of the light sources individually
into the camera sensor. At a given surface point, the
diffuse reflected radiance values $I_1$, $I_2$ and $I_3$ respective
to each light source are produced for the same emittance
angle (i.e., angle between the surface normal and
the viewing vector) but at possibly different angles of
incidence. Given the monotonicity assumption for diffuse
reflection, a simple comparison of the relative magnitudes
of $I_1$, $I_2$ and $I_3$ defines a natural partitioning of
the mutual illumination area of surface orientations as
depicted in Figure 8. The relative magnitudes of the diffuse
reflected intensities are in the exact opposite order
(due to monotonic decreasing) as the relative magnitudes
of the respective angles of incidence made between each
respective light source and the surface normal.

PHOTOMETRIC PARTITIONING OF MUTUALLY ILLUMINATED AREA
(4 LIGHT SOURCES)

Figure 9

Consider the comparison of the angles of incidence for
just 2 light sources. If the incident orientations for light
sources 1 and 2 are represented by points on the unit
Gaussian sphere, the locus of points on the Gaussian sphere
representing surface orientations that are simultaneously
equi-angular with these two incident source orientations
is a great circle perpendicularly bisecting the arc between
the 2 points for the two light sources. If the angle of
incidence for a surface normal is smaller with light source
1 than for light source 2, then the surface normal must
lie to the left of the equi-angular partition line between
light sources, within the mutual illumination region as
depicted in Figure 8. If the angle of incidence for a
surface normal is larger with light source 1 than for light
source 2, then the surface normal must lie to the right of
the equi-angular partition line, within the mutual
illumination region. Equality would place the surface normal
on the partition line itself. The monotonicity assumption
for diffuse reflection allows comparison of the relative
magnitude of the angle of incidence between the
surface normal and each one of the point light sources,
without any knowledge of the diffuse reflectance map.
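For unit vectors, a surface normal is equi-angular with two light sources exactly when its dot products with them are equal, so the partition great circle is the set of normals n with n · (l1 - l2) = 0. A minimal sketch of the side test follows; the function name and the tolerance parameter are assumptions made for illustration.

```python
def closer_light(normal, light1, light2, tol=1e-9):
    """Return 1 or 2 for the light source with the smaller angle of
    incidence, or 0 if the normal lies on the equi-angular partition line.

    All three arguments are unit 3-D vectors.
    """
    d = sum(n * (a - b) for n, a, b in zip(normal, light1, light2))
    if abs(d) <= tol:
        return 0
    return 1 if d > 0.0 else 2

if __name__ == "__main__":
    l1, l2 = (1.0, 0.0, 0.0), (0.0, 1.0, 0.0)
    print(closer_light((0.8, 0.6, 0.0), l1, l2))
    print(closer_light((0.6, 0.8, 0.0), l1, l2))
```

In the method itself the normal is unknown. Under the monotonicity assumption the same decision is made by comparing the two measured diffuse intensities, which is what makes the reflectance map unnecessary.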

Figure  8  shows  the  6  partition  regions  within  the  mu¬ 
tual  illumination  area  created  by  equi-angular  partition 
lines  between  3  light  sources.  In  this  example  the  in¬ 
cident  orientations  of  3  light  sources  are  mutually  or¬ 
thogonal  so  that  the  incident  orientations  for  each  one 
of  the  light  sources  comprises  each  of  the  corners  of  the 
mutual  illumination  area.  The  labeling  numbers  ABC 
for  each  of  the  regions  in  Figure  8  represent  that  the  re¬ 
flected  diffuse  intensities  for  the  surface  orientations  in 
this  region  are  in  the  order  A  <  B  <  C  respective  to 
light  sources  A,  B  and  C.  This  implies  that  the  angles 
of  incidence  are  in  the  opposite  order  A  >  B  >  C  respec¬ 
tive  to  light  sources  A,  B  and  C.  Figure  9  shows  the  12 
partition  regions  within  the  same  mutual  illumination 
area  created  by  equi-angular  partition  lines  between  4 
light  sources.  The  4th  light  source  is  added  to  the  center 
of  the  lighting  configuration  depicted  in  Figure  8.  The 
same  labeling  scheme  for  partition  regions  holds. 
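The labeling scheme amounts to sorting the light sources by measured diffuse intensity. The sketch below assumes strict inequalities between the measurements (no ties) and uses hypothetical function names.

```python
def partition_label(intensities, names):
    """Label of the partition region: source names listed from the
    smallest to the largest diffuse reflected intensity."""
    order = sorted(range(len(intensities)), key=lambda i: intensities[i])
    return "".join(names[i] for i in order)

def incidence_order(intensities):
    """Source indices sorted from smallest to largest angle of incidence,
    which is the reverse of the intensity order under monotonicity."""
    return sorted(range(len(intensities)), key=lambda i: -intensities[i])

if __name__ == "__main__":
    measured = [0.2, 0.9, 0.5]   # hypothetical intensities for sources A, B, C
    print(partition_label(measured, "ABC"))
    print(incidence_order(measured))
```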

For N light sources the total number of combinations
of comparison outcomes that can be made between any
2 diffuse reflectance measurements is

$$ 2 \binom{N}{2} = \frac{N!}{(N-2)!} = N(N-1) . $$

This defines the maximum number of partitions within
a mutual illumination area that can be created with N
light sources. Figures 8 and 9 depict lighting configurations
that achieve the maximum number of partitions
for 3 and 4 light sources respectively. An example of
suboptimal partitioning is a configuration of 3 coplanar
light source orientations only defining 4 surface orientation
partition regions.
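The region counts for the two lighting configurations can be checked by brute force. The sketch below assumes that the 3 orthogonal sources lie along the coordinate axes, that the mutual illumination area is the positive octant, and that the 4th source points along the center of that octant; it counts how many distinct strict orderings of the angles of incidence occur over a grid of normals. The grid resolution and function name are arbitrary choices.

```python
import math
from itertools import product

def count_partition_regions(lights, steps=12):
    """Count distinct strict orderings of n . l_i over a grid of
    (unnormalized) normals in the positive octant."""
    seen = set()
    for x, y, z in product(range(1, steps + 1), repeat=3):
        dots = [x * l[0] + y * l[1] + z * l[2] for l in lights]
        # Skip normals lying on an equi-angular partition line.
        if any(abs(a - b) < 1e-9
               for i, a in enumerate(dots) for b in dots[i + 1:]):
            continue
        seen.add(tuple(sorted(range(len(lights)), key=lambda i: dots[i])))
    return len(seen)

three = [(1.0, 0.0, 0.0), (0.0, 1.0, 0.0), (0.0, 0.0, 1.0)]
c = 1.0 / math.sqrt(3.0)
four = three + [(c, c, c)]

if __name__ == "__main__":
    for lights in (three, four):
        n = len(lights)
        print(n, count_partition_regions(lights), n * (n - 1))
```

Scaling a normal scales all its dot products equally, so unnormalized grid points are sufficient for comparing angles of incidence.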

We now consider the average angular error produced
by partitioning of a mutual illumination surface orientation
region the size of the octant of a sphere. This is
the same size as the mutual illumination area depicted
in Figures 8 and 9. Once we know which surface
orientation partition region a surface normal belongs to,
what orientation value should be assigned to the normal?
The best assignment of orientation value should minimize
average angular error. Assuming initially that all
surface orientations within the mutual illumination area
are equally probable, assigning the centroid orientation
with respect to the vertices of the respective partition
region comes very close to minimizing overall average
angular error.

For a diffuse reflecting sphere, assigning orientation
values in this way to surface normals, for 3 light sources
as in Figure 8 the average angular error is 12.1°, and
for 4 light sources as in Figure 9 the average angular
error is 8.7°. The portion of the sphere for which shape
is being determined is shown in Figure 10a. Figure 10b
shows a simulation of shape reconstruction using
photometric partitioning with 3 light sources, and Figure 10c
shows a simulation of shape reconstruction using
photometric partitioning with 4 light sources (respective to
the lighting configurations in Figures 8 and 9). The
shapes in Figures 10b and 10c have the appearance of
rough-hewn unfinished sculptures of the sphere in Figure
10a. Using only 3 light sources, the method of photometric
partitioning provides more rudimentary shape
information, while photometric partitioning with additional
light sources provides more shape precision. Theoretically,
arbitrary precision of shape could be determined
with more and more point light sources, without any
knowledge of the diffuse reflectance map. Shape accuracy
is also increased for a fixed number of light sources
by decreasing the size of the mutual illumination area
covered. If in Figures 8 and 9 the mutual illumination
area were reduced to one-fourth of the octant of a sphere,
the respective average angular errors would be reduced
by one-half to 6.0° and 4.3°.
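A rough Monte Carlo sketch of this kind of error estimate is given below. It is not the simulation behind the reported figures: it assumes surface orientations uniformly distributed over the positive octant, assumes the lighting configurations sketched for Figures 8 and 9 (3 sources along the axes, plus a 4th at the octant center), and, as a simplification, represents each partition region by the normalized mean of the sampled normals that fall in it rather than by the centroid of the region's vertices. Its output therefore need not match the quoted values exactly.

```python
import math
import random

def ordering(normal, lights):
    """Partition region of a normal: light indices sorted by n . l_i."""
    dots = [sum(n * c for n, c in zip(normal, l)) for l in lights]
    return tuple(sorted(range(len(lights)), key=lambda i: dots[i]))

def mean_angular_error_deg(lights, samples=20000, seed=0):
    """Average angle between sampled normals and the representative
    orientation assigned to their partition region."""
    rng = random.Random(seed)
    normals = []
    for _ in range(samples):
        # |Gaussian| components, normalized: uniform over the positive octant.
        v = [abs(rng.gauss(0.0, 1.0)) for _ in range(3)]
        length = math.sqrt(sum(c * c for c in v))
        normals.append(tuple(c / length for c in v))
    sums = {}
    for nrm in normals:
        acc = sums.setdefault(ordering(nrm, lights), [0.0, 0.0, 0.0])
        for k in range(3):
            acc[k] += nrm[k]
    reps = {}
    for key, acc in sums.items():
        length = math.sqrt(sum(c * c for c in acc))
        reps[key] = tuple(c / length for c in acc)
    total = 0.0
    for nrm in normals:
        rep = reps[ordering(nrm, lights)]
        cos_angle = min(1.0, sum(a * b for a, b in zip(nrm, rep)))
        total += math.degrees(math.acos(cos_angle))
    return total / samples

three = [(1.0, 0.0, 0.0), (0.0, 1.0, 0.0), (0.0, 0.0, 1.0)]
c = 1.0 / math.sqrt(3.0)
four = three + [(c, c, c)]

if __name__ == "__main__":
    print(mean_angular_error_deg(three), mean_angular_error_deg(four))
```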

These results are a significant improvement over
conventional photometric stereo assuming a Lambertian
reflectance map. For an inhomogeneous dielectric with a
smooth surface with diffuse reflectance dependent upon
angle of incidence simulated in Figure 5b, an assumption
of Lambertian reflectance would produce an average
angular error of over 15° for all single scattering albedos.
For an inhomogeneous dielectric with a rough surface
with diffuse reflectance dependent upon angle of
incidence simulated in Figure 6, an assumption of
Lambertian reflectance would produce an average angular error
of over 20° for all single scattering albedos. These are
with respect to simulations for reconstruction of the
octant of the sphere shown in Figure 10a.

Figure 11 shows an experimental result of shape
determination of part of a smooth ceramic vase from
photometric partitioning with 3 light sources configured
orthogonal to one another. The square on the photograph
depicts where the shape was reconstructed. Note how the
shape reconstruction in Figure 11 is "smoother" than the
reconstruction of the sphere in Figure 10b. This is due
to additional surface orientations being assigned to pixels
from the appropriate orientation midpoints of sides
of partition regions in Figure 8 according to where 2 of
the camera intensity values were equal within gray level
repeatability (determined to be ±2 gray levels). When
all 3 gray values were equal to within repeatability, the
orientation value is (0, 0) in gradient coordinates.

The shape reconstruction in Figure 12 was obtained
from the same camera data but assuming that the
reflectance map is Lambertian with variable albedo scale
factor. Note how this shape reconstruction is considerably
flatter than it should be. This is consistent with
the fact that diffuse reflectance from the smooth ceramic
vase does not fall off as fast with respect to angle of
incidence as for Lambertian reflectance (as predicted in Figure
5b). If Lambertian reflectance is assumed, photometric
stereo infers that the relative angle of incidence
between the surface normal and each of the light sources is
not varying as much as it really is from point to point.
Therefore, the surface normals appear not to vary nearly
as much as they should and the reconstructed surface
is correspondingly flatter.
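The flattening effect can be reproduced with a toy calculation. The sketch assumes 3 unit light sources along the coordinate axes, so that Lambertian photometric stereo with a variable albedo scale factor reduces to normalizing the intensity triple. The square-root reflectance used below is a purely hypothetical stand-in for a surface whose diffuse reflectance falls off more slowly than Lambert's cosine law; it is not the reflectance of the vase.

```python
import math

def normalize(v):
    length = math.sqrt(sum(c * c for c in v))
    return tuple(c / length for c in v)

def lambertian_normal(intensities):
    """Lambertian photometric stereo for light sources along x, y and z:
    I_i = albedo * (l_i . n), so the normal is the normalized intensity
    vector and the albedo scale factor drops out."""
    return normalize(intensities)

def angle_deg(u, v):
    cos_angle = max(-1.0, min(1.0, sum(a * b for a, b in zip(u, v))))
    return math.degrees(math.acos(cos_angle))

def slow_falloff(cos_incidence):
    """Hypothetical reflectance falling off more slowly than cosine."""
    return math.sqrt(cos_incidence)

# Two true surface normals and the normals recovered from their
# slow-falloff intensities under the Lambertian assumption.
n1 = normalize((3.0, 1.0, 1.0))
n2 = normalize((1.0, 1.0, 3.0))
e1 = lambertian_normal([slow_falloff(c) for c in n1])
e2 = lambertian_normal([slow_falloff(c) for c in n2])
true_spread = angle_deg(n1, n2)
recovered_spread = angle_deg(e1, e2)

if __name__ == "__main__":
    print(true_spread, recovered_spread)
```

The recovered normals differ from each other by a smaller angle than the true normals do, which is the sense in which the reconstructed surface comes out flatter.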

4.2  SIMULTANEOUS  DETERMINATION 
OF  SHAPE  AND  DIFFUSE 
REFLECTANCE 

The  method  of  photometric  partitioning  not  only  pro¬ 
vides  a  methodology  for  recovering  shape  from  unknown 
diffuse  reflectance,  but  can  actually  be  used  to  deter¬ 
mine  the  diffuse  reflectance  map  itself  on  a  surface  of 
unknown  shape.  Figure  13  depicts  a  lighting  configura¬ 
tion  that  can  be  used  for  the  simultaneous  determination 
of  2-D  surface  orientation  and  the  diffuse  reflectance  map 
from photometric partitioning. A total of 6 point light sources are used here over an incident 2-D orientation range of 180°. The solid lines represent equiangular partition lines for adjacent light sources. Determining the light source that produces the maximum diffuse reflected intensity with respect to all 6 light sources constrains the normal at a surface point to be within an orientation range of 36° for normals oriented somewhere within the 4 middle solid partitions, and an orientation range of 18° for normals oriented more oblique than the extreme left and right solid partition lines (since normals cannot be more than 90° relative to camera viewing).

The  jagged  lines  in  Figure  13  represent  equiangular 
partition  lines  for  light  sources  that  have  one  light  source 
in between. The exception is the horizontal jagged lines, which represent directions orthogonal to viewing. The jagged partition lines can be used to constrain all 2-D surface orientations to within 18°. Consider now comparing the diffuse reflected radiances from the 2 light sources on either side of the light source producing the maximum diffuse reflectance. Recall that normals whose maximum diffuse reflectance is produced from either the leftmost or rightmost light source are already constrained to within 18°, so we are now considering normals whose maximum diffuse reflectance is produced from one of the 4 middle light sources. Depending upon which diffuse reflected radiance is greater, a normal constrained within a solid partition can now be determined to lie on one side of the jagged partition line bisecting the solid partition. That is, the orientation for a normal can be constrained within 18°. Unfortunately, due to symmetry, no additional equiangular partition lines are produced from the 6 light sources.

In brief, using the monotonicity assumption and isotropy of diffuse reflectance, the orientation range of a normal at a surface point can be restricted within 18° using the configuration in Figure 13 by observing the light sources producing the largest and second largest diffuse reflected radiance. Assigning an orientation value that bisects these 18° partitions gives a worst case error of 9° and an average error of 4.5°, assuming all 2-D orientations are equally probable. In general, using an even number, N, of point light sources to give room for the camera to look vertically down as in Figure 13, the average 2-D orientation error will be 180/(8 x (N - 1)) degrees.
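The 2-D partitioning rule summarized above can be illustrated with a minimal sketch. It assumes N light sources evenly spaced from -90° to +90° about the viewing direction, as the 36° spacing for N = 6 suggests. The function names and the square-root-of-cosine reflectance used to generate test radiances are hypothetical stand-ins for any monotonic, isotropic diffuse reflectance; they are not the reflectance model of this paper.

```python
import math

def source_angles(n):
    """Evenly spaced source directions (degrees) from -90 to +90 (assumed layout)."""
    spacing = 180.0 / (n - 1)
    return [-90.0 + i * spacing for i in range(n)]

def simulate_radiances(normal_deg, n=6, exponent=0.5):
    """Hypothetical monotonic, non-Lambertian reflectance: cos(incidence) ** exponent."""
    radiances = []
    for a in source_angles(n):
        c = math.cos(math.radians(normal_deg - a))
        radiances.append(max(c, 0.0) ** exponent)
    return radiances

def partition_orientation(radiances):
    """Return (estimate, (lo, hi)) in degrees from the ordering of radiances only."""
    n = len(radiances)
    spacing = 180.0 / (n - 1)
    angles = source_angles(n)
    k = max(range(n), key=lambda i: radiances[i])   # brightest source
    if k == 0:                                      # extreme left source
        lo, hi = -90.0, -90.0 + spacing / 2
    elif k == n - 1:                                # extreme right source
        lo, hi = 90.0 - spacing / 2, 90.0
    elif radiances[k + 1] > radiances[k - 1]:       # brighter neighbour on the right
        lo, hi = angles[k], angles[k] + spacing / 2
    else:                                           # brighter neighbour on the left
        lo, hi = angles[k] - spacing / 2, angles[k]
    return 0.5 * (lo + hi), (lo, hi)

def average_error(n):
    """Average error in degrees, 180 / (8 (N - 1)), for uniformly distributed normals."""
    return 180.0 / (8 * (n - 1))
```

For a normal at 25° and N = 6, the brightest source is the one at 18° and its brighter neighbour is the one at 54°, so the normal is placed in the 18° to 36° partition and assigned 27°. Only the ordering of the radiances is used, so the result does not depend on the particular reflectance function chosen.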

Once  2-D  surface  orientation  is  determined  using  pho¬ 
tometric  partitioning,  the  light  sources  in  Figure  13  pro¬ 
vide  different  diffuse  reflectance  measurements,  sampling 
the  diffuse  reflectance  map  at  up  to  5  different  known  an¬ 
gles  of  incidence  for  each  surface  point.  At  each  visible 
surface  point  the  diffuse  reflectance  map  can  be  sampled 
for at least 3 angles of incidence between 0° and 90°. The maximum number of 5 samplings at different angles of incidence occurs for points with normals within 36° of the camera viewing orientation, and not parallel to any of the solid or jagged equiangular partition lines. Clearly, the number of samplings can be increased
with  more  available  light  sources  and  this  may  be  im¬ 
portant  for  surfaces  with  nonuniform  diffuse  reflectance 
properties  from  point  to  point.  For  a  surface  with  nearly 
uniform  diffuse  reflectance  from  point  to  point  and  hav¬ 
ing  a  diversity  of  visible  surface  normal  orientations,  the 
diffuse  reflectance  map  can  be  very  finely  sampled  not 
only  for  angle  of  incidence,  but  for  emittance  angle  as 
well. 
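The sample counts quoted above can be checked with a short sketch. It again assumes sources evenly spaced from -90° to +90°; the function name is hypothetical. A source contributes a sample only when its angle of incidence on the surface point is below 90°.

```python
def sampled_incidence_angles(normal_deg, n_sources=6):
    """Angles of incidence (degrees) below 90 for a 2-D normal at normal_deg.

    Assumes sources evenly spaced from -90 to +90 degrees about the viewing
    direction (an assumption about the layout in Figure 13).
    """
    spacing = 180.0 / (n_sources - 1)
    sources = [-90.0 + i * spacing for i in range(n_sources)]
    incidences = [abs(normal_deg - s) for s in sources]
    return sorted(a for a in incidences if a < 90.0)
```

Under these assumptions a normal at 10° is sampled at 5 angles of incidence, a normal at 40° at 4, and a normal at 80° at 3. For a normal lying exactly on a partition line, two of the sources share the same angle of incidence, so the number of distinct samples is lower.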

The  extension  of  the  configuration  in  Figure  13  to  3-D 
surface orientations is a bit more tedious but conceptually simple. We will shortly report experimental results
for  diffuse  reflection  from  different  dielectrics  using  the 
configuration  in  Figure  13. 

5  FUTURE  WORK  AND 
CONCLUSION 

In  this  paper  we  have  delved  deeply  into  the  physical  na¬ 
ture  of  diffuse  reflection  from  inhomogeneous  dielectric 
surfaces  and  have  found  that  the  Lambertian  description 
can  be  quite  simplistic  for  a  number  of  ordinary  surfaces. 
Diffuse  reflection  is  in  fact  a  very  rich  area  that  has  pre¬ 
viously  not  been  closely  studied  in  computer  vision.  It 
can take on many different properties dependent upon a number of optical parameters such as single scattering albedo and surface boundary roughness. We have studied diffuse reflection for the case of subsurface isotropic
scattering  and  this  is  indeed  applicable  to  a  wide  variety 
of  materials. 

There  are  surfaces  for  which  subsurface  anisotropic 
scattering  applies  (e.g.,  types  of  wood)  and  this  topic 
is currently being studied and will be reported in the near future. Lord Rayleigh investigated the scattering of light from particles according to the dipole moment that the electric field of incident light induces on the particle. This is dependent upon the polarizability, α, of the particle. The Rayleigh scattering phase function is axially symmetric and given by the following proportionality:

$$P(\cos\theta) \propto \alpha^{2}\,(1 + \cos^{2}\theta).$$

Such  an  anisotropic  scattering  phase  function  is  applica¬ 
ble  to  our  own  earth’s  atmosphere  and  is  fairly  accurate 
in  accounting  for  the  radiance  distribution  of  transmit¬ 
ted  sunlight  across  the  sky. 

The  diversity  of  the  nature  of  diffuse  reflection  should 
not  be  considered  a  hindrance  to  computer  vision  re¬ 
search.  On  the  contrary,  because  of  the  prevalence  of  this 
phenomenon  and  its  dependence  upon  important  optical 
parameters it provides a great opportunity for utilizing
diffuse  reflection  as  a  tool  for  extracting  a  number  of  ob¬ 
ject  features.  This  can  include  determination  of  surface 
roughness,  and  even  material  classification  from  mea¬ 
surement  of  the  diffuse  reflectance  map  to  derive  prop¬ 
erties  such  as  single  scattering  albedo  of  subsurface  par¬ 
ticle  inhomogeneities.  This  requires  the  development  of 
a  new  set  of  vision  methodologies  and  opens  up  a  whole 
new  subarea  of  research. 

For  the  many  varieties  of  diffuse  reflection  properties 
that  can  exist  we  have  identified  an  invariant,  namely 
the  monotonicity  assumption  that  reflected  diffuse  ra¬ 
diance  is  monotonic  with  respect  to  angle  of  incidence. 
We utilized this monotonicity assumption to develop a new photometric stereo methodology for shape determination which is generally applicable to a wide variety of isotropic diffuse reflecting surfaces. We have also proposed using the photometric partitioning method for simultaneous determination of shape and the diffuse reflectance map. Because photometric partitioning determines shape without knowledge of the diffuse reflectance map, tabulating the diffusely reflected values for known computed normals derives the diffuse reflectance map from multiple known incident light orientations. We are currently performing experimentation with this technique.

Abstract

The  spatiotemporal  (ST)  surface  has  been  shown  to  be  a 
useful  representation  of  projected  scene  dynamics.  Our 
previous use of this representation has focused on geometric recovery of scene static structure from the analysis of relative motions on the moving image plane. That earlier work exploited the implicit partitioning of motions along epipolar lines to enable search-free feature tracking and position estimation. The ST manifolds provide explicit information about feature 3D contiguity, and their use leads to the recovery of feature 3D position, object 3D contours, and scene 3D surfaces. We have recently turned our attention to the task of interpreting non-static scenes, tracking and estimating the motions of independently moving objects and background by their appearance and behavior on the ST surface. Selecting the most reliable
and  discriminating  information  in  the  scene,  the  system 
demonstrates  robust  feature  tracking  over  a  large  range 
of  feature  sizes  and  velocities.  When  coupled  with  the 
more  mature  Epipolar-Plane  Image  Analysis  system,  this 
motion  analysis  capability  will  enable  camera  solving,  dy¬ 
namics  tracking,  and  scene  reconstruction  within  a  uni¬ 
fied  framework. 

1  Image  Sequence  Tracking 

The  problem  of  tracking  particular  objects  through  a  se¬ 
ries  of  images  has  proved  to  be  a  challenging  one.  The 
most  common  tracking  techniques  include  edge  (or  more 
generally,  feature)  tracking,  centroid  tracking,  correla¬ 
tion  tracking,  and  gradient-based  optic  flow  analysis. 
Each suffers from significant disadvantages. Edge tracking is problematic because it is difficult to make a robust association between a particular group of edges¹ and the object being tracked. Centroid tracking is difficult for related reasons: there is no clear association between scene objects and computable centroids. Correlation tracking is problematic due to the changing aspect of the target with respect to the tracker; the tracked object can rotate while translating, changing its image appearance from one frame to the next. Gradient-based optic flow relates differential changes in reflectance with orientation or motion of surfaces in the scene. This relationship is approximate for short spatial or temporal baselines and quite inappropriate for long baselines.² Feature analysis has the advantage that it focuses processing at the most discriminable parts of the imagery with the greatest localization and provides robustness through lower sensitivity to projective difficulties such as occlusion and illumination effects.

* This paper appeared in the IEEE Motion Vision Workshop held in Princeton, New Jersey, in October 1991. The research described here has been supported by Fujitsu Systems Integration Laboratories most recently, with earlier support from DARPA.

¹ Some of which may be only artifacts of position or illumination.

This  paper  describes  our  efforts  at  utilizing  feature 
tracking  on  the  space-time  surface  (2)  for  motion  analy¬ 
sis.  The  principal  distinction  of  this  space-time-manifold 
approach  to  motion  analysis  is  that  it  unifies  the  rep¬ 
resentation  of  scene  features  over  space  and  time.  In 
feature  tracking  this  alleviates  the  major  difficulty  of 
feature-based  analysis  -  the  correspondence  problem. 
For  EPI  analysis  it  will  resolve  independent  motions 
within  the  same  framework  as  solving  for  scene  geometric 
structure.  In  discussing  our  approach  to  motion  analy¬ 
sis  we  will  begin  by  summarizing  our  earlier  research  in 
recovering  scene  structure  from  motion  (Epipolar-Plane 
Image  (EPI)  Analysis,  described  in  (3)  and  (4)),  connect 
the  techniques  used  there  with  the  more  general  problem 
of  unknown  scene  dynamics,  and  then  discuss  our  use  of 
the  ST  surface  for  motion  tracking. 

1.1  ST  Manifolds  for  Scene  Structure 

In  the  scene  reconstruction  task,  our  use  of  the  spa¬ 
tiotemporal  surface  involved  tracking  features  as  they 
moved  under  known  constraints  in  space-time,  and  ap¬ 
proximating  and  maintaining  estimates  of  their  positions 
through  the  sequence.  The  approach  bridged  the  usual 
dichotomy  of  depth  sensing  in  that  its  large  number  of 
images  led  to  a  large  baseline  and  thus  high  accuracy, 
while  rapid  image  sampling  gave  minimal  change  from 
frame  to  frame  and,  with  camera  knowledge,  eliminated 
the  correspondence  problem.  Within  this  framework,  we 
generalized from the traditional notion of epipolar lines to that of epipolar planes. We then formulated a tracking process that exploited the above constraints in determining the position of features in the scene.

² Notice the mapping and resampling necessary in Heel's work (9) to make optic flow coherent across time.

Our  tracker  was  a  sequential  linear  estimator,  imple¬ 
mented  as  a  Square  Root  filter  without  the  extrapolation 
phase.  Extrapolation  was  unnecessary  since  the  cam¬ 
era  constraints  and  the  space-time  surface  told  us  where 
each feature moved from frame to frame (there was no ‘aperture problem’). The work of Matthies (10) had similarities to ours in its pursuit of scene depth from the
analysis  of  image  sequences,  but  lacked  several  impor¬ 
tant  elements,  including  the  generality  with  respect  to 
view  angle  that  came  with  our  use  of  a  line-of-sight  for¬ 
mulation,  the  explicit  use  of  spatial  connectivity  that 
provides  higher-level  scene  contour  descriptors,  and  our 
match-free  tracking. 

While  simplifying  the  problem  through  the  use  of  three 
assumptions  -  the  camera  movement  was  linear,  its  po¬ 
sition  and  attitude  were  known,  data  capture  was  suffi¬ 
ciently  rapid  that  the  imagery  was  temporally  coherent  - 
we  developed  a  system  that  could  a)  work  for  any  camera 
attitude,  b)  acquire  images  at  varying  rates,  c)  operate 
sequentially  in  time,  and  d)  provide  spatially  coherent 
results  —  3D  contours. 

Critical  to  this  was  the  development  of  a  unique  pro¬ 
cess  that  constructed,  in  parallel  as  the  frames  were 
obtained,  a  3D  space-time  description  of  the  evolving 
imagery.  This  specifies  fully  the  temporal  and  projec¬ 
tive  relationships  between  scene  objects  and  the  sensor. 
When  the  scene  was  stationary  and  observed  by  a  mov¬ 
ing  camera,  the  representation  provided  simple,  direct 
and  robust  estimates  of  scene  structure.  In  extending 
the  analysis  of  the  ST  surface  representation  for  more 
general  dynamic  analysis,  the  major  difference  is  that 
we  cannot  rely  on  known  camera  motion  for  our  track¬ 
ing,  but  must  actually  do  the  matching  -  addressing  the 
correspondence  problem.  One  of  the  benefits  of  the  ST 
manifold  is  that  it  greatly  simplifies  this  problem. 

1.2  ST  Manifolds  for  Scene  Dynamics 

Several  issues  arise  in  the  move  from  static  to  dynamic 
scenes.  Since  we  have  to  decouple  sensor-induced  mo¬ 
tion  from  scene  motion,  we  must  be  able  to  solve  for  the 
camera. To distinguish moving from stationary objects, we must be able to discriminate real from sensor-induced motion (moving objects versus the background); that is, we must be able to model the scene static structure.
Motion  analysis  and  scene  reconstruction  should  oper¬ 
ate  together,  with  the  estimated  scene  geometry  aiding 
in  the  camera  solving  (using  known  stationary  features) 
and  being  used  to  discriminate  object  motion  (by  pro¬ 
viding  a  ‘background’). 

The  approach  we  have  taken  to  motion  tracking  is 
built  on  our  scene  structure  estimation  process  within 
this unified framework. Its processing is based upon a multi-stage scheme involving feature detection, selection, grouping, and motion classification. First, we represent the spatiotemporal structure of scene dynamics; this is handled by the ST manifold. Using a localization measure on the space-time surface, we then isolate features of interest. Propagating from maxima of the localization measure, we determine the paths of features through time. Finally, a simple linear estimator characterizes feature velocity.

Feature  ‘edges’  detected  in  2D  images  become  sur¬ 
face  ‘facets’  in  3D.  The  connectivity  of  these  facets  gives 
us  our  tracking  mechanism.  We  described  earlier  (2) 
how  we  locate  and  parameterize  those  individual  3D  el¬ 
ements  —  the  facets  —  and  structure  them  together 
through  time.  In  brief,  we  define  the  manifolds  (sheets  in 
space-time)  that  separate  image  features.  These  mani¬ 
folds  are  2D  surfaces  embedded  in  the  3D  space-time 
dimensions  of  our  data,  and  are  positioned  at  the  ex¬ 
trema  of  smoothed  brightness  gradient  in  the  imagery 
-  zero  crossings  of  the  Laplacian  of  a  Gaussian  (LOG). 
By  following  localizable  ‘features’  on  these  surfaces  we 
track  them  in  time.  The  following  sections  describe  our 
use  of  these  features  and  provide  details  of  our  tracking 
method. 

Although we have processed a variety of image sequences with this tracking process, display of detailed
analyses  of  large  data  sets  is  difficult  in  panchromatic 
reproduction.  Our  displays  here  will  be  limited  to  sim¬ 
ple  local  indications  of  the  processing  -  more  detail  will 
be  presented  at  the  workshop,  including  various  displays 
of  the  moving  data;  space-time  surface  building;  repre¬ 
sentation  of  space-time  surface  localization;  the  trackers; 
the  extracted  ‘interesting’  features;  superposition  of  ret¬ 
icles  grouping  observations  on  individual  images  in  the 
sequences;  and  extraction  of  movement  from  the  ‘back¬ 
ground.’ 

2  Selecting  Tracking  Features 

Figure 1 left shows frames from a synthetic motion sequence of a rotating square. The motion is described by the zero-crossings of a 3D LOG over these data,³ as shown in Figure 2, with time progressing out of the figure.⁴ Figure 2 right shows a side view of these surfaces, oriented so that the temporal structure is more visible. What should be noticed is that the connectivity captured by the surface-building process is an explicit representation and grouping of the motion in the scene. A spatial cut through the 3D zero crossings would produce a 2D spatial, single-image feature description. The temporal facets provide connectivity information for the space-time dimension of the data.

³ 3D convolution is a standard means of incorporating temporal information (for example, see (1), (6) and (8)). In general, others have not attempted to utilize or represent the temporal zero crossings.

⁴ It is true; many of the figures presented here are so small and detailed as to seem unintelligible. Being of 3D data, larger single figures provide considerably less to appreciate and, in fact, viewing these in stereo gives very good assessment.


Fig. 1: Frames of Rotating Square

Tracking requires determining the correspondence between features in successive views. If we know the direction in which an object is moving (or conversely, the
direction  in  which  the  sensor  is  moving  through  a  static 
scene),  then  we  can  use  this  knowledge  in  determining 
their  positions  in  space  (as  demonstrated  by  our  EPI 
work).  On  the  other  hand,  if  we  have  no  knowledge  of 
the  motion  of  the  sensor,  or  if  the  scene  can  contain 
objects  exhibiting  independent  motion,  then  these  con¬ 
straints  do  not  apply.  To  track  a  feature  we  must  be 
able  to  recognize  it  from  frame  to  frame  and  distinguish 
it  from  the  other  features  around  it  (this  raises  the  aper¬ 
ture  problem).  Only  a  small  percentage  of  the  features 
we  have  selected  with  our  3D  detection  process  can  be 
adequately  distinguished  for  this.  For  example,  if  the  ob¬ 
ject  to  be  tracked  happens  to  have  a  square  shape,  then 
the  only  discriminable  parts  of  it  will  be  the  corners.  We 
must  determine  a  measure  to  use  on  the  images  to  locate 
features  that  are  discriminable  —  features  that  can  be 
reliably  tracked  from  frame  to  frame. 


Fig.  2:  Spatiotemporal  Surfaces 


2.1  The  Autocorrelation  Function 

The  autocorrelation  function  -  convolving  a  small  win¬ 
dow  of  the  image  over  some  larger  subset  -  provides  such 
a  measure.  Where  the  window  and  the  subset  are  identi¬ 
cally  aligned,  the  convolution  will  indicate  a  high  corre¬ 
lation;  elsewhere,  the  correlation  will  be  poorer.  A  uni- 
modal  and  highly  peaked  autocorrelation  distribution  in¬ 
dicates  good  localization,  whereas  a  flat  profile  indicates 


ambiguity.  Autocorrelation  is  quite  expensive  to  com¬ 
pute,  involving  evaluation  of  order  mn  at  every  location 
in  the  image  subset.  Interpreting  the  autocorrelation 
structure  is  problematic. 


2.2 Förstner’s Measure

A variety of corner-detecting analogues to autocorrelation have been suggested, and we work with one developed by Förstner (7). Here, a simple measure based on a quotient of the determinant and the trace of the covariance matrix related to a planar fit to the window specifies the localizability of the feature at the center of the window. In its full development, the measure determines a confidence ellipse in which the feature can be expected to be localized. Three parameters of the measure define the major and minor axes of the ellipse and its orientation. These three parameters are mapped to a single value (FM).
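A determinant-over-trace localization measure of this kind can be sketched as follows. The sketch is illustrative only: it assumes the matrix is accumulated from products of central-difference image gradients over a square window, which is one common reading of such measures. The function names, window size, and test image are hypothetical, and neither the confidence ellipse nor the gradient normalization described in this paper is reproduced.

```python
def square_image(size=20, lo=5, hi=14):
    """Hypothetical test image: a bright square on a dark background."""
    return [[1.0 if lo <= x <= hi and lo <= y <= hi else 0.0
             for x in range(size)] for y in range(size)]

def gradients(img):
    """Central-difference gradients; border pixels are left at zero."""
    h, w = len(img), len(img[0])
    gx = [[0.0] * w for _ in range(h)]
    gy = [[0.0] * w for _ in range(h)]
    for y in range(1, h - 1):
        for x in range(1, w - 1):
            gx[y][x] = 0.5 * (img[y][x + 1] - img[y][x - 1])
            gy[y][x] = 0.5 * (img[y + 1][x] - img[y - 1][x])
    return gx, gy

def localization_measure(img, cx, cy, half=2):
    """det / trace of the windowed gradient-product matrix centred at (cx, cy)."""
    gx, gy = gradients(img)
    sxx = sxy = syy = 0.0
    for y in range(cy - half, cy + half + 1):
        for x in range(cx - half, cx + half + 1):
            sxx += gx[y][x] * gx[y][x]
            sxy += gx[y][x] * gy[y][x]
            syy += gy[y][x] * gy[y][x]
    trace = sxx + syy
    if trace == 0.0:
        return 0.0            # no gradient at all: nothing to localize
    return (sxx * syy - sxy * sxy) / trace
```

On the test square, the measure is large at a corner such as (5, 5) and zero both at the middle of a side such as (5, 10), where gradients point in a single direction, and in the flat interior.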

Figure  3  shows  a  square  and,  beside  it,  the  image  of 
its  FM.  Figure  4  shows  an  amplified  sampling  of  the  lo¬ 
calization  measures  for  this  image,  and  similar  measures 
for  the  image  rotated.  These  are  oriented  ellipses  whose 
minor  axis  indicates  the  most  reliable  localization  direc¬ 
tion  and  whose  axis  magnitudes  show  the  quality  of  the 
localization. 

We have used two modifications to Förstner’s measure in this work. We do not perform the center-of-gravity refinement, and we normalize the measure by the local
gradient.  The  center-of-gravity  computation  improves 
the  reliability  of  the  estimation,  especially  in  the  vicinity 
of  sharp  corners  where  the  simpler  measure  can  produce 
dual  peaks  to  the  sides  rather  than  a  stronger  peak  at 
the  vertex.  Since  we  wish  to  develop  a  local  computation 
mechanism  for  tracking,  we  are  forgoing  this  correction 
in  our  initial  studies. 


Fig. 3: Image Förstner Measure


Fig. 4: Localization Ellipses

Figure 5 shows the structure of the space-time surface at the top right corner of the data, with dotted lines indicating the temporal connectivity. Figure 6 shows the relative strengths of the FM along this rotating corner, coded with dots and solid lines in increasing strength. It is clear that the corners of the square are the discriminable features, and the sides are poorly localized.


Fig.  5:  Top  Right  Corner 


2.3  Tracking  and  Velocity  Estimation 

The  display  in  Figure  6  shows  the  features  we  will  track 
—  it  does  not  show  an  actual  tracking  of  features.  Relat¬ 
ing  the  observations  indicated  together  over  time  must 
still  be  demonstrated.  In  forming  these  observations  into 
unified  trackers,  we  connect  local  maxima  of  these  mea¬ 
sures.  The  velocity  estimation  will  then  occur  at  this 
level of the analysis: as FM-maxima observations are associated through time, a sequential filter will be updated and refined with the new information.

2.3.1 Feature Tracking

The  maxima  can  move  in  any  direction  between  frames  of 
a sequence, and can move an arbitrary distance depending on the velocity (translational and rotational) of the
objects  of  which  they  are  part.  This  means  that  while  the 
spatiotemporal  surface  can  be  defined  by  feature  connec¬ 
tivity,  and  a  particular  maximum  will  be  seen  to  move 
along  a  single  spatiotemporal  surface,  the  matching  of 
maxima  cannot  be  defined  strictly  on  the  basis  of  prox¬ 
imity.  For  one  thing,  they  need  not  be  adjacent  from 
frame  to  frame;  for  another,  if  maxima  are  fairly  dense, 
then  there  may  be  significant  difficulty  in  unambiguous 
assignment  when  they  come  close  to  one  another.  With 
large  motions  or  repeated  fine  patterns,  accurate  track¬ 
ing  could  be  difficult. 

As  well  as  being  accurate,  our  tracking  mechanism 
must  be  designed  to  work  within  the  framework  of  the 
surface-building  process  -  it  must  be  able  to  operate  at  a 
local  level  and  be  amenable  to  parallel  implementation. 
The  tracking  mechanism  we  have  designed  satisfies  these 
criteria.  It  uses  a  propagation  mechanism,  with  each 
maximum  at  time  T  spreading  itself  forward  to  neigh¬ 
bors  on  the  spatiotemporal  surface  and  each  maximum 
at time T + 1 reaching back to neighbors on the spatiotemporal surface to see if there is a maximum from which it may have descended. When only one predecessor can be found, the tracking assignment can simply use this pairing, and can deduce that whatever the history of the feature at time T, this new observation at time T + 1 shares that history and affects the estimation of that motion. When there are multiple choices
for  the  assignment  then  an  adjudication  must  be  made 
to  select  the  most  likely.  The  tracking  process  and  the 
adjudication  are  described  in  the  next  section. 
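The assignment logic just described can be sketched in a few lines. This is an illustration under assumptions, not the implementation used here: the connectivity over the spatiotemporal surface is taken as given in a hypothetical `links` dictionary, and the cost values and `margin` used for adjudication are placeholders for the combination of FM value, surface normal, velocity, and distance.

```python
def associate(links):
    """Split maxima at T + 1 into unambiguous and contended assignments.

    links maps each maximum at T + 1 to the list of maxima at T that it can
    reach back to over the spatiotemporal surface (hypothetical structure).
    """
    successors = {}
    for cur, preds in links.items():
        for p in preds:
            successors.setdefault(p, []).append(cur)
    assigned, contended = {}, {}
    for cur, preds in links.items():
        if len(preds) == 1 and len(successors[preds[0]]) == 1:
            assigned[cur] = preds[0]        # only one at each time is considered
        elif preds:
            contended[cur] = list(preds)    # needs adjudication
    return assigned, contended

def adjudicate(costs, margin=0.2):
    """Pick the predecessor with the lowest cost if it is clearly better.

    costs maps candidate predecessors to a hypothetical combined cost.
    Returns None when no candidate wins by more than margin, in which case
    all contending tracks would be retained for a later decision.
    """
    ranked = sorted(costs.items(), key=lambda kv: kv[1])
    if len(ranked) == 1:
        return ranked[0][0]
    (best, c0), (_, c1) = ranked[0], ranked[1]
    return best if c1 - c0 > margin else None
```

For example, with `links = {"b1": ["a1"], "b2": ["a2", "a3"], "b3": ["a3"]}`, only `b1` is assigned directly; `b2` and `b3` both contend for `a3` and are passed to adjudication.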

2.3.2  Tracking  Maxima  on  ST  Surfaces 

The  principal  intermediary  in  the  tracking  propagation  is 
the  set  of  temporal  facets.  When  a  contour  is  stationary, 
its spatial facets will be adjacent over time; there will be no spatial motion requiring temporal representation. However, with spatial motion of a contour, the temporal facets serve to join these observations. At time T + 1, the spatiotemporal surface from T to T + 1 must be built, the local FM maxima determined, and these then associated with previous FM maxima at T via the intervening temporal facets.

Several  considerations  are  involved  in  establishing 
these  temporal  associations.  If  there  is  no  ambiguity  in 
the  assignment  (only  one  at  each  time  is  being  consid¬ 
ered  for  matching  with  the  other),  then  the  association 
is  made  and  the  tracking  propagated.  If  more  than  one 
is in contention, then the values of the FM, the local spatial normal to the ST surface,⁵ the established velocity, and the distance of motion are all considered. If one pairing is clearly better in position, orientation, and localization measure, then it is chosen; otherwise multiple pairings are maintained as being possible. In the latter case, all contending tracks will be retained, with the final judgement being made later when more knowledge of the behavior of the features is available and a choice minimizing the ambiguity is possible. Another measure being evaluated is a correlation score (SSD) evaluated at contending ST facets. Figure 7 shows the tracking results for this rotating square.

⁵ Notice that the 3D surface normal also suggests the direction of feature motion, as would the eigenvector of the largest eigenvalue of a 3D version of Förstner’s measure.


2.3.3  Motion  Interpretation 

Given  a  single  feature  moving  in  an  image  sequence,  the 
best  we  can  do,  without  other  information  (such  as  stereo 
or  a  DTM),  is  to  determine  its  velocity  in  an  image-based 
coordinate  system.  In  our  demonstrations  here  we  will 
estimate  only  this  image-plane  velocity.  As  an  initial 
approximation we will model feature motion as piecewise linear in time, that is, piecewise constant velocity.⁶

Image-plane  motion  is  determined  by  solving  a  system 
of  two  linear  equations  defining  the  velocity  vector  in 
space-time.  A  sequential  least-squares  filter  is  set  up  for 
this  estimation,  and  velocities  are  determined  for  each 
feature  tracked.  Velocity  is  a  relative  measure,  and  its 
interpretation  depends  on  the  activity  around  it.  We 
must  determine  from  the  image-plane  velocities  which 
features  are  in  motion  with  respect  to  others  and  which 
should  be  considered  of  importance  for  tracking  -  we 
must  establish  a  background  frame  for  velocity  reference. 
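A constant-velocity estimate that is refined as each observation arrives can be sketched with running sums of the normal equations, one line fit per image coordinate. This is a minimal illustration with hypothetical class names; it is a plain sequential least-squares fit and does not reproduce the square-root filter formulation mentioned earlier.

```python
class SequentialLineFit:
    """Sequential least-squares fit of x(t) = x0 + v * t from running sums."""

    def __init__(self):
        self.n = 0
        self.st = self.stt = self.sx = self.stx = 0.0

    def update(self, t, x):
        self.n += 1
        self.st += t
        self.stt += t * t
        self.sx += x
        self.stx += t * x

    def estimate(self):
        """Return (x0, v), or None until two distinct times have been seen."""
        det = self.n * self.stt - self.st * self.st
        if self.n < 2 or det == 0.0:
            return None
        v = (self.n * self.stx - self.st * self.sx) / det
        x0 = (self.sx - v * self.st) / self.n
        return x0, v


class FeatureTrack:
    """Image-plane velocity of one tracked feature (hypothetical container)."""

    def __init__(self):
        self.fit_x = SequentialLineFit()
        self.fit_y = SequentialLineFit()

    def observe(self, t, x, y):
        self.fit_x.update(t, x)
        self.fit_y.update(t, y)

    def velocity(self):
        ex, ey = self.fit_x.estimate(), self.fit_y.estimate()
        if ex is None or ey is None:
            return None
        return ex[1], ey[1]
```

Feeding a track the positions (3 + 2t, 10 - 0.5t) for t = 0, ..., 4 yields an image-plane velocity of (2, -0.5) pixels per frame. A piecewise-constant model could restart the sums whenever the residuals grow, although that step is not shown.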

Since  observed  velocity  depends  on  range,  the  geomet¬ 
ric  structure  of  the  scene,  if  known,  can  be  used  to  distin¬ 
guish  moving  features  from  the  background.  Our  earlier 
EPI  work  will  provide  the  depth  information  for  static 
components  of  the  scene  when  it  is  integrated  with  this 
tracking  system.  In  the  meantime,  our  determination  of 
the  ‘background’  for  these  studies  has  been  quite  simple 
—  we  presume  it  to  be  planar  and  select  as  ‘interesting’ 
features  those  lying  outside  of  one  standard  deviation 
from  that  estimated  plane. 

In  the  interests  of  demonstrating  some  preliminary 
object-like  groupings,  in  the  video  demonstrations  we 
isolate  features  moving  with  respect  to  the  ‘background’ 
and group together those which are spatially connected. The reticles displayed over the imagery, indicating tracked features, enclose these grouped features. This demonstrates a primitive form of object tracking: features are tracked together when they are seen to be spatially related. Similar results were obtained by grouping features using their projective velocities. Our objective is to use behavior (dynamics) and shape (statics) to couple object tracking with object identification.

⁶ See (5) for inferring a rotating and translating 3D model of object motion.

3  Experimental  Results 

Figure  8  shows  several  frames  of  part  of  a  low  resolution 
IR  sequence  of  a  moving  car*  -  notice  the  vehicle  mov¬ 
ing  down  to  the  left  while  the  background  slides  right 
because  of  a  camera  panning  action.  Figure  9  shows  the 
major-contrast  ST  manifolds  from  these  data.  The  next 
figure  presents  a  side  view,  where  the  dots  show  tempo¬ 
ral  surface  elements  joining  the  solid  lines  of  the  spatial 
zero  crossings.  Figure  11  shows  the  final  estimations, 
depicting  velocities  as  dots  through  dashes  to  solid  lines, 
in  units  of  velocity  standard  deviation. 


Figure  8:  Moving  Car 


Figure 9: Spatiotemporal Surfaces


Figure 10: Side View of ST Surfaces

*These data are part of the workshop database.

Figure 11: Velocity Estimates


3.1  Tracking  Parameters 

Recall  that  our  first  filtering  of  the  imagery  locates  zero 
crossings  of  the  Laplacian  of  a  Gaussian  (LOG)  of  the 
imagery.  This  operation  selects  the  locally-largest  inten¬ 
sity  differences  —  the  ‘edges’  in  the  images.  The  size 
of  this  filter  determines  the  range  of  contrasts  and  the 
spatial  extent  of  these  edges.  Although  in  this  study 
we  have  selected  large  values  for  the  filter  size  to  facil¬ 
itate  display,  these  should  be  determined  automatically 
by  the  tracking  process  -  and  determined  differentially 
across  the  imagery  depending  on  the  character  of  the 
features observed.*

From  the  set  of  edges  determined  by  the  LOG  oper¬ 
ation,  we  select  those  whose  gradients  are  greater  than 
one  standard  deviation  from  the  mean  of  gradients  over 
all  contours.  This  ensures  that  the  features  (edges)  have 
‘significance,’  i.e.,  are  not  likely  to  be  artifacts  of  the 
detection process or noise. From among these
gradient-selected features we choose those which are most
localizable. In selecting from among the most localizable
features that are tracked for those we wish to consider of
‘interest,’ we have again used some a priori assumptions.
Notable  among  these  is  the  assumption  that  such  fea¬ 
tures  are  moving  with  respect  to  the  background  (our 
EPI  analysis  will  handle  the  converse).  We  also  require, 
for  reliable  tracking,  that  we  have  enough  observations  of 
a  feature  to  enable  an  accurate  and  consistent  estimate 
of  its  velocity  to  be  made.  This  means  that  we  discard 
(for  our  displays)  tracked  features  that  are  not  viewed 
for  a  sufficient  duration. 
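
Two of the selection rules above (gradient significance and sufficient viewing duration) can be sketched as simple boolean masks. The function name `select_features`, the threshold `min_obs`, and the reading of "greater than one standard deviation from the mean" as "above the mean plus one standard deviation" are assumptions; the localizability and motion tests are not modeled here.

```python
import numpy as np

def select_features(grad_mag, n_obs, min_obs=5):
    """Return a boolean mask of features that pass both tests.

    grad_mag : gradient magnitude of each edge feature
    n_obs    : number of frames in which each feature was observed
    min_obs  : hypothetical minimum track length for a reliable velocity
    """
    g = np.asarray(grad_mag, dtype=float)
    n = np.asarray(n_obs)
    strong = g > g.mean() + g.std()    # one standard deviation above the mean
    long_enough = n >= min_obs         # discard briefly viewed features
    return strong & long_enough

# Illustrative values only.
mask = select_features([1, 1, 1, 1, 10, 10], [9, 9, 9, 9, 9, 2])
```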

3.2  Performance  Issues 

The  current  surface-building  process  constructs  the  spa- 
tiotemporal  representations  of  the  imagery  at  a  rate  of 
about 1000 pixels per second. Evaluation of Förstner’s
measure  and  the  tracking  of  FM  maxima  reduce  this  to 
roughly  500  pixels  per  second.  Gaussian  and  Laplacian 
convolution  are  not  included  in  these  figures  since  we 
compute  the  filtered  images  in  an  off-line  fashion  before 
studying  a  data  set.  These  convolution  computations  are 

*We  are  addressing  the  notion  of  scale  filtering  on  the  scale- 
space  manifold,  and  are  finding  that  analysis  over  a  range  of 
resolutions  can  lead  to  selecting  the  ‘best’  filter  size  for  each 
feature. 


quite  simple,  however,  as  they  are  decomposable  into  a 
total of eleven 1D convolutions, and could be computed
in  a  realtime  pipeline.  The  tracking  has  been  designed 
with  parallel-processing  in  mind,  and  most  computa¬ 
tions  require  only  a  small  local  support.  Such  paral¬ 
lelism  could  provide  sufficient  performance  for  real-time 
analysis. 

3.3  Concluding  Remarks 

Our intended use of this tracking process begins with
estimation of the dynamics of objects in motion and their
subsequent  recognition  based  on  behavioral  and  shape 
characteristics.  We  will  also  be  integrating  the  tracker 
with  the  original  EPI  analysis.  Features  determined  to 
be  stationary  will  be  used  for  camera  solving,  and  this 
will  enable  processing  of  non-linear  camera  motions.  In 
an  intriguing  combination  of  the  two,  we  are  investigat¬ 
ing  use  of  derived  groupings  and  rigid  motion  interpre¬ 
tations  to  run  the  EPI  analysis  in  reverse  over  the  space- 
time  surfaces  (using  inverses  of  the  observed  motion  pa¬ 
rameters),  and  compute  the  3D  shape  of  tracked  objects 
even  while  they  are  undergoing  independent  and  arbi¬ 
trary  motion. 

The  important  element  to  note  in  this  work  is  not  that 
we  can  track  features  through  a  sequence  -  there  are  a 
variety of techniques that can do this, more or less
successfully - but that we can utilize the ST surfaces to
track and estimate feature motions, distinguish moving
from stationary objects, and inform EPI analysis of
features to use for camera solving in its geometric recovery.
This is a critical component of enabling EPI analysis to
be used on non-linear motion trajectories through
dynamic scenes.
Abstract

Recovering  scene  geometry  and  camera  motion 
from  a  stream  of  images  is  an  important  prob¬ 
lem  in  computer  vision.  If  the  scene  geome¬ 
try  is  specified  by  depth  measurements,  that  is, 
by  camera-to-scene  distances,  noise  sensitivity 
worsens  rapidly  with  increasing  depth. 

In this paper, we show that this difficulty can
be  overcome  by  computing  scene  geometry  di¬ 
rectly  in  terms  of  shape,  that  is,  by  comput¬ 
ing  the  coordinates  of  feature  points  in  the 
scene  with  respect  to  a  world-centered  system, 
without  recovering  camera-centered  depth  as 
an  intermediate  quantity.  More  specifically,  we 
show  that  a  matrix  of  image  measurements  can 
be  factored  by  Singular  Value  Decomposition 
into  the  product  of  two  matrices  that  represent 
shape  and  motion,  respectively. 

After  describing  this  factorization  method,  we 
extend  it  to  deal  with  feature  occlusions.  We 
demonstrate  the  accuracy  and  robustness  of  the 
method with a series of experiments on laboratory
as well as outdoor streams, with and without
occlusions.

1  Introduction 

Recovering  scene  geometry  and  camera  motion  from  a 
stream  of  images  is  an  important  problem  in  computer 
vision. It admits a solution [Ull79], [RA79] for perfect
images,  but  is  very  sensitive  to  noise  [6JT90].  In  this  pa¬ 
per,  we  observe  that  this  sensitivity  is  due  in  part  to  the 
representation  of  shape  as  a  depth  map,  and  show  that 
reformulating  the  problem  in  world-centered  coordinates 
leads  to  a  simpler  and  better-behaved  solution. 

In Ullman’s proof of existence of a solution [Ull79],
as  well  as  in  the  perspective  formulation  in  [RA79],  the 
coordinates  of  feature  points  in  the  world  are  expressed 
in  a  world-centered  coordinate  system. 

However,  this  choice  has  been  replaced  by  most  com¬ 
puter  vision  researchers  with  that  of  a  camera-centered 
representation of shape [Pra80], [BH83], [TH84], [Adi85],
[WW85], [BBM87], [HHN88], [HJ89], [Hee89], [MKS89],
[SA89], [BCC90]. With this representation, the position
of  feature  points  is  specified  by  their  image  coordinates 
and  by  their  depths,  defined  as  the  distances  between  the 
camera center and the feature points, measured along the
optical  axis. 

Unfortunately,  although  a  camera-centered  represen¬ 
tation  simplifies  the  equations  for  perspective  projection, 
it  makes  shape  estimation  harder  and,  for  increasingly 
distant  scenes,  eventually  impossible.  This  is  due  to  two 
reasons.  First,  the  computation  of  shape  via  depth  is 
very  sensitive  to  noise  for  remote  objects:  since  even 
large  changes  in  depth  produce  small  changes  in  the  im¬ 
age,  computing  small  depth  differences  from  image  vari¬ 
ations  is  virtually  impossible  with  any  amount  of  image 
noise. 

Second,  as  the  camera  moves,  the  camera-centered  fea¬ 
ture  coordinates  change.  This  leads  to  the  difficult  prob¬ 
lem  of  relating  depth  values  in  different  camera  coordi¬ 
nate  systems  to  each  other  in  the  presence  of  motion 
uncertainties  (see  for  instance  [MKS89],  [Hee89]). 

In  this  paper,  we  show  that  both  difficulties  disap¬ 
pear  if  feature  coordinates  are  expressed  with  respect  to 
a  world-centered  frame.  With  this  formulation,  object- 
centered  shape  can  be  linked  to  image  motion  directly, 
without  using  depth  as  an  intermediate  quantity.  Fur¬ 
thermore,  the  mutual  independence  of  shape  and  mo¬ 
tion  in  world-centered  coordinates  makes  it  possible  to 
cast  structure-from-motion  as  the  problem  of  factoring 
a  matrix  of  image  measurements  into  the  product  of  two 
matrices  that  represent  respectively  the  camera  rotation 
and  the  shape  of  the  scene. 

More  specifically,  an  image  stream  can  be  represented 
by  a  2F  x  P  measurement  matrix  which  gathers  the 
horizontal  and  vertical  coordinates  of  P  points  tracked 
through F frames. If image coordinates are measured
with  respect  to  their  centroid,  we  prove  the  following 
rank  theorem:  under  orthography,  the  measurement  ma¬ 
trix  is  of  rank  3.  As  a  consequence  of  this  theorem,  we 
show  that  the  measurement  matrix  can  be  factored  into 
the  product  of  two  matrices  of  size  2F  x  3  and  3  x  P, 
respectively,  where  the  first  matrix  encodes  camera  ro¬ 
tation,  the  second  shape. 

The  rank  theorem  captures  precisely  the  nature  of  the 
redundancy of an image stream, and allows dealing with
a  large  number  of  points  and  frames  in  a  conceptually 
simple  and  computationally  efficient  way  to  reduce  the 
effects  of  noise.  The  resulting  algorithm  is  based  on 
the  Singular  Value  Decomposition,  which  is  numerically 
well-behaved  and  stable. 


We  first  introduced  this  factorization  method  in 
[TK90],  where  we  treated  the  simple  case  of  single¬ 
scanline  images  in  a  flat,  two-dimensional  world.  We  now 
develop  the  idea  into  a  working  system  for  arbitrary  cam¬ 
era  motion  in  three  dimensions  and  full  two-dimensional 
images. 

An  approach  related  to  ours,  but  using  a  different  for¬ 
malism,  appeared  in  [DA90].  Debrunner  and  Ahuja  sup¬ 
ply  both  closed-form  expressions  for  shape  and  motion 
and an incremental solution (one image at a time). The
price  they  pay  for  these  advantages  is  to  assume  that  mo¬ 
tion  is  constant.  It  will  be  interesting  to  compare  how 
easily  our  approach  and  theirs  can  be  generalized. 

In  the  next  section  we  show  how  to  build  the  mea¬ 
surement  matrix  from  an  image  stream,  prove  that  the 
measurement  matrix  is  of  rank  3,  and  show  how  to  use 
this result to factor the measurement matrix into shape
and camera rotation. Section 3 describes an experiment
on a real image stream.

This  stream  was  produced  under  carefully  controlled 
laboratory  conditions.  However,  lab  images  are  too  clean 
to  be  fully  realistic.  Furthermore,  only  features  that  are 
visible in all the frames are used in the stream of section
3. In reality, as the camera moves, features can appear
and  disappear  from  the  image,  because  of  occlusions. 
This  phenomenon  is  frequent  enough  to  make  a  shape 
and  motion  computation  method  unrealistic  if  it  cannot 
deal  with  it. 

In  sections  4  and  5  we  address  these  concerns  by  ex¬ 
tending  the  factorization  method  to  streams  with  occlu¬ 
sions  and  by  testing  the  method  with  outdoor  streams 
recorded  on  videotape  with  an  amateur  camera.  Specif¬ 
ically,  in  section  4  we  introduce  the  extension  to  occlu¬ 
sions  and  we  test  it  with  another  lab  experiment.  Finally, 
in  section  5  we  show  the  results  for  two  outdoor  video¬ 
tape  streams,  one  in  which  occlusions  are  negligible,  the 
other  in  which  they  are  dominant. 

2  The  Factorization  Method 

In  the  next  subsection  we  show  how  to  represent  an  im¬ 
age  stream  as  a  matrix  of  image  feature  coordinates. 
We  then  introduce  the  main  result  on  the  rank  of  the 
measurement matrix in the absence (subsection 2.2) and
presence  (subsection  2.3)  of  noise.  Subsection  2.4  shows 
that  the  motion  and  shape  result  is  essentially  unique, 
and  subsection  2.5  summarizes  the  factorization  method. 

To  track  features  from  frame  to  frame,  we  used  a 
method  based  on  [LK81]  which  we  extended  to  allow 
for  the  automatic  selection  of  features.  The  description 
of  both  detection  and  tracking  are  beyond  the  scope  of 
this  paper. 

2.1  The  Measurement  Matrix 

If we track P feature points over F frames in the image
stream, we obtain a stream of image coordinates
$\{(u_{fp}, v_{fp}) \mid f = 1, \ldots, F,\ p = 1, \ldots, P\}$.

The horizontal feature coordinates $u_{fp}$ are written into
an F x P matrix U: there is one row per frame, and one
column per feature point. Similarly, an F x P matrix V
is built from the vertical coordinates $v_{fp}$.

The rows of the matrices U and V are then registered
by subtracting from each entry the mean of the entries
in the same row:

$$
\tilde{u}_{fp} = u_{fp} - a_f, \qquad
\tilde{v}_{fp} = v_{fp} - b_f, \tag{1}
$$

where

$$
a_f = \frac{1}{P} \sum_{p=1}^{P} u_{fp}, \qquad
b_f = \frac{1}{P} \sum_{p=1}^{P} v_{fp} .
$$

This produces two new F x P matrices $\tilde{U} = [\tilde{u}_{fp}]$ and
$\tilde{V} = [\tilde{v}_{fp}]$. The matrix

$$
\tilde{W} = \begin{bmatrix} \tilde{U} \\ \tilde{V} \end{bmatrix}
$$

is  called  the  registered  measurement  matrix.  This  is  the 
input  to  our  shape-and-motion  algorithm. 
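
As a minimal sketch of the registration step, the following NumPy function subtracts each row mean and stacks the two coordinate matrices. The function name and the choice to return the row means alongside the registered matrix are illustrative assumptions.

```python
import numpy as np

def registered_measurement_matrix(U, V):
    """Build the 2F x P registered measurement matrix.

    U, V : (F, P) arrays of horizontal and vertical feature coordinates,
           one row per frame and one column per feature point.
    Returns (W_reg, a, b), where a and b hold the row means removed
    from U and V respectively.
    """
    U = np.asarray(U, dtype=float)
    V = np.asarray(V, dtype=float)
    a = U.mean(axis=1)                 # a_f: mean of row f of U
    b = V.mean(axis=1)                 # b_f: mean of row f of V
    W_reg = np.vstack([U - a[:, None], V - b[:, None]])
    return W_reg, a, b
```

Every row of the result sums to zero, which is what placing the image origin at the centroid of the tracked features amounts to.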


2.2  The  Rank  Theorem 

We now analyze the relation between camera motion,
shape, and the entries of the registered measurement
matrix $\tilde{W}$. This analysis leads to the key result that
$\tilde{W}$ is highly rank-deficient (the rank theorem).

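The orientation of the camera reference system
corresponding to frame number f is determined by a pair of
unit vectors, $\mathbf{i}_f$ and $\mathbf{j}_f$, pointing along the scanlines and
the columns of the image respectively, and defined with
respect to a world reference system with coordinates x,
y, and z (see Figure 1). Under orthography, all
projection rays are then parallel to the cross product of
$\mathbf{i}_f$ and $\mathbf{j}_f$:

$$
\mathbf{k}_f = \mathbf{i}_f \times \mathbf{j}_f .
$$

The origin of the camera reference system is at the
center of the image, while the origin of the world is the
centroid of the points $\mathbf{s}_p = (x_p, y_p, z_p)^T$ in space, so that

$$
\frac{1}{P} \sum_{p=1}^{P} \mathbf{s}_p = 0 .
$$

From Figure 1 we see that the projection $(u_{fp}, v_{fp})$
of point $\mathbf{s}_p = (x_p, y_p, z_p)^T$ onto frame f is given by the
equations

$$
u_{fp} = \mathbf{i}_f^T (\mathbf{s}_p - \mathbf{t}_f), \qquad
v_{fp} = \mathbf{j}_f^T (\mathbf{s}_p - \mathbf{t}_f),
$$

where $\mathbf{t}_f = (a_f, b_f, c_f)^T$ is the vector from the world
origin to the image center of frame f.

We can now write expressions for the entries $\tilde{u}_{fp}$ and
$\tilde{v}_{fp}$ of the registered measurement matrix by
substituting the projection equations above into the registration
equations (1). For the horizontal coordinates we have

$$
\begin{aligned}
\tilde{u}_{fp} &= u_{fp} - a_f \\
 &= \mathbf{i}_f^T (\mathbf{s}_p - \mathbf{t}_f)
    - \frac{1}{P} \sum_{q=1}^{P} \mathbf{i}_f^T (\mathbf{s}_q - \mathbf{t}_f) \\
 &= \mathbf{i}_f^T \Big( \mathbf{s}_p - \frac{1}{P} \sum_{q=1}^{P} \mathbf{s}_q \Big) \\
 &= \mathbf{i}_f^T \mathbf{s}_p .
\end{aligned} \tag{2}
$$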




Figure  1:  The  systems  of  reference  used  in  our  problem 
formulation. 


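We can write a similar equation for the registered
vertical image projection $\tilde{v}_{fp}$. To summarize,

$$
\tilde{u}_{fp} = \mathbf{i}_f^T \mathbf{s}_p, \qquad
\tilde{v}_{fp} = \mathbf{j}_f^T \mathbf{s}_p . \tag{3}
$$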


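Because of the two sets of F x P equations (3), the
registered measurement matrix $\tilde{W}$ can be expressed in a
matrix form:

$$
\tilde{W} = RS \tag{4}
$$

where

$$
R = \begin{bmatrix}
\mathbf{i}_1^T \\ \vdots \\ \mathbf{i}_F^T \\
\mathbf{j}_1^T \\ \vdots \\ \mathbf{j}_F^T
\end{bmatrix} \tag{5}
$$

represents the camera rotation, and

$$
S = [\, \mathbf{s}_1 \ \cdots \ \mathbf{s}_P \,] \tag{6}
$$
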

is  the  shape  matrix.  In  fact,  the  rows  of  R  represent 
the  orientations  of  the  horizontal  and  vertical  camera 
reference  axes  throughout  the  stream,  while  the  columns 
of  S  are  the  coordinates  of  the  P  feature  points  with 
respect  to  their  centroid. 

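From the first and the last line of equation (2), the
original unregistered matrix W can be written as

$$
W = RS + \mathbf{t}\, \mathbf{e}_P^T, \tag{7}
$$

where $\mathbf{t} = (a_1, \ldots, a_F, b_1, \ldots, b_F)^T$ is a 2F-dimensional
vector that gathers the projections of camera translation
along the image plane (see equation (2)), and $\mathbf{e}_P^T$ is a
row vector of P ones. In scalar form,

$$
u_{fp} = \mathbf{i}_f^T \mathbf{s}_p + a_f, \qquad
v_{fp} = \mathbf{j}_f^T \mathbf{s}_p + b_f . \tag{8}
$$

In the equations above, $\mathbf{i}_f$ and $\mathbf{j}_f$ are mutually
orthogonal unit vectors, so they must satisfy the constraints

$$
|\mathbf{i}_f| = |\mathbf{j}_f| = 1 \quad \text{and} \quad
\mathbf{i}_f^T \mathbf{j}_f = 0 . \tag{9}
$$

Also, the rotation matrix R is unique if the system of
reference for the solution is aligned, say, with that of the
first camera position:

$$
\mathbf{i}_1 = (1, 0, 0)^T \quad \text{and} \quad
\mathbf{j}_1 = (0, 1, 0)^T . \tag{10}
$$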

Since R is 2F x 3 and S is 3 x P, the matrix projection
equation  (4)  implies  the  following  rank  theorem. 

Without  noise,  the  registered  measurement  ma¬ 
trix  W  is  at  most  of  rank  three. 

The  rank  theorem  expresses  the  fact  that  the  2F  x 
P  image  measurements  are  highly  redundant.  Indeed, 
they  could  all  be  described  concisely  by  giving  F  frame 
reference  systems  and  P  point  coordinate  vectors,  if  only 
these  were  known. 
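
The rank theorem can be checked numerically on synthetic data. The sketch below is an illustration under stated assumptions, not part of the original experiments: it draws random camera frames and points, forms the unregistered matrix from equation (7), registers it, and asks NumPy for the numerical rank. All names and sizes are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(0)
F, P = 8, 20                               # frames and points (arbitrary)

S = rng.normal(size=(3, P))
S -= S.mean(axis=1, keepdims=True)         # world origin at the centroid

rows_i, rows_j, a, b = [], [], [], []
for f in range(F):
    # Columns of an orthogonal matrix give a random orthonormal pair (i_f, j_f).
    M, _ = np.linalg.qr(rng.normal(size=(3, 3)))
    i_f, j_f = M[:, 0], M[:, 1]
    t_f = rng.normal(size=3)               # camera translation for frame f
    rows_i.append(i_f)
    rows_j.append(j_f)
    a.append(-i_f @ t_f)                   # row mean of u: projection of -t_f
    b.append(-j_f @ t_f)

R = np.vstack(rows_i + rows_j)             # 2F x 3, as in equation (5)
t = np.array(a + b)                        # 2F translation vector
W = R @ S + t[:, None]                     # equation (7)
W_reg = W - W.mean(axis=1, keepdims=True)  # registration, equation (1)
rank = np.linalg.matrix_rank(W_reg)
```

With noise-free data the registered matrix coincides with `R @ S`, so its rank cannot exceed three regardless of how many frames and points are used.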

When  noise  corrupts  the  images,  the  registered  mea¬ 
surement  matrix  W  will  not  be  exactly  of  rank  3.  How¬ 
ever,  the  rank  theorem  can  be  extended  to  the  case  of 
noisy  measurements  in  a  well-defined  manner.  The  next 
subsection  introduces  this  extension,  using  the  concept 
of Singular Value Decomposition [GR71] to introduce the
notion  of  approximate  rank. 


2.3  Approximate  Rank 

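Assuming¹ that $2F \geq P$, the matrix $\tilde{W}$ can be
decomposed [GR71] into a 2F x P matrix $O_1$, a diagonal P x P
matrix $\Sigma$, and a P x P matrix $O_2$,

$$
\tilde{W} = O_1 \Sigma O_2, \tag{11}
$$

such that $O_1^T O_1 = O_2^T O_2 = O_2 O_2^T = I$, and
$\sigma_1 \geq \ldots \geq \sigma_P$. Here, I is the P x P identity matrix,
and the singular values $\sigma_1, \ldots, \sigma_P$ are the diagonal
entries of $\Sigma$. This is the Singular Value Decomposition (SVD)
of the matrix $\tilde{W}$.

If we now partition the matrices $O_1$, $\Sigma$, and $O_2$ as
follows:

$$
O_1 = \big[\, O_1' \;\big|\; O_1'' \,\big], \qquad
\Sigma = \begin{bmatrix} \Sigma' & 0 \\ 0 & \Sigma'' \end{bmatrix}, \qquad
O_2 = \begin{bmatrix} O_2' \\ O_2'' \end{bmatrix}, \tag{12}
$$

where $O_1'$ is 2F x 3, $\Sigma'$ is 3 x 3, and $O_2'$ is 3 x P (the
remaining blocks have P - 3 columns or rows, respectively),
we have

$$
O_1 \Sigma O_2 = O_1' \Sigma' O_2' + O_1'' \Sigma'' O_2'' .
$$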


Let $W^*$ be the ideal registered measurement matrix,
that is, the matrix we would obtain in the absence of
noise. Because of the rank theorem, $W^*$ has at most
three non-zero singular values. Since the singular values
in $\Sigma$ are sorted in non-increasing order, $\Sigma'$ must
contain all the singular values of $W^*$ that exceed the noise
level. As a consequence, the term $O_1'' \Sigma'' O_2''$ must be due
entirely to noise, and the product $O_1' \Sigma' O_2'$ is the best
possible rank-3 approximation to $\tilde{W}$.

¹This assumption is not crucial: if 2F < P, everything
can be repeated for the transpose of $\tilde{W}$.

We can now restate our rank theorem for noisy
measurements.

All the shape and rotation information in $\tilde{W}$ is
contained in its three greatest singular values,
together with the corresponding left and right
eigenvectors.

Thus, the best possible approximation to the ideal
registered measurement matrix $W^*$ is the product

$$
\hat{W} = O_1' \Sigma' O_2'
$$

where the primes refer to the partition (12). With the
definitions

$$
\hat{R} = O_1' [\Sigma']^{1/2}, \qquad
\hat{S} = [\Sigma']^{1/2} O_2',
$$

we can also write

$$
\hat{W} = \hat{R} \hat{S} . \tag{13}
$$

The two matrices $\hat{R}$ and $\hat{S}$ are of the same size as the
desired rotation and shape matrices R and S: $\hat{R}$ is 2F x 3,
and $\hat{S}$ is 3 x P. However, the decomposition (13) is not
unique. In fact, if Q is any invertible 3 x 3 matrix, the
matrices $\hat{R} Q$ and $Q^{-1} \hat{S}$ are also a valid decomposition
of $\hat{W}$, since

$$
(\hat{R} Q)(Q^{-1} \hat{S}) = \hat{R} (Q Q^{-1}) \hat{S} = \hat{R} \hat{S} = \hat{W} .
$$

Thus, $\hat{R}$ and $\hat{S}$ are in general different from R and
S. A striking fact, however, is that except for noise the
matrix $\hat{R}$ is a linear transformation of the true rotation
matrix R, and the matrix $\hat{S}$ is a linear transformation of
the true shape matrix S. Indeed, in the absence of noise,
R and $\hat{R}$ both span the column space of the registered
measurement matrix $\tilde{W} = W^* = \hat{W}$. Since that column
space is three-dimensional because of the rank theorem,
R and $\hat{R}$ are different bases for the same space, and there
must be a linear transformation between them.

Whether  the  noise  level  is  low  enough  that  it  can  be 
ignored  at  this  juncture  depends  also  on  the  camera  mo¬ 
tion  and  on  shape.  Notice,  however,  that  the  singu¬ 
lar  value  decomposition  yields  sufficient  information  to 
make  this  decision;  the  requirement  is  that  the  ratio  be¬ 
tween  the  third  and  the  fourth  largest  singular  values  of 
W  be  sufficiently  large. 
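
The rank-3 truncation and the singular-value ratio test can be sketched with NumPy's SVD, whose third return value is already the transposed right factor and so plays the role of $O_2$ in equation (11). The function name `rank3_factors` and the returned ratio are illustrative choices; how large a ratio counts as "sufficiently large" is left to the user.

```python
import numpy as np

def rank3_factors(W_reg):
    """Split a registered measurement matrix into rank-3 factors.

    Returns (R_hat, S_hat, ratio), where R_hat is 2F x 3, S_hat is 3 x P,
    and ratio is sigma_3 / sigma_4 (infinite if sigma_4 is zero or absent).
    """
    O1, sigma, O2 = np.linalg.svd(W_reg, full_matrices=False)
    root = np.sqrt(sigma[:3])
    R_hat = O1[:, :3] * root             # O1' [Sigma']^(1/2)
    S_hat = root[:, None] * O2[:3, :]    # [Sigma']^(1/2) O2'
    if sigma.size > 3 and sigma[3] > 0:
        ratio = sigma[2] / sigma[3]
    else:
        ratio = np.inf
    return R_hat, S_hat, ratio
```

A small ratio suggests that the third singular value is not clearly separated from the noise, in which case the rank-3 model should be treated with caution.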


2.4  The  Metric  Constraints 

To summarize, the matrix $\hat{R}$ is a linear transformation of
the true rotation matrix R. Likewise, $\hat{S}$ is a linear
transformation of the true shape matrix S. More specifically,
there exists a 3 x 3 matrix Q such that

$$
R = \hat{R} Q, \qquad S = Q^{-1} \hat{S} . \tag{14}
$$


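In order to find Q we observe that the rows of the
true rotation matrix R are unit vectors and the first F
are orthogonal to the corresponding F in the second half of
R. These metric constraints yield the over-constrained,
quadratic system (with $\hat{\mathbf{i}}_f^T$ and $\hat{\mathbf{j}}_f^T$ denoting rows f
and F + f of $\hat{R}$)

$$
\begin{aligned}
\hat{\mathbf{i}}_f^T Q Q^T \hat{\mathbf{i}}_f &= 1 \\
\hat{\mathbf{j}}_f^T Q Q^T \hat{\mathbf{j}}_f &= 1 \\
\hat{\mathbf{i}}_f^T Q Q^T \hat{\mathbf{j}}_f &= 0
\end{aligned} \tag{15}
$$
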

in  the  entries  of  Q.  This  is  a  simple  data  fitting  problem 
which,  though  nonlinear,  can  be  solved  efficiently  and 
reliably. 
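
One way to impose the metric constraints is sketched below. It is not necessarily the fitting procedure used by the authors: instead of a nonlinear fit in the entries of Q, it uses the common linearization of solving for the symmetric matrix L = Q Q^T by least squares and then taking a matrix square root. The function names are hypothetical.

```python
import numpy as np

def _coeffs(a, b):
    """Coefficients of a^T L b in the six unknowns of a symmetric 3x3 L,
    ordered (L00, L01, L02, L11, L12, L22)."""
    return np.array([a[0] * b[0],
                     a[0] * b[1] + a[1] * b[0],
                     a[0] * b[2] + a[2] * b[0],
                     a[1] * b[1],
                     a[1] * b[2] + a[2] * b[1],
                     a[2] * b[2]])

def solve_metric(R_hat):
    """Return a 3x3 Q such that the rows of R_hat @ Q satisfy the
    metric constraints in the least-squares sense."""
    F = R_hat.shape[0] // 2
    rows, rhs = [], []
    for i_hat, j_hat in zip(R_hat[:F], R_hat[F:]):
        rows += [_coeffs(i_hat, i_hat), _coeffs(j_hat, j_hat), _coeffs(i_hat, j_hat)]
        rhs += [1.0, 1.0, 0.0]
    l, *_ = np.linalg.lstsq(np.array(rows), np.array(rhs), rcond=None)
    L = np.array([[l[0], l[1], l[2]],
                  [l[1], l[3], l[4]],
                  [l[2], l[4], l[5]]])
    w, V = np.linalg.eigh(L)             # L = V diag(w) V^T
    if w.min() <= 0:
        raise ValueError("estimated Q Q^T is not positive definite")
    return V * np.sqrt(w)                # Q = V diag(sqrt(w)), so Q Q^T = L
```

The returned Q is only one of the valid solutions; any product of Q with an orthonormal matrix satisfies the same constraints.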

A last ambiguity needs to be resolved: if Q is a solution
of the metric constraint problem, so is QN, where N is
any orthonormal matrix. In fact,

$$
\hat{\mathbf{i}}_f^T (QN)(N^T Q^T) \hat{\mathbf{i}}_f
 = \hat{\mathbf{i}}_f^T Q (N N^T) Q^T \hat{\mathbf{i}}_f
 = \hat{\mathbf{i}}_f^T Q Q^T \hat{\mathbf{i}}_f = 1,
$$

and likewise for the remaining two constraint equations.
Geometrically, this corresponds to the fact that the
solution is determined up to a rotation, since the orientation
of, say, the first camera reference system with respect to
the world reference system is arbitrary. This
arbitrariness can be removed, if desired, by rotating the solution
so that the first frame is represented by the identity
matrix.


2.5  Outline  of  the  Complete  Algorithm 

Based  on  the  development  in  the  previous  sections,  we 
now  have  a  complete  algorithm  for  the  computation  of 
shape  and  rotation  from  the  registered  measurement  ma¬ 
trix  W  derived  from  a  stream  of  images.  To  summarize, 
the  rotation  matrix  R  and  the  shape  matrix  S  defined 
in  equations  (5)  and  (6)  can  be  computed  as  follows. 

1. Compute the singular-value decomposition
$\tilde{W} = O_1 \Sigma O_2$.

2. Define $\hat{R} = O_1' (\Sigma')^{1/2}$ and $\hat{S} = (\Sigma')^{1/2} O_2'$, where
the primes refer to the block partitioning defined in
(12).

3. Compute the matrix Q in equations (14) by
imposing the metric constraints (equations (15)).

4. Compute the rotation matrix R and the shape
matrix S as $R = \hat{R} Q$ and $S = Q^{-1} \hat{S}$.

5. If desired, align the first camera reference system
with the world reference system by forming the
products $R R_0$ and $R_0^T S$, where the orthonormal
matrix $R_0 = [\, \mathbf{i}_1 \ \mathbf{j}_1 \ \mathbf{k}_1 \,]$ rotates the first camera reference
system into the identity matrix.
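
The optional alignment in step 5 can be sketched as follows, assuming R already satisfies the metric constraints so that its first and (F+1)-th rows are orthonormal. The function name is illustrative.

```python
import numpy as np

def align_to_first_frame(R, S):
    """Rotate a metric solution so the first camera frame is the identity.

    R : (2F, 3) rotation matrix with rows i_1..i_F, j_1..j_F
    S : (3, P) shape matrix
    Returns (R @ R0, R0.T @ S) with R0 = [i_1 j_1 k_1].
    """
    F = R.shape[0] // 2
    i1, j1 = R[0], R[F]
    k1 = np.cross(i1, j1)                 # k_1 = i_1 x j_1
    R0 = np.column_stack([i1, j1, k1])    # orthonormal when i_1, j_1 are
    return R @ R0, R0.T @ S
```

Because R0 is orthonormal, the product of the two returned matrices equals the original R S, so the measurements are reproduced unchanged.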

3  An  Experiment 

In  this  section  we  test  the  factorization  method  with 
an  experiment  on  a  real  stream  of  images.  The  images 
depict  a  small  plastic  model  of  a  building.  The  camera  is 
a  Sony  CCD  camera  with  a  200  mm  lens,  and  is  moved 
by means of a high-precision positioning platform. Some
frames in the stream are shown in figure 3. Camera
pitch, yaw, and roll around the model are all varied as
shown by the dashed curves in figure 4. The translation
of the camera is such as to keep the building within the
field of view of the camera.

For feature tracking, we extended the method
described in [LK81] to allow also for the automatic
selection of image features. The entire set of 430 features is
displayed in figure 5, overlaid on the first frame of the
stream. Of these features, 42 were abandoned during
tracking because their appearance changed too much.
The  remaining  388  features  are  used  in  the  computation 
of  shape  and  motion. 

The  plots  in  figure  4  compare  the  rotation  compo¬ 
nents  computed  by  the  algorithm  (solid  curves)  with  the 
values  measured  mechanically  from  the  mobile  platform 
(dashed  curves).  The  differences  are  magnified  in  fig¬ 
ure  6.  The  errors  are  everywhere  less  than  0.4  degrees. 
The  computed  motion  follows  closely  also  rotations  with 
curved  profiles,  such  as  the  roll  profile  between  frames 
1  and  20  (second  plot  in  figure  4),  and  faithfully  pre¬ 
serves  all  discontinuities  in  the  rotational  velocities;  the 
algorithm  does  not  smooth  the  results. 

Between  frames  60  and  80,  yaw  and  pitch  are  nearly 
constant,  that  is,  the  camera  merely  rotates  about  its 
optical  axis.  This  demonstrates  that  it  is  sufficient  for 
the  stream  as  a  whole  to  be  taken  during  nondegener¬ 
ate  motion.  The  algorithm  can  deal  without  difficulty 
with  streams  that  contain  degenerate  substreams,  be¬ 
cause  the  information  in  the  stream  is  used  all  at  once 
in  our  method. 

The  shape  results  are  shown  qualitatively  in  figure  7, 
which  shows  the  computed  shape  viewed  from  above. 
The  view  in  figure  7  is  similar  to  that  in  figure  8,  in¬ 
cluded  for  visual  comparison.  Notice  that  the  walls,  the 
windows  on  the  roof,  and  the  chimneys  are  recovered  in 
their  correct  positions. 

To  evaluate  the  shape  performance  quantitatively,  we 
measured  some  distances  on  the  actual  house  model  with 
a  ruler,  and  compared  them  with  the  distances  computed 
from the point coordinates in the shape results. Figure
9 shows the selected features. The diagram in figure
10 shows the distances between pairs of features
measured on the actual model and computed by our
algorithm. The measured distances between the steps along
the  right  side  of  the  roof  (7.2  mm)  were  obtained  by 
measuring  five  steps  and  dividing  the  total  distance  (36 
mm)  by  five.  The  differences  between  computed  and 
measured  results  are  of  the  order  of  the  resolution  of  our 
ruler  measurements  (one  millimeter). 

4  Occlusions 

Two  separate  problems  arise  when  features  are  occluded 
during  the  image  stream:  how  to  detect  occlusions  and 
how  to  deal  with  them  once  they  occur.  A  method  to 
detect  occlusions  was  suggested  in  [TK91b],  based  on 
monitoring  the  photometric  difference  between  a  given 
feature  in  the  first  and  in  the  current  frame.  Work  on 
that method is still in progress. In this section, we deal
with the second problem above and assume that
occlusion events are given. In the experiments below,
occlusions are marked by hand.

Figure 2: The Reconstruction Condition. If the dotted
entries of the measurement matrix are known, the two
unknown ones (question marks) can be reconstructed.

Sequences  with  appearing  and  disappearing  features 
result  in  a  partially  unknown  measurement  matrix  W, 
so  that  the  factorization  method  introduced  in  section  2 
cannot  be  applied  directly. 

However,  there  is  usually  sufficient  information  in  the 
stream  to  determine  all  the  camera  positions  and  all  the 
three-dimensional  feature  point  coordinates.  If  that  is 
the  case,  we  can  not  only  solve  the  shape  and  motion 
recovery  problem  from  the  incomplete  measurement  ma¬ 
trix  W,  but  we  can  even  hallucinate  the  unknown  entries 
of  W  by  projecting  the  computed  three-dimensional  fea¬ 
ture  coordinates  onto  the  computed  camera  positions. 
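
Once shape, rotation, and translation are available for every frame and point, filling in the occluded entries reduces to evaluating equation (7) at the missing positions. The sketch below assumes that unknown measurements are marked with NaN, which is a convention chosen here for illustration rather than one stated in the text; the function name is likewise hypothetical.

```python
import numpy as np

def fill_unknown_entries(W, R, S, t):
    """Replace NaN entries of W with their predicted projections.

    W : (2F, P) measurement matrix, NaN where a feature is occluded
    R : (2F, 3) rotation matrix, S : (3, P) shape matrix
    t : (2F,) translation vector
    """
    W = np.asarray(W, dtype=float)
    predicted = R @ S + np.asarray(t, dtype=float)[:, None]   # equation (7)
    return np.where(np.isnan(W), predicted, W)
```

Known measurements are left untouched, so the result differs from the input only at the occluded positions.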

In  this  section,  we  first  introduce  a  solution  propaga¬ 
tion  method  to  grow  a  shape  and  motion  solution  for  a 
full  submatrix  of  W  into  a  solution  for  all  of  W  (subsec¬ 
tion  4.1).  We  extend  the  propagation  method  to  noisy 
streams  in  subsection  4.2.  In  section  5.1  we  demonstrate 
the  method  on  a  real  stream  of  images,  in  which  a  ping- 
pong  ball  is  rotated  more  than  360  degrees  in  front  of 
the  camera. 

4.1 Solution for Noise-Free Images

In  this  subsection,  we  show  how  to  fill  in  the  unknown  en¬ 
tries  of  the  measurement  matrix  in  the  case  of  noise-free 
images.  In  the  next  subsection,  we  extend  the  method 
to  noisy  streams. 

If  there  is  no  noise,  the  occlusion  problem  is  solvable 
if  every  unknown  image  feature  coordinate  belongs  to  an 
otherwise  full  4x4  submatrix  of  W.  Formally,  we  have 
the  following  sufficient  condition. 

Condition for Reconstruction In the absence of
noise, an unknown image measurement pair $(u_{fp}, v_{fp})$
can be reconstructed if point p is visible in at least three
more frames $f_1, f_2, f_3$, and if there are at least three more
points $p_1, p_2, p_3$ that are visible in frames $f, f_1, f_2, f_3$.

Referring  to  Figure  2,  this  means  that  the  dotted  en¬ 
tries  must  be  known  to  reconstruct  the  question  marks. 
In  this  subsection,  we  prove  this  condition.  To  this  end, 
we  notice  that  the  rows  and  columns  of  the  noise-free 
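measurement matrix $W$ can always be permuted so that
$f_1 = p_1 = 1$, $f_2 = p_2 = 2$, $f_3 = p_3 = 3$, $f = p = 4$. We
can therefore suppose that $u_{44}$ and $v_{44}$ are the only two
unknown entries in the 8 x 4 matrix

$$
W = \begin{bmatrix} U \\ V \end{bmatrix} =
\begin{bmatrix}
u_{11} & u_{12} & u_{13} & u_{14} \\
u_{21} & u_{22} & u_{23} & u_{24} \\
u_{31} & u_{32} & u_{33} & u_{34} \\
u_{41} & u_{42} & u_{43} & ? \\
v_{11} & v_{12} & v_{13} & v_{14} \\
v_{21} & v_{22} & v_{23} & v_{24} \\
v_{31} & v_{32} & v_{33} & v_{34} \\
v_{41} & v_{42} & v_{43} & ?
\end{bmatrix} .
$$

Then, the factorization method can be applied to the
first three rows of U and V, that is, to the 6 x 4 submatrix

$$
W_{6\times 4} =
\begin{bmatrix}
u_{11} & u_{12} & u_{13} & u_{14} \\
u_{21} & u_{22} & u_{23} & u_{24} \\
u_{31} & u_{32} & u_{33} & u_{34} \\
v_{11} & v_{12} & v_{13} & v_{14} \\
v_{21} & v_{22} & v_{23} & v_{24} \\
v_{31} & v_{32} & v_{33} & v_{34}
\end{bmatrix}
$$

to produce the partial translation and rotation submatrices

$$
t_{6\times 1} = \begin{bmatrix} a_1 \\ a_2 \\ a_3 \\ b_1 \\ b_2 \\ b_3 \end{bmatrix}
\quad\text{and}\quad
R_{6\times 3} = \begin{bmatrix} i_1^T \\ i_2^T \\ i_3^T \\ j_1^T \\ j_2^T \\ j_3^T \end{bmatrix}
\tag{17}
$$

and the full shape matrix

$$
S = [\, s_1 \;\; s_2 \;\; s_3 \;\; s_4 \,] \tag{18}
$$

such that

$$
W_{6\times 4} = R_{6\times 3}\, S + t_{6\times 1}\, e_4^T
$$

($e_4^T$ is a row vector of four ones).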

To complete the rotation solution, we need to compute
the vectors $i_4$ and $j_4$. However, a registration problem
must be solved first. In fact, only three points are
visible in the fourth frame, while equation (18) yields
all four points in space. Since the factorization method
computes the space coordinates with respect to the centroid
of the points, we have $s_1 + s_2 + s_3 + s_4 = 0$, while
the image coordinates in the fourth frame are measured
with respect to the centroid of just three points (1, 2, 3).
Thus, before we can compute $i_4$ and $j_4$ we must make
the two origins coincide by referring all coordinates to
the centroid

$$
c = \frac{1}{3}(s_1 + s_2 + s_3)
$$

of the three points that are visible in all four frames. In
the fourth frame, the projection of $c$ has coordinates

$$
a'_4 = \frac{1}{3}(u_{41} + u_{42} + u_{43})
$$
$$
b'_4 = \frac{1}{3}(v_{41} + v_{42} + v_{43}) ,
$$

so we can define the new coordinates

$$
s'_p = s_p - c \quad \text{for } p = 1, 2, 3
$$

in space and

$$
u'_{4p} = u_{4p} - a'_4 , \qquad
v'_{4p} = v_{4p} - b'_4 \quad \text{for } p = 1, 2, 3
$$

in the fourth frame. Then, $i_4$ and $j_4$ are the solutions of
the two 3 x 3 systems

$$
[\, u'_{41} \;\; u'_{42} \;\; u'_{43} \,] = i_4^T\, [\, s'_1 \;\; s'_2 \;\; s'_3 \,]
$$
$$
[\, v'_{41} \;\; v'_{42} \;\; v'_{43} \,] = j_4^T\, [\, s'_1 \;\; s'_2 \;\; s'_3 \,] \tag{19}
$$

derived  from  equation  (4).  The  second  equation  in  (17) 
and  the  solution  to  (19)  yield  the  entire  rotation  matrix 
R,  while  shape  is  given  by  equation  (18). 

The components $a_4$ and $b_4$ of translation in the fourth
frame with respect to the centroid of all four points can
be computed by postmultiplying equation (7) by the vector
$\eta_4 = (1, 1, 1, 0)^T$:

$$
W \eta_4 = R S \eta_4 + t\, e_4^T \eta_4 .
$$

Since $e_4^T \eta_4 = 3$, we obtain

$$
t = \frac{1}{3}(W - RS)\, \eta_4 . \tag{20}
$$

In particular, rows 4 and 8 of this equation yield $a_4$ and
$b_4$. Notice that the unknown entries $u_{44}$ and $v_{44}$ are
multiplied by zeros in equation (20).

Now that both motion and shape are known, the missing
entries $u_{44}$, $v_{44}$ of the measurement matrix $W$ can
be found by orthographic projection (equation (8)):

$$
u_{44} = i_4^T s_4 + a_4
$$
$$
v_{44} = j_4^T s_4 + b_4 .
$$
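The last two steps, equation (20) followed by orthographic projection, can be sketched numerically. This is a minimal illustration that assumes the full 8 x 3 rotation matrix `R` and the 3 x 4 shape matrix `S` are already available, and that the two unknown entries of `W` are stored as NaN; the function name `fill_missing_entry` is hypothetical.

```python
import numpy as np


def fill_missing_entry(W, R, S):
    """Fill u_44 and v_44 of an 8x4 measurement matrix W.

    W has NaN at (row 4, col 4) of both U and V, i.e. at zero-based
    positions (3, 3) and (7, 3). R is 8x3 and S is 3x4.
    """
    eta = np.array([1.0, 1.0, 1.0, 0.0])     # selects points 1, 2, 3
    W0 = np.nan_to_num(W, nan=0.0)           # unknowns are multiplied by zero below
    t = (W0 - R @ S) @ eta / 3.0             # equation (20)
    W_full = W.copy()
    W_full[3, 3] = R[3] @ S[:, 3] + t[3]     # u_44 = i_4^T s_4 + a_4
    W_full[7, 3] = R[7] @ S[:, 3] + t[7]     # v_44 = j_4^T s_4 + b_4
    return W_full, t
```

Because the last component of the selection vector is zero, the placeholder written into the unknown entries has no influence on the recovered translation.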

The  procedure  thus  completed  factors  the  full  6x4 
submatrix  of  W  and  then  reasons  on  the  three  points 
that  are  visible  in  all  the  frames  to  compute  motion  for 
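the fourth frame. Alternatively, one can first apply
factorization to the 8 x 3 submatrix

$$
W_{8\times 3} =
\begin{bmatrix}
u_{11} & u_{12} & u_{13} \\
u_{21} & u_{22} & u_{23} \\
u_{31} & u_{32} & u_{33} \\
u_{41} & u_{42} & u_{43} \\
v_{11} & v_{12} & v_{13} \\
v_{21} & v_{22} & v_{23} \\
v_{31} & v_{32} & v_{33} \\
v_{41} & v_{42} & v_{43}
\end{bmatrix}
$$

to produce the full translation and rotation submatrices

$$
t'_{8\times 1} = \begin{bmatrix} a'_1 \\ a'_2 \\ a'_3 \\ a'_4 \\ b'_1 \\ b'_2 \\ b'_3 \\ b'_4 \end{bmatrix}
\quad\text{and}\quad
R_{8\times 3} = \begin{bmatrix} i_1^T \\ i_2^T \\ i_3^T \\ i_4^T \\ j_1^T \\ j_2^T \\ j_3^T \\ j_4^T \end{bmatrix}
$$

and the partial shape matrix

$$
S'_{3\times 3} = [\, s'_1 \;\; s'_2 \;\; s'_3 \,]
$$

such that

$$
W_{8\times 3} = R_{8\times 3}\, S'_{3\times 3} + t'_{8\times 1}\, e_3^T .
$$

The primes here signal again that coordinates refer to
the centroid of only the first three points. Then, this
partial solution can be extended to $s'_4$ by solving the
following overconstrained system of six equations in the
three unknown entries of $s'_4$:

$$
\begin{bmatrix} u'_{14} \\ u'_{24} \\ u'_{34} \\ v'_{14} \\ v'_{24} \\ v'_{34} \end{bmatrix}
=
\begin{bmatrix} i_1^T \\ i_2^T \\ i_3^T \\ j_1^T \\ j_2^T \\ j_3^T \end{bmatrix} s'_4
\tag{24}
$$

where

$$
u'_{f4} = u_{f4} - a'_f , \qquad
v'_{f4} = v_{f4} - b'_f \quad \text{for } f = 1, 2, 3 .
$$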


The matrix $S'$ of the "primed" shape coordinates can
now be registered with respect to their centroid to yield
the "unprimed" coordinates:

$$
s_p = s'_p - \frac{1}{4} S' e_4 \quad \text{for } p = 1, 2, 3, 4
$$

and  the  "unprimed”  translation  can  again  be  found  from 
equation  (20). 

In  conclusion,  the  full  motion  and  shape  solution  can 
be  found  in  either  of  the  following  ways: 

1. factor $W_{6\times 4}$ to find a partial motion and full shape
solution, and propagate it to include motion for the
remaining frame (equations (19));

2. factor $W_{8\times 3}$ to find a full motion and partial shape
solution, and propagate it to include the remaining
feature point (equation (24)).


4.2  Solution  in  the  Presence  of  Noise 


The solution propagation method introduced in the previous
subsection can be extended to $2F \times P$ measurement
matrices with $F > 4$ and $P > 4$. In fact, the only difference
is that the propagation equations (19) and (24)
now become overconstrained.

If  the  measurement  matrix  W  is  noisy,  this  redun¬ 
dancy  is  beneficial,  since  equations  (19)  and  (24)  can  be 
solved  in  the  Least  Square  Error  sense,  and  the  effect  of 
noise  is  reduced. 

In  the  general  case  of  a  noisy  2F  x  P  matrix  W  the 
solution  propagation  method  can  be  summarized  as  fol¬ 
lows.  A  possibly  large,  full  subblock  of  W  is  first  de¬ 
composed  by  factorization.  Then,  this  initial  solution 
is  grown  one  row  or  one  column  at  a  time  by  solving 
systems  analogous  to  those  in  (19)  or  (24),  in  the  LSE 
sense. 
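The two growth steps can be sketched as least-squares solves. This is only an illustration under stated assumptions: the function names are hypothetical, unknown measurements are selected with a boolean mask `visible`, at least four non-coplanar visible points are assumed in the row step so that the centered system has full rank, and in the column step the translations `a`, `b` are assumed to refer to the same origin as the recovered point.

```python
import numpy as np


def propagate_row(S, u, v, visible):
    """Motion (i_f, j_f, a_f, b_f) for one new frame, in the spirit of (19).

    S is the known 3 x P shape matrix; u, v hold the frame's image
    coordinates; visible is a boolean mask over the P points.
    """
    Sv = S[:, visible]
    c = Sv.mean(axis=1)                       # centroid of the visible points
    Sp = Sv - c[:, None]                      # s'_p = s_p - c
    a_p, b_p = u[visible].mean(), v[visible].mean()
    i_f = np.linalg.lstsq(Sp.T, u[visible] - a_p, rcond=None)[0]
    j_f = np.linalg.lstsq(Sp.T, v[visible] - b_p, rcond=None)[0]
    # Translation with respect to the origin of S.
    return i_f, j_f, a_p - i_f @ c, b_p - j_f @ c


def propagate_column(R_i, R_j, a, b, u, v, visible):
    """Shape vector for one new point, in the spirit of (24).

    R_i, R_j are F x 3 stacks of the i_f and j_f vectors; a, b are the
    per-frame translations; visible is a boolean mask over the F frames.
    """
    A = np.vstack([R_i[visible], R_j[visible]])            # 2 F_n x 3 system
    rhs = np.concatenate([u[visible] - a[visible], v[visible] - b[visible]])
    return np.linalg.lstsq(A, rhs, rcond=None)[0]
```

With noise-free data both functions reproduce the true motion and shape exactly; with noisy data they return the least-squares estimates that the text refers to.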

However, because of noise, the order in which the rows
and columns of $W$ are incorporated into the solution can
affect the exact values of the final motion and shape solution.
Consequently, once the solution has been propagated
to the entire measurement matrix $W$, it may be
necessary to refine the results with a steepest-descent
minimization of the residue

$$
\left\| W - R S - t\, e_P^T \right\|
$$

(see equation (7)).
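A refinement of this kind can be sketched as plain gradient descent on the squared residue over the known entries. The sketch below makes several assumptions that are not in the text: the residue is restricted to entries marked in a boolean mask `M`, a simple backtracking rule chooses the step, and the orthonormality of the rows of R is not enforced. All function names are hypothetical.

```python
import numpy as np


def masked_error(W, M, R, S, t):
    """W - R S - t e^T, with unknown entries (M == False) set to zero."""
    return np.where(M, W - R @ S - t[:, None], 0.0)


def residue(W, M, R, S, t):
    """Squared residue over the known entries."""
    return float((masked_error(W, M, R, S, t) ** 2).sum())


def refine(W, M, R, S, t, iters=100, step=1e-2):
    """Steepest descent with backtracking; never increases the residue."""
    R, S, t = R.copy(), S.copy(), t.copy()
    for _ in range(iters):
        E = masked_error(W, M, R, S, t)
        f0 = float((E ** 2).sum())
        # Gradients of the squared residue with respect to R, S and t.
        gR, gS, gt = -2.0 * E @ S.T, -2.0 * R.T @ E, -2.0 * E.sum(axis=1)
        while step > 1e-12:
            cand = (R - step * gR, S - step * gS, t - step * gt)
            if residue(W, M, *cand) < f0:
                R, S, t = cand
                step *= 2.0                  # try a longer step next time
                break
            step *= 0.5                      # shrink until the residue drops
    return R, S, t
```

Started from the propagated solution, such a loop can only lower the residue, which is why the choice of initial block and growth order matters less than one might expect.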

There  remain  the  two  problems  of  how  to  choose  the 
initial  full  subblock  to  which  factorization  is  applied  and 
in  what  order  to  grow  the  solution.  In  fact,  however, 
because  of  the  final  refinement  step,  neither  choice  is 
critical  as  long  as  the  initial  matrix  is  large  enough  to 
yield  a  good  starting  point.  We  illustrate  this  point  in 
the  first  experiment  of  the  next  section. 

5  More  Experiments 

In  this  section  we  first  test  the  propagation  method 
for  occlusions  with  a  laboratory  experiment  (subsection 
5.1).  Then,  we  demonstrate  the  robustness  of  the  factor¬ 
ization  method  with  two  streams  taken  outdoors  with  a 
hand-held amateur camera. In the stream of subsection
5.2, occlusions are negligible. In the stream of subsection
5.3, they are dominant.

5.1  A  Lab  Experiment  with  Occlusions 

In  this  image  stream,  a  ping-pong  ball  with  black  dots 
marked  on  its  surface  is  rotated  450  degrees  in  front  of 
the  camera.  The  rotation  between  adjacent  frames  is 
2  degrees,  so  the  stream  is  226  frames  long.  Figure  11 
shows  the  first  frame  of  the  stream,  with  the  automati¬ 
cally  selected  features  overlaid. 

Every  30  frames  (60  degrees)  of  rotation,  the  feature 
tracker  looks  for  new  features.  In  this  way,  features  that 
disappear  on  one  side  around  the  ball  are  replaced  by 
new  ones  that  appear  on  the  other  side.  Figure  14  shows 
the  tracks  of  60  features,  randomly  chosen  among  the 
829  found  by  the  selector. 

If  all  measurements  are  collected  into  the  noisy  mea¬ 
surement  matrix  W,  the  two  halves  U  and  V  of  W  have 
the  same  fill  pattern:  if  the  x  coordinate  of  a  measure¬ 
ment  is  known,  so  is  the  y  coordinate.  Figure  12  shows 
the  fill  matrix  for  our  experiment.  This  matrix  has  the 
same size as either U or V, that is, F x P. Every column
corresponds to a feature point, and every row to
a  frame.  Shaded  regions  denote  known  entries.  The  fill 
matrix  has  226  x  829  =  187354  entries.  Of  these,  30185 
(about  16  percent)  are  known. 

To  start  the  motion  and  shape  computation,  the  algo¬ 
rithm  first  looks  for  a  large,  full  submatrix.  Rows  and 
columns  in  the  submatrix  need  not  be  adjacent.  Finding 
a  maximally  large  submatrix  is  both  hard  and  unneces¬ 
sary,  since  the  initial  matrix  need  only  be  large  enough 
to  allow  for  a  reliable  initialization  of  the  motion  and 
shape  matrices. 

Our  system  looks  for  the  initial  matrix  by  exploring 
the  feature  columns  that  originate  in  the  first  frame.  In 
our  experiment,  there  are  183  such  features,  which  oc¬ 
cupy  the  left  226  x  183  submatrix  of  the  fill  matrix  of 
figure  12. 

The  matrix  initialization  routine  sorts  the  columns  in 
that  submatrix  in  decreasing  order  of  length  (number  of 
known  entries).  The  routine  then  adds  one  column  at  a 
time.  At  every  step,  the  size  of  the  largest  full  matrix 
that  can  be  formed  with  the  given  columns  is  equal  to 
the  number  of  columns  times  the  length  of  the  shortest 
column.  The  routine  stops  when  adding  an  extra  column 
would  reduce  the  size  of  the  largest  full  matrix. 
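The initialization rule can be sketched in a few lines. The function below is a minimal illustration with a hypothetical name; it assumes each candidate column is described only by its length, that is, the number of consecutive frames from the first frame in which the feature is known.

```python
def initial_block(lengths):
    """Greedy choice of the initial full submatrix.

    lengths[p] is the number of known entries of column p (all columns
    are assumed to start in the first frame). Returns the chosen column
    indices and the number of rows of the full block.
    """
    order = sorted(range(len(lengths)), key=lambda p: lengths[p], reverse=True)
    chosen, best_area = [], 0
    for p in order:
        # With columns sorted by decreasing length, the newest column is the shortest.
        area = (len(chosen) + 1) * lengths[p]
        if area < best_area:          # adding this column would shrink the block
            break
        chosen.append(p)
        best_area = area
    rows = lengths[chosen[-1]] if chosen else 0
    return chosen, rows
```

For example, with column lengths 10, 2, 9 and 8, the routine keeps the three long columns and returns a block of 8 rows by 3 columns, since adding the column of length 2 would reduce the area from 24 to 8.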
In the example we consider, the initial submatrix happens
to be square, of size 53 x 53. Motion and shape
are computed for this submatrix with the factorization
method of section 2. From this initial estimate, the solution
is grown by repeatedly solving systems like those of
equations (19), to add new rows, and of equation (24),
to add new columns. Instead of 3 x 3 and 6 x 3, however,
the systems are now more generally $3 \times P_n$ and $2F_n \times 3$
at step $n$.

The greater $F_n$ and $P_n$, the more overconstrained the
solution of each system. Consequently, we again use a
greedy strategy to grow the initial matrix into the final
solution. At each step of the growth, the row or column
is chosen that has the maximum number of known entries
in the same columns or rows as the current solution.
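One way to express this greedy choice is sketched below. It assumes the fill matrix is a boolean F x P array and that the current solution is described by two boolean masks, `rows_in` and `cols_in`; the function name, the tie-breaking rule (rows before columns) and the `min_support` threshold are illustrative assumptions.

```python
import numpy as np


def next_to_add(fill, rows_in, cols_in, min_support=3):
    """Pick the next frame (row) or feature (column) to add to the solution.

    Returns ("row", index, support) or ("col", index, support), or None
    when no remaining row or column has at least min_support known
    entries overlapping the current solution.
    """
    fill = np.asarray(fill, dtype=bool)
    # Known entries of each row that fall in already solved columns.
    row_support = fill[:, cols_in].sum(axis=1)
    # Known entries of each column that fall in already solved rows.
    col_support = fill[rows_in, :].sum(axis=0)
    row_support[rows_in] = -1         # exclude rows already in the solution
    col_support[cols_in] = -1         # exclude columns already in the solution
    r, c = int(row_support.argmax()), int(col_support.argmax())
    if max(row_support[r], col_support[c]) < min_support:
        return None
    if row_support[r] >= col_support[c]:
        return ("row", r, int(row_support[r]))
    return ("col", c, int(col_support[c]))
```

Calling this repeatedly, and updating the masks after each propagation step, grows the initial block until only rows and columns with too few known entries remain.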

Eventually,  all  of  the  motion  and  shape  values  are 
determined.  As  a  result,  the  unknown  84  percent  of 
the  measurement  matrix  can  be  hallucinated  from  the 
known  16  percent,  with  the  only  exception  of  the  very 
few  rows  and  columns  containing  fewer  than  three  en¬ 
tries,  for  which  no  reconstruction  is  possible. 

Figure 13 shows two views of the final shape results,
taken  from  the  top  and  from  the  side.  Notice  the  missing 
features  at  the  bottom  of  the  ball  in  the  side  view.  That 
is  the  part  of  the  ball  that  was  always  invisible,  because 
it  rested  on  the  rotating  platform. 

To display the motion results, we look at the $i_f$ and $j_f$
vectors directly. We recall that these unit vectors point
along the rows and columns of the image sensor for frame
$f$ in $1, \ldots, F$. Because the ping-pong ball rotates around
a fixed axis, both $i_f$ and $j_f$ should sweep a cone in space,
as shown in Figure 15. The tips of $i_f$ and $j_f$ should
describe two circles in space, centered along the axis of
rotation. Figure 16 shows two views of these vector tips,
from the top and from the side. Notice the double arc
in the top part of figure 16. If the motion reconstruction
were  perfect,  the  two  arcs  would  be  indistinguishable. 
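The geometric claim, that the tip of a unit vector rotated about a fixed axis traces a circle centered on that axis, can be verified with a short synthetic check. The axis and the starting vector below are arbitrary illustrative values, not quantities from the experiment; only the frame count (226) and the 2-degree increment are taken from the text.

```python
import numpy as np


def rotate(v, axis, theta):
    """Rodrigues' formula: rotate v about the unit vector along axis by theta radians."""
    k = axis / np.linalg.norm(axis)
    return (v * np.cos(theta)
            + np.cross(k, v) * np.sin(theta)
            + k * (k @ v) * (1.0 - np.cos(theta)))


axis = np.array([0.0, 1.0, 0.2])                 # assumed rotation axis
i0 = np.array([1.0, 0.0, 0.0])                   # assumed i vector in the first frame
angles = np.deg2rad(2.0 * np.arange(226))        # 226 frames, 2 degrees apart
tips = np.array([rotate(i0, axis, th) for th in angles])

k = axis / np.linalg.norm(axis)
heights = tips @ k                               # component along the axis
radii = np.linalg.norm(tips - np.outer(heights, k), axis=1)   # distance from the axis
```

Both `heights` and `radii` are constant over the sequence, so the tips lie on a circle in a plane normal to the axis. After 360 degrees the vector retraces the same circle, which is why a perfect reconstruction would show a single arc.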

5.2  An  Outdoor  Sequence 

In  this  subsection,  we  describe  the  results  of  the  factor¬ 
ization  method  on  a  stream  of  a  real  building,  taken  with 
a  hand-held  camera. 

Outdoor  images  are  harder  to  process  than  streams 
produced  in  the  lab  because  lighting  changes  less  pre¬ 
dictably  and  the  motion  of  the  camera  is  more  difficult  to 
control.  As  a  consequence,  features  are  harder  to  track: 
the  images  are  unpredictably  blurred  by  motion,  and 
corrupted  by  vibrations  of  the  recorder’s  head,  both  dur¬ 
ing  recording  and  digitization.  Furthermore,  the  camera 
jumps  and  jerks  produce  a  wide  range  of  image  dispari¬ 
ties. 

Figure  17  shows  some  of  the  180  frames  of  the  build¬ 
ing  stream.  The  overall  motion  covers  a  relatively  small 
rotation  angle,  approximately  15  degrees. 

The  features  found  by  the  selection  algorithm  in  the 
first  frame  are  shown  in  figure  18.  There  are  many  false 
features.  The  reflections  in  the  window  partially  visi¬ 
ble  in  the  top  left  of  the  image  move  non-rigidly,  and  so 
do  the  intersections  between  the  roof  and  the  horizontal 
edges  of  the  siding  on  the  right  of  the  same  window.  A 
few more false features can be found in the lower left corner
of the picture, where the vertical bars of the handrail
intersect  the  horizontal  edges  of  the  bricks  of  the  wall  be¬ 
hind.  We  removed  these  bad  features  by  hand,  by  mask¬ 
ing  away  the  two  parts  of  the  image  involved.  The  front 
view  of  the  shape  results  in  Figure  19  clearly  shows  the 
cuts. 

Figure  20  shows  the  tracks  of  60  features  selected  ran¬ 
domly  out  of  the  376  found  by  the  selection  algorithm. 
Notice  the  very  jagged  trajectories  due  to  the  erratic 
motion  of  the  hand-held  camera. 

Figures  19  and  22  show  a  front  and  a  top  view  of 
the  building  as  reconstructed  by  the  shape  and  motion 
recovery algorithm. To obtain these figures, we triangulated
the tracked feature points, and mapped the pixel
values  in  the  first  frame  onto  the  resulting  surface.  The 
structure  of  the  visible  part  of  the  building’s  three  walls 
has  clearly  been  reconstructed.  In  these  figures,  the  left 
wall  appears  to  bend  on  the  right,  where  it  intersects 
the  middle  wall.  This  occurred  because  the  feature  se¬ 
lector  found  features  along  the  shadow  of  the  roof  just 
on  the  right  of  the  intersection  of  the  two  walls,  rather 
than  at  the  intersection  itself.  Thus,  the  appearance  of  a 
bending  wall  is  an  artifact  of  the  triangulation  done  for 
rendering. 

Figure 21 shows two views of the $i_f$ and $j_f$ rotation
vectors, in the style of figure 16. Notice how much more
jagged the trajectories are, when compared with those
of figure 16.

From  the  experiment  described  in  this  subsection  we 
conclude  that  our  approach  works  also  for  image  streams 
taken  outdoors  with  the  jerky  motion  produced  by  a 
hand-held  camera. 

The  identification  of  false  features,  that  is,  of  features 
that do not move rigidly with respect to the environment,
remains  an  open  problem  that  must  be  solved  for  a  fully 
autonomous  system. 

5.3  An  Indoor  Sequence  with  Occlusion 

In  the  previous  experiment  we  only  tested  the  factor¬ 
ization  method  and  ignored  features  that  disappeared 
during  the  stream.  In  this  subsection  we  describe  an  ex¬ 
periment  where  occlusion  is  a  dominant  phenomenon:  a 
hand  holds  a  cup  and  rotates  it  in  front  of  the  camera 
by  about  ninety  degrees.  Figure  23  shows  four  out  of  the 
240  frames  of  the  stream. 

An  additional  difficulty  in  this  experiment  is  the  need 
for  figure/ground  segmentation.  Since  the  camera  was 
mounted  on  a  fixed  stand,  however,  this  problem  is  eas¬ 
ily  solved:  features  that  do  not  move  belong  to  the  back¬ 
ground. 

The  stream  presented  in  this  experiment  shows  some 
nonrigid  motion:  as  the  hand  turns,  the  configuration 
and  relative  position  of  the  fingers  changes  slightly.  This 
effect,  however,  is  small  and  did  not  affect  the  results 
appreciably. 

A  total  of  207  features  was  selected.  Occlusions  are 
marked  by  hand  in  this  experiment.  The  fill  matrix 
of  figure  24  illustrates  the  occlusion  pattern.  Figure  26 
shows  the  image  trajectory  of  60  randomly  selected  fea¬ 
tures. 

Figures  25  and  28  show  a  front  and  a  top  view  of  the 
cup and the visible fingers as reconstructed by the propagation
method of section 4. These figures were obtained,
as  for  the  experiment  in  the  previous  subsection,  by  tri¬ 
angulating  the  tracked  feature  points,  and  mapping  pixel 
values  onto  the  resulting  surface.  The  shape  of  the  cup 
was  recovered,  as  well  as  the  rough  shape  of  the  fingers. 

Figure 27 shows two views of the $i_f$ and $j_f$ rotation
vectors.  Notice  that  motion  is  smoother  here  than  for 
the  hand-held  camera  stream  (figure  21),  but  is  still  far 
less  regular  than  in  the  laboratory  experiments  (compare 
with  figure  16). 

From  the  experiment  discussed  in  this  subsection,  we 
conclude that the solution propagation method of
section 4 deals well with occlusion also when the image
stream  is  taken  outdoors  with  poorly  controlled  motion 
and  lighting.  The  most  urgent  problem  to  be  solved  to 
make  this  propagation  method  into  a  fully  autonomous 
system  is  how  to  detect  occlusion  events  in  the  image 
stream. 

6  Conclusion 

The  factorization  method  is  a  first  step  towards  a  flexible 
and  robust  method  for  shape  and  motion  recovery  from 
television  images.  Even  now,  in  its  early  stages  of  de¬ 
velopment,  the  method  works  for  actual  image  streams 
taken  outdoors  with  an  inexpensive  camera  under  erratic 
motion. 

A  posteriori,  the  conceptual  simplicity  of  the  factor¬ 
ization  method  is  perhaps  its  most  surprising  feature. 
But  this  very  simplicity  makes  the  method  extensible 
to  more  and  more  general  situations.  The  treatment  of 
occlusions  is  a  case  in  point. 
Figure 7: A view of the computed shape from approximately
above the building (compare with figure 8).


Figure 9: For a quantitative evaluation, distances between
the features shown in the picture were measured
on the actual model, and compared with the computed
results. The comparison is shown in figure 10.


Figure 8: A real picture from above the building, similar
to figure 7.


Figure 10: Comparison between measured and computed
distances for the features in figure 9. The number before
the slash is the measured distance, the one after
is the computed distance. Lengths are in millimeters.
Computed distances were scaled so that the computed
distance between features 1)7 and 282 is the same as the
measured distance.
Figure 11: The first frame of the ping-pong stream, with
overlaid features.


Figure 12: The fill matrix for the ping-pong ball experiment.
Shaded entries are known.


Figure 14: Tracks of 60 randomly selected features from
the stream of figure 11.


Figure 15: Rotational component of the camera motion
for the ping-pong stream. Because rotation occurs
around a fixed axis, the two mutually orthogonal unit
vectors $i_f$ and $j_f$, pointing along rows and columns of
the image sensor, sweep two 450-degree cones in space.


Figure  13:  Top  and  side  views  of  the  reconstructed  ping- 
pong  ball. 


Figure 16: Top and side views of the $i_f$ and $j_f$ vectors
identifying the camera rotation. See Figure 15.


Figure 19: A front view of the three reconstructed walls,
with the original image intensities mapped onto the resulting
surface.


Figure 22: A view from above of the three reconstructed
walls, with image intensities mapped onto the surface.


Figure 26: Tracks of 60 randomly selected features from
the cup stream.


Figure 23: Four out of the 240 frames of the cup image
stream.


Figure 24: The 240 x 207 fill matrix for the cup stream
(figure 23). Shaded entries are known.


Figure 27: Top and side views of the $i_f$ and $j_f$ vectors
identifying the camera rotation for the cup stream (figure
23).


Figure 25: A front view of the cup and fingers, with
the original image intensities mapped onto the resulting
surface.


Figure 28: A view from above of the cup and fingers, with
image intensities mapped onto the surface.


Integrated  3D  Recovery  and  Visualization  of  Flight  Image  Sequences 


Sanghoon Sull and Narendra Ahuja *
Beckman  Institute 

University  of  Illinois,  Urbana,  IL  61801 


Abstract 

We  present  an  algorithm  for  estimating  and  visualiz¬ 
ing  motion  of  an  observer  relative  to  a  scene.  A  key 
feature  of  the  approach  presented  is  an  integrated  use 
of  multiple  image  attributes  which  are  shared  by  both 
estimation  and  visualization  processes.  We  focus  on 
flight  image  sequences,  i.e.,  image  sequences  acquired 
by  an  observer  moving  smoothly  over  a  planar,  tex¬ 
tured  surface.  The  approach  presented  allows  the  use 
of  image  cues  such  as  regions,  point  features,  optical 
flow,  texture  gradient  and  vanishing  line.  The  integra¬ 
tion  of  information  in  these  diverse  cues  is  carried  out 
using  optimization.  Visualization  is  done  using  the  im¬ 
age  attributes  extracted  from  the  image  sequence  dur¬ 
ing  3D  recovery.  For  reliable  estimation,  a  sequential 
batch  method  is  used  to  compute  motion  and  structure. 
Experimental  results  on  motion  and  structure  estima¬ 
tion  and  visualization  are  presented  for  a  real  image 
sequence digitized from a commercially available video
tape.  The  visualization  sequence  appears  very  simi¬ 
lar  to  the  original  sequence  in  informal  viewing  on  a 
workstation  monitor. 

1  Introduction 

This  paper  is  concerned  with  the  problem  of  estimat¬ 
ing  and  visualizing  motion  and  structure  of  a  scene  as 
seen  by  an  observer  in  relative  motion.  There  are  two 
objectives  of  the  paper.  First,  it  addresses  the  problem 
of  recovering  motion  and  structure  parameters  from 
a monocular image sequence. Second, it uses the intermediate
results of the recovery process to synthesize
an  image  sequence  that  depicts  the  motion  and  struc¬ 
ture.  A  key  feature  of  the  approach  presented  which 
helps  meet  both  objectives  is  an  integrated  use  of  mul¬ 
tiple  image  attributes  or  cues.  These  cues  carry  the 
motion and structure information of interest to different
degrees, and have different, often complementary,
strengths  and  shortcomings.  Thus  when  a  given  at¬ 
tribute  does  not  contribute  significantly  to  the  estima¬ 
tion  process,  other,  more  pertinent  cues  help  achieve 
reliable  estimation.  The  goal  is  to  estimate  motion  and 
structure parameters such that the estimated parameters
best explain the presence of all of the observed
image  cues  throughout  the  image  sequence. 

The  result  of  the  integrated  recovery  process  is  two 
fold:  it  simultaneously  gives  the  estimates  of  the  mo¬ 
tion  and  structure  parameters  as  well  as  identifies  those 
image  cues  which  are  found  to  contribute  to  these  es¬ 
timates.  In  a  sense,  this  amounts  to  the  identification 
of  image  characteristics  that  carry  the  most  informa¬ 
tion  about  the  relative  motion  and  structure.  A  result 
of  this  is  that  we  can  use  these  image  attributes  for 
creating  depictions  of  the  scene  based  on  the  premise 
that  the  display  of  these  attributes  will  be  the  most 
cost-effective  way  of  communicating  to  the  observer  the 
same  motion  and  structure  characteristics  as  perceived 
from  the  original  image  sequence.  These  depictions 
may  thus  also  be  viewed  as  a  three-dimensional  (3D) 
interpretation  based  approach  to  image  sequence  com¬ 
pression  which  obviously  should  have  very  high  com¬ 
pression  ratios. 

The identification and analysis of the relative
strengths of different cues for the problem at hand is
a research problem in itself. In general, the available
cues, and sometimes even their relative merits, depend
upon the scene under consideration. Within the context
of the navigation scenario mentioned earlier, in
this paper we focus on the problem of an observer moving
above a planar, textured surface, such as in an
aircraft which is landing or taking off. The goal is to
recover the translational and rotational motion of the
observer and the orientation of the plane as a function
of time, from the sequence of images of the plane acquired
during the motion. The approach we present allows
the use of the following image cues: regions, point
features,  optical  flow,  texture  gradient,  and  vanishing 
line.  (The  vanishing  line  is  defined  as  the  intersection 
of  the  image  plane  with  a  plane  which  includes  the  cam¬ 
era  center  and  is  parallel  to  the  object  plane.)  These 
cues  could  be  changed  to  achieve  increased  robustness 
for  the  given  scenario  or  to  suit  a  different  scenario, 
without  affecting  the  basic  approach. 

The framework for integration used in this paper is
one of optimization. The objective function to be minimized
represents the error between the observed image
attributes and those corresponding to motion and
structure parameters, structure consistency through
the image sequence, and a measure of the nonsmoothness
of motion. The visualization is done using the
attributes extracted from the image sequence during
3D recovery.

Section 2 presents an overview of our algorithm for integrated
3D recovery and visualization of motion over
a planar, textured surface from a monocular image sequence.
Section 3 presents the results obtained in experiments
with a sequence of 29 images, digitized from
a commercially available videotape of a film taken from
an aircraft. The visualization sequence obtained appears
compellingly similar to the original when the two
are  played  side  by  side  on  a  SUN  workstation  moni¬ 
tor,  although  we  have  not  performed  any  rigorous  psy¬ 
chophysical  experiments  about  this. 


2  Algorithm 


Figure  1:  A  framework  for  our  approach. 

There are six major steps in our approach, as shown
in Fig. 1. The third and fourth steps are represented
by one block (Integrated Estimation), and shape
from texture was not used in this paper. Details are
presented in [5].

The  goal  of  the  first  step  is  to  detect  independently 
points,  lines  and  regions  in  each  frame.  Optical  flow  is 
also  computed  between  each  pair  of  adjacent  frames. 
Candidate  vanishing  lines  are  identified  from  among 
the detected lines. Image cues are considered only
below the candidate vanishing lines to reduce computation
time. The optical flow is used at only those
locations  where  a  point  feature  detector  responds.  This 
helps  in  selecting  reliable  flow  since  point  features  are 
usually  detected  at  locations  having  high  intensity  gra¬ 
dients. 

The second step establishes correspondences between
features in each pair of adjacent images using a first-order
model of the image plane displacement of the features.
Thus, the displacement vector $(D_x, D_y)$, which
represents the change in position after motion of a feature
located at $(x, y)$, is given by:

$$
D_x = c_0 + c_1 x + c_2 y \tag{1}
$$
$$
D_y = c_5 + c_6 x + c_7 y . \tag{2}
$$

Each distinct image plane motion is represented by a
distinct set of values for the coefficients $c_i$. All such
sets of $c_i$ supported by the feature locations in two
adjacent frames are identified using the Hough transform.
The support for any set of $c_i$ values is computed from
the image plane distances between the observed feature
locations and those predicted by the $c_i$ values in
the set under consideration. The well supported $c_i$ values
then simultaneously determine a segmentation of
the  images  into  distinctly  moving  objects,  and  estab¬ 
lish  correspondences  between  features  contained  within 
each  object.  In  general,  correspondences  are  not  found 
for  all  attributes  contained  in  an  object;  some  of  the 
attributes  remain  unmatched. 
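The notion of support for one set of coefficients can be sketched as follows. This is a simplified stand-in, not the Hough voting itself: it evaluates a single candidate coefficient set by counting features whose predicted position lies within a tolerance of some observed feature in the next frame. The function names, the grouping of the coefficients into `cx` (for $D_x$) and `cy` (for $D_y$), and the tolerance `tol` are assumptions made for illustration.

```python
import numpy as np


def predict(points, cx, cy):
    """Apply the first-order displacement model to an (n, 2) array of points.

    cx holds the three coefficients of D_x (constant, x, y terms) and
    cy holds the three coefficients of D_y.
    """
    x, y = points[:, 0], points[:, 1]
    dx = cx[0] + cx[1] * x + cx[2] * y
    dy = cy[0] + cy[1] * x + cy[2] * y
    return points + np.column_stack([dx, dy])


def support(points_a, points_b, cx, cy, tol=1.0):
    """Count features of frame A whose prediction lands near a feature of frame B.

    Returns the count and the list of matched index pairs (i in A, j in B).
    """
    pred = predict(points_a, cx, cy)
    # Pairwise distances between predicted and observed locations.
    d = np.linalg.norm(pred[:, None, :] - points_b[None, :, :], axis=2)
    nearest = d.argmin(axis=1)
    matched = d[np.arange(len(pred)), nearest] <= tol
    pairs = [(int(i), int(nearest[i])) for i in np.flatnonzero(matched)]
    return int(matched.sum()), pairs
```

A coefficient set with high support both groups the features that move together and pairs them across the two frames, which is the double role the text attributes to the well supported values.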

In the third step, the correspondences found are used to
obtain  the  fits  of  the  displacement  vectors  in  different 
segments  to  a  second  order  model.  The  errors  in  the 
resulting  fits  are  used  to  merge  any  distinct  first-order 
segments  that  now  have  identical  second  order  param¬ 
eters.  In  the  problem  at  hand,  this  leads  to  a  merger  of 
all  segments  since  all  features  are  known  to  belong  to 
a  single  plane.  Here,  the  motion  and  structure  param¬ 
eters  can  be  linearly  computed  from  pairs  of  adjacent 
images. 

The  objective  of  the  fourth  step  is  to  use  the  es¬ 
tablished  feature  correspondences  to  determine  motion 
and  structure  (i.e.,  the  orientation  of  the  plane).  These 
parameters  could  be  computed  in  a  linear  fashion  from 
pairs  of  adjacent  images.  However,  when  an  image 
is  paired  with  its  predecessor  and  successor  images  in 
the  sequence,  the  resulting  structure  parameters  will 
in  general  not  be  identical.  Thus,  the  requirement  of 
such  consistency  of  structure  parameters  must  be  ex¬ 
plicitly  enforced.  This  makes  the  motion  and  structure 
estimation  a  nonlinear  problem.  To  enforce  this  re¬ 
quirement,  we  must  consider  a  batch  of  frames  at  a 
time.  We  therefore  perform  motion  and  structure  es¬ 
timation  over  a  sliding  window  of  N  frames  along  the 
image  sequence.  For  each  such  window,  the  motion 
and  structure  parameters  are  estimated  by  minimizing 
the  following  objective  function  with  respect  to  motion 
and  surface  orientation  parameters: 


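$$
G = \sum_{k=0}^{N-2} \left( \lambda_r E_{k,r} + \lambda_p E_{k,p} + \lambda_l E_{k,l} + \lambda_f E_{k,f}
+ \lambda_{sc} E_{k,sc} + \lambda_{mco} E_{k,mco} \right)
+ \sum_{k=0}^{N-1} \lambda_{vl} E_{k,vl} \tag{3}
$$

where

$E_{k,r}$: error from regions
$E_{k,p}$: error from points
$E_{k,l}$: error from lines
$E_{k,f}$: error from optical flows
$E_{k,sc}$: structure constraints
$E_{k,mco}$: motion coarseness
$E_{k,vl}$: error from the vanishing line.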
This objective function contains the contributions of
multiple features to the scene characteristics to be estimated.
Each contribution is weighted by a factor
$\lambda$. The term $E_{k,sc}$ does not restrict motion or structure.
It requires that the structure at $t = k$ should
be the same whether computed from the pair of frames
$(k-1, k)$ or $(k, k+1)$. $E_{k,mco}$ makes the motion parameters
vary smoothly, which is quite reasonable, especially
when we are dealing with multiple frames. $E_{k,vl}$ is the
penalty term which makes the orientation parameters
stay within a certain range of the initial values computed
from the candidate vanishing lines. We could
use the linear solutions from the previous step as initial
guesses for this nonlinear minimization; however,
the surface orientations corresponding to the detected
vanishing lines are used instead, since they are usually better
than the linear solutions. The objective function $G$ is
optimized with respect to surface orientation parameters
only. Hence, the number of iteration variables
is $2(N - 1)$ for surface unit normals in a batch, since
motion parameters which minimize $G$ are linearly computed
for given surface orientations.

The  motion  and  structure  estimates  obtained  from 
each  batch  are  compatible  with  the  attributes  of  the 
images  in  the  batch.  Clearly  the  larger  the  batch,  the 
more  compatible  the  estimates  will  be  with  the  image 
sequence. In this fifth step, the motion parameters derived
from the batch computations are sequentially updated as
follows (any sequential updating algorithm available in the
literature could be used instead):

$$m(k) = m(k-1) + \mu_k \left( \tilde{m}(k) - m(k-1) \right) \qquad (4)$$

where $\tilde{m}(k)$ is the present estimate obtained using the
batch approach on the most recent block of data at
t = k, and $\mu_k$ is the smoothing parameter based on
the objective function value at t = k. The result is a
set of estimates of rotation and normalized translation
parameters obtained from the image sequence. Then,
using these estimated motion parameters, we recompute
the orientations for all frames. Finally, we rescale the
translation and structure parameters starting from the
initial frame by using any set of points on the plane.
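
The recursive update of Eq. (4) can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the function name `sequential_update` is hypothetical, the smoothing parameters are passed in directly because the text does not give the rule that maps the objective function value to the smoothing parameter, and the first batch estimate is taken as the starting value.

```python
def sequential_update(batch_estimates, gains):
    """Smooth batch estimates with m(k) = m(k-1) + mu_k * (m_batch(k) - m(k-1)).

    batch_estimates: list of parameter vectors, one per batch (hypothetical layout).
    gains: one smoothing parameter mu_k per batch; how mu_k is derived from the
           objective function value is not specified in the text, so it is an input.
    """
    if len(batch_estimates) != len(gains):
        raise ValueError("one gain per batch estimate is required")
    smoothed = []
    previous = None
    for estimate, mu in zip(batch_estimates, gains):
        if previous is None:
            # Assumption: the first batch estimate initializes the recursion.
            current = list(estimate)
        else:
            current = [p + mu * (e - p) for p, e in zip(previous, estimate)]
        smoothed.append(current)
        previous = current
    return smoothed
```

A gain close to 1 trusts the newest batch estimate, while a gain close to 0 keeps the running estimate almost unchanged.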

Finally,  in  the  sixth  step,  we  use  the  estimated  mo¬ 
tion  and  structure  parameters  for  visualization.  For 
this  purpose,  only  those  image  attributes  for  which 
correspondences  are  found  in  the  first  step  are  con¬ 
sidered.  The  remaining  attributes  are  discarded.  The 
synthesized  sequence  is  a  simplified  depiction  of  the 
retained  attributes.  Now,  an  attribute  may  not  be 
present  throughout  an  image  sequence  because,  for  ex¬ 
ample,  it  may  not  be  detected  in  each  image.  Each 
such  attribute  is  introduced  in  each  image  where  it  is 
missing.  This  is  done  by  extrapolating  from  the  nearest 
frame  where  it  is  detected,  using  the  estimated  motion 
and structure values. A Jacobian-based area relationship
is used for each pixel in a region for the extrapolation.


3  Experimental  results 

We derived a sequence of 29 frames from a commercially
available VHS video tape of a film shot from a
flying aircraft. The digitization was done at a resolution
of 720 by 486. The frames were initially digitized
in three RGB channels with 8 bits per color. Color
images were converted to HSV (Hue, Saturation and
Value) [3]. In our experiments, we use only the V component,
which is defined as max(R, G, B). The size of
each image is reduced to 600 by 464 to remove the
jitter on the boundary. In Fig. 6, we show the resulting
29 frames. Since the commercial VHS tape is far from
having the quality of the master tape, the digitized images
are very noisy. The images are also blurred since
they were taken from a camera mounted on a flying
aircraft.
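
The preprocessing described above (taking V = max(R, G, B) and trimming the border) can be sketched as follows. The helper names are hypothetical, images are plain nested lists for the sake of a self-contained example, and the crop is assumed to be centered, which the text does not state.

```python
def value_channel(rgb_image):
    """Return the HSV value channel, V = max(R, G, B), for every pixel."""
    return [[max(pixel) for pixel in row] for row in rgb_image]


def center_crop(image, out_width, out_height):
    """Crop a nested-list image to out_width x out_height around its center.

    Assumption: the border removed to suppress jitter is symmetric.
    """
    height = len(image)
    width = len(image[0])
    if out_width > width or out_height > height:
        raise ValueError("crop size exceeds image size")
    top = (height - out_height) // 2
    left = (width - out_width) // 2
    return [row[left:left + out_width] for row in image[top:top + out_height]]
```

For the sequence in the text, the call would be `center_crop(value_channel(frame), 600, 464)` on a 720 by 486 frame.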

Next, we extract regions, lines and flow as image
attributes. Examples of these detected features are
shown in Figs. 2, 3 and 4. The results of segmentation,
matching and merging for one frame are shown in
Fig. 5. For segmentation and matching, we use only
regions and flow which are below the detected candidate
vanishing lines.

When we iteratively minimize Eq. (3), we set $\lambda_l$ and
$\lambda_p$ equal to zero since we do not use line and point
features. By setting $\lambda_{mco} = 0$, we do not impose any
smoothness restriction on motion and structure. The
values of $\lambda_r$ and $\lambda_f$ used are equal to one. Compared
to them, we use larger values of $\lambda_{sc}$ and $\lambda_{vl}$.

The visualization sequence (Fig. 7) is synthesized by
displaying (i) those regions whose correspondences are
used for motion and structure estimation and (ii) the
vanishing line derived from the estimated surface
orientation. As explained earlier, missing regions are
introduced based on the estimated motion and structure
parameters. The original sequence and the resulting
visualization sequence are presented in Fig. 6 and Fig. 7.
When the two sequences are played on a SUN workstation
monitor, informal viewing suggests that they convey the
same motion and structure.

Figure 2: Extracted regions at t = 25.
Figure 3: Extracted lines at t = 25.

Abstract

We  present  an  approach  to  improve  the  match¬ 
ing  of  features  in  the  domain  of  feature-based 
motion  analysis  for  multiple  frames.  In  an  au¬ 
tomated  system,  correspondence  data  are  usu¬ 
ally  noisy  and  fragmented.  Either  synthetic 
data  with  Gaussian  noise  added  or  manually 
selected  feature  correspondence  has  been  most 
commonly  used  for  motion  analysis.  We  com¬ 
bined  establishment  of  correspondence  and 
motion  estimation  and  developed  a  technique 
that  gradually  refines  the  initial  noisy  corre¬ 
spondence  data  and  links  fragments  of  a  sin¬ 
gle  feature  into  one  trajectory  using  feedback 
from  the  prior  3-D  motion  estimation.  First,  3- 
D  motion  parameters  are  estimated  using  the 
noisy  initial  correspondence  data.  Then,  each 
noisy  trajectory  is  partitioned  into  overlapping 
subsets  each  of  which  conforms  to  the  esti¬ 
mated  motion.  The  largest  set  is  selected  as 
the  input  to  the  next  motion  estimation.  This 
selection  process  is  repeated  and  the  gaps  in 
the  refined  correspondence  data  are  filled  by 
guidance  from  the  predicted  motion.  Test  re¬ 
sults  for  a  standard  real  image  sequence  are 
presented. 

1  Introduction 

Motion  analysis  is  one  of  the  important  research  areas 
in  computer  vision.  The  goal  of  motion  analysis  is  to 
detect  coherently  moving  objects  and  to  recover  the  rel¬ 
ative  motion  between  a  viewer  and  the  objects  as  well 
as the structure of the environment. The information
about moving objects may assist in solving problems such
as segmentation and shape analysis.

Feature-based motion analysis techniques are built on
a common framework: feature extraction, establishment
of correspondence, estimation of motion parameters
and recovery of 3-D structure. The recovered structure
is a sparse depth map, and interpolation is needed
if a dense depth map is required.

Most earlier motion analysis efforts have been concerned
with only two or three frames [Roach and Aggarwal,
1980; Tsai and Huang, 1984; Adiv, 1985; Anandan,
1987]. Reconstruction of 3-D structure from 2-D data
is an inherently ill-posed problem. The formulations are
very sensitive to noise in the input data and the results
are often unstable. To overcome the inherent instability,
longer sequences of images must be used.

Broida [Broida et al., 1990] analyzed images of objects
undergoing a general motion, assuming no knowledge
of the object. They also developed Cramer-Rao lower
bounds, which indicate a theoretical bound for the
performance of the estimation. Shariat [Shariat and Price,
1990] presented a method for the computation of 3-D
motion parameters by using the correspondence of point
features over multiple frames. They developed a
mathematical formulation for 3 points in 3 frames, 2 points in
4 frames and 1 point in 5 frames.

In  order  to  fully  utilize  the  advantage  of  multiple 
frames,  feature  correspondences  ranging  over  the  entire 
sequence  are  desired.  However,  obtaining  reliable  corre¬ 
spondence  over  multiple  frames  is  a  nontrivial  task  for 
several  reasons. 

First,  the  set  of  extracted  features  is  not  identical  for 
all  frames.  This  can  be  caused  by  variations  in  the  ex¬ 
tractions  of  features.  Therefore,  a  feature  may  not  have  a 
proper  correspondence  and  the  number  of  features  may 
be  different  from  frame  to  frame.  If  a  feature  fails  to 
be  extracted  in  some  of  the  frames  due  to  occlusion  or 
other  reasons,  its  trajectory  is  fragmented  into  shorter 
ones  and  the  depth  estimation  of  the  feature  gets  less 
stable.  These  short  corresponding  sequences  should  be 
linked  into  one  for  a  reliable  depth  estimation.  Second,  a 
correct  correspondence  over  the  entire  sequence  requires 
correct  matching  over  all  pairs  of  adjacent  frames.  Once 
a  matching  error  occurs  between  any  two  frames,  the 
entire  trajectory  may  be  useless. 

Jenkin  [Jenkin,  1983]  presented  a  method  for  tracking 
the  3-D  motion  of  points  from  their  2-D  perspective  im¬ 
ages  as  viewed  by  a  binocular  vision  system.  He  used  a 
smoothness  assumption  that  the  location,  the  speed  and 
the  direction  of  a  given  point  feature  would  be  relatively 
unchanged from one frame to the next.

Sethi [Sethi and Jain, 1987] suggested a method of
establishing the correspondence based on the smoothness
of motion. Their work assumed that the
number of extracted points is always the same except
for one frame where some points may disappear due to
occlusion. A trajectory is generated for each point by a
nearest neighbor heuristic. Then, the matches of
intermediate points are updated until there is no further gain
in smoothness.

Cheng  [Cheng  and  Aggarwal,  1990]  relaxed  the  re¬ 
striction  of  the  same  number  of  points  for  every  frame. 
The  first  stage  of  their  algorithm  establishes  an  initial 
correspondence  by  maximizing  the  smoothness  of  mo¬ 
tion.  The  second  stage  inspects  the  last  four  frames  and 
arranges  the  correspondence  among  them. 

The test results for the last three are either on
synthetic data or on manually selected point features from
real images. Fletcher [Fletcher et al., 1991] developed an
algorithm for tracking multiple feature points in a real
image sequence. After an initial correspondence of the
first few frames, the matching for subsequent frames is
guided by the 2-D motion of the features. The tracking
is highly dependent on the prediction from the 2-D
motion and fails when the prediction fails.

Recently, there has been work on automated motion
analysis systems. Leung [Leung et al., 1991] combined
feature extraction, matching and motion estimation
algorithms, and applied them to well-calibrated outdoor
scenes. Chandrashekhar [Chandrashekhar and Chellappa,
1991] also combined the same operations and used
the prediction from 3-D motion for feature correspondence.
Sull [Sull and Ahuja, 1991] developed an algorithm
which segments and matches the regions and then
estimates 3-D motion parameters and structure from two
views.

Analysis of motion from a real image sequence is a
complex task. A more reliable result can be achieved with
an integrated approach, where feature extraction,
establishment of correspondence and motion analysis are
performed in a cooperative manner, exchanging information
among the separate subsystems. For example, the
performance of feature matching is enhanced when guided
by prediction from motion. The feature extraction of an
object in one frame can be guided by its expected
properties induced from the corresponding objects in other
frames. We show that an improvement can be achieved
with an integrated approach. We combined feature
correspondence and motion estimation. Initial noisy
correspondence data are gradually refined and fragments of
a trajectory of a single feature are linked into one using
feedback from the 3-D motion estimation.

This  paper  describes  part  of  a  system  that  extracts 
the  structure  of  a  scene  from  a  moving  camera.  The  mo¬ 
tion  of  the  camera  is  arbitrary,  but  in  applications  such 
as  autonomous  vehicles,  the  motion  would  be  roughly 
along  the  line  of  sight  of  the  camera.  This  work  is  in 
the  domain  of  feature-based  motion  analysis.  The  work 
reported  here  concentrates  on  the  use  of  3-D  motion  to 
improve  the  matching  of  features  and  the  resulting  im¬ 
provement  in  the  motion  and  structure  estimation. 

In the next section, we give an overall description of
the system. A detailed description of the refinement of
noisy correspondence using 3-D motion information is in
section 3. Finally, we present the test results on real
image data, followed by a brief conclusion.

Figure 1: Overall Block Diagram

2  Description  of  Integrated  System 

The  block  diagram  in  figure  1  shows  the  use  of  each  sub¬ 
system  and  the  feedback  of  information.  The  basic  fea¬ 
tures  are  regions  and  corners  extracted  from  the  contour 
of  the  regions.  Each  block  is  described  briefly. 

Each  image  in  the  sequence  is  segmented  into  a  set  of 
regions  by  a  recursive  splitting  technique  that  uses  the 
statistics  of  image  attributes  [Ohlander  et  al.,  1978].  Seg¬ 
mented  regions  are  recursively  segmented  into  smaller 
regions  until  the  size  of  the  region  is  too  small  or  the 
attribute  is  inseparable.  In  our  work,  the  segmentation 
of  the  current  frame  is  guided  by  that  of  the  previous 
one  for  a  consistent  segmentation  of  regions  from  frame 
to  frame. 

Corners are obtained at the intersection of those linear
segments which approximate the contour of a region.
When an object has a clear polygonal shape, the positions of
the corners tend to be consistent over a wide range of scale
changes and rotations. Contours of natural objects such
as trees, rocks and colored spots on the ground are usually
irregular curves, so long matches of consistent corners
are less likely to be obtained for natural objects.

Feature matching is done in a hierarchical way to
speed up computation and increase stability. Region
correspondences are first established and corners are
matched along the matched regions. Matching is performed
in both the forward and backward directions to ensure
a one-to-one match of features.

Relaxation-based symbolic matching [Faugeras and
Price, 1981] is used for both region and corner matching.
The matching system uses a feature-based symbolic
description. For region matching, the features include
average values of the image intensity, size, location and
simple shape measures. Relations included in the
description are also those which are easily computed, such
as adjacency, relative position and near-by.
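
The forward and backward matching mentioned above can be illustrated with a small mutual-consistency check. This is a simplified sketch and not the relaxation-based matcher itself: it assumes a precomputed table of match strengths (the hypothetical `scores` argument) and keeps a pair only when each feature is the other's best match.

```python
def one_to_one_matches(scores):
    """Keep (i, j) only if j is the best forward match of i and
    i is the best backward match of j.

    scores[i][j]: assumed match strength between feature i in one frame
    and feature j in the next frame (higher is better).
    """
    if not scores or not scores[0]:
        return []
    n_rows = len(scores)
    n_cols = len(scores[0])
    # Forward pass: best candidate in the next frame for each feature.
    forward = [max(range(n_cols), key=lambda j: row[j]) for row in scores]
    # Backward pass: best candidate in the previous frame for each feature.
    backward = [max(range(n_rows), key=lambda i: scores[i][j])
                for j in range(n_cols)]
    return [(i, j) for i, j in enumerate(forward) if backward[j] == i]
```

Features that lose the mutual check are left unmatched, which is one way a one-to-one assignment can be guaranteed.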

For  corner  matching,  the  properties  of  the  corners  are 
position,  inside-angle,  in  and  out-direction  (the  angle 
and  the  orientations  of  the  two  line  segments  forming 
the  corner),  radial-angle  and  radial-distance  (the  orien¬ 
tation  and  distance  from  the  center  of  the  region). 

3-D  trajectory  and  structure  are  estimated  with 
chronogeneous  motion  analysis  [Franzen,  1991],  which 
can  handle  uniform  acceleration  with  constant  trans¬ 
lation  and  rotation.  This  program  requires  at  least  3 
frames  with  visible  points  but  allows  points  to  be  skipped 
in  some  frames. 

3  Refinement  and  Linking  of 
Correspondence 

The main sources of noise in correspondence data are
inconsistent feature extraction and erroneous feature
matches. These two are somewhat inseparable since many
of the matches among the inconsistent features are
correct in the symbolic sense while the resulting
correspondence data are noisy. When the object has a smooth
boundary, the extracted contour has no distinct corners
when approximated by linear segments and the location
of a corner may drift from frame to frame. This
uncertainty in the position degrades the reliability of the
estimation of the motion parameters. One advantage of
the two-step feature match (region and corner) is that, when
the underlying region match is entirely at fault, very few
corner matches (whether correct or incorrect)
are obtained, since the similarity and compatibility
between corners are very low in the relaxation matching.

Image  acquisition  under  unfavorable  conditions  also 
deteriorates  the  quality  of  the  correspondence.  When 
the  camera  is  mounted  on  a  vehicle  moving  on  terrain, 
the  motion  of  the  camera  may  have  fluctuations,  which 
violates  the  assumption  that  the  3-D  motion  parameters 
are  constant  throughout  the  observation  time.  When 
the  motion  of  the  camera  is  a  little  out  of  the  normal 
path,  all  the  features  in  that  frame  are  uniformly  shifted. 
Uniform shifts cause correlated noise, which is known
to have worse effects than random noise.

It  is  desirable  to  detect  and  delete  the  erroneous  part 
of  noisy  data  and  select  only  the  good  part  for  motion 
estimation.  We  review  two  measures  of  reliability  for  the 
refinement,  smoothness  and  fitting  error. 

The smoothness of the curve formed by the trajectory
proved to be useful 2-D information for the establishment
of correspondence [Sethi and Jain, 1987] when no
knowledge about the motion is available. The usefulness
of smoothness as the measure of reliability of established
correspondence, however, is limited in our work. First,
the trajectories are not always long enough to produce
any meaningful smoothness measure. Furthermore, the
smoothness of a trajectory with just one or two spurious
points can be very low while an entirely faulty trajectory
may form a very smooth trajectory.

Figure 2: Refinement and linking of noisy trajectories

The  fitting  error  is  the  difference  between  the  input 
position  and  its  estimated  position  when  the  motion  in¬ 
formation  is  available.  This  measure  can  indicate  the 
quality  of  a  trajectory,  but  does  not  point  out  which 
points  are  the  spurious  ones  (gross  points)  in  the  trajec¬ 
tory  [Fischler  and  Bolles,  1981]  when  the  motion  estima¬ 
tion  is  based  on  least  mean  square  error  method. 

When one or two spurious points cause a trajectory
to be partially faulty, as is often the case with the corner
matches in our experiments, neither the smoothness
measure nor the fitting error measure is likely to give a
good rating to the trajectory. Parts of such trajectories
are usable data for motion estimation. We developed
a method where a trajectory is analyzed and spurious
points are detected and deleted by clustering of points
in the trajectory.

3.1 Refinement of Correspondence

Figure 2 illustrates the basic idea of refinement and linking
of noisy correspondence data, where a box with a
question mark represents a spurious point.

Trajectory 3 is highly noisy and thus useless. Parts of
point trajectories 1, 4, 5 are faulty. The spurious points
should be excluded from the input to motion estimation.
The trajectory of a single feature is fragmented
into shorter ones (trajectories 2 and 5) that should be
linked into one. Otherwise they lose their identity and
each one is treated as a separate feature. Furthermore,
the total disparity vectors for fragmented features are
shorter than those from a complete sequence and thus
the depth estimation is less stable.

The noisy data are continuously refined in the feedback
loop. First, motion parameters are estimated using
the raw correspondence data. Then, each noisy trajectory
is partitioned into overlapping subsets of points, each
of which conforms to the estimated motion. The largest
set is selected as the input for the next motion estimation.
If a trajectory consists of good matches except for
a few frames, then the bad points are detected. When
a false match has merged two separate (correct) trajectories
into one, usually the larger correct set is found. If the
trajectory is too noisy to produce any reliable subsets,
then it is deleted. This selection process is repeated 3 or
4 times until a set of reliable correspondences is selected.

Refined correspondences are linked with guidance
from the predicted motion. Linked data are refined again
since there may be additional spurious points.

3.1.1 Selection of Good Points from a Trajectory

A  trajectory  is  partitioned  into  sets  of  compatible 
points  and  the  best  set  is  selected.  Partitioning  is  based 
on  clustering  of  compatible  point  pairs  rather  than  clus¬ 
tering  of  compatible  points. 

Figure 3: Compatibility between pairs of points in a
trajectory. The X mark is the predicted location and the circles
indicate allowable errors.

Given M, an estimate of motion, and a trajectory including
points (A B C), a pair of points (A B) is defined
to be able to predict point C if point C can be
approximately obtained by extrapolating (A B) in accordance
with M as described later. The tolerance in the
approximate position of C depends on the 2-D distances
between the points of the pair (A B) and the point C:

$$\|\hat{C} - C\| < \alpha \min(\|A - C\|, \|B - C\|),$$

where $\hat{C}$ is the predicted position of C and $\alpha$ is 0.25 in our experiments.
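
The tolerance test above is easy to state in code. The sketch below is illustrative only: the function name `can_predict` is hypothetical, points are 2-D tuples in image coordinates, and the predicted position is assumed to have been computed elsewhere by extrapolating the pair (A B) under the estimated motion.

```python
import math

ALPHA = 0.25  # tolerance factor reported in the text


def can_predict(a, b, c, predicted_c, alpha=ALPHA):
    """Return True if the pair (a, b) predicts point c within tolerance.

    predicted_c: position of c extrapolated from (a, b) under the estimated
    motion (computed elsewhere; an input here by assumption).
    """
    tolerance = alpha * min(math.dist(a, c), math.dist(b, c))
    return math.dist(predicted_c, c) < tolerance
```

Because the tolerance scales with the distance from the pair to C, predictions for nearby points are held to a tighter absolute error than predictions for distant ones.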

If  a  pair  of  points  (A  B)  with  a  reliable  motion  esti¬ 
mate  in  a  trajectory  is  a  good  match,  then  (A  B)  should 
predict  the  positions  of  other  good  points  in  the  trajec¬ 
tory.  In  the  example  of  figure  3,  points  (1  2  3  4  5  6)  form 
a  trajectory,  where  points  (2  5)  are  the  spurious  points. 
Pairs (1 3) and (4 6) are compatible, and pairs (1 2) and
(3 4) are not compatible.

Clustering  is  done  by  partitioning  of  a  graph,  where 
each  node  is  a  pair  of  points  and  the  presence  of  an  edge 
indicates  that  the  two  pairs  are  compatible. 

The clustering algorithm is given below, where a
pair (A B) is defined to be compatible with a cluster of
pairs if (A B) is able to predict all pairs of points in the
cluster.

Clustering Algorithm

For each node in a graph
    if the node does not belong to a cluster
    then if there is a cluster with which
            the node is compatible
         then put the node into the cluster
         else generate a new cluster
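
One possible reading of the clustering algorithm above is a single greedy pass over the nodes, sketched here in Python. The function name `cluster_pairs` and the `compatible` callback are hypothetical; in the actual system compatibility would come from the motion-based prediction test, whereas the toy predicate below simply marks points 2 and 5 as spurious.

```python
from itertools import combinations


def cluster_pairs(nodes, compatible):
    """Greedy clustering: put each node into the first cluster whose members
    are all compatible with it, otherwise start a new cluster."""
    clusters = []
    for node in nodes:
        for cluster in clusters:
            if all(compatible(node, member) for member in cluster):
                cluster.append(node)
                break
        else:
            clusters.append([node])
    return clusters


# Toy stand-in for the motion-based test: two pairs are compatible only
# when all of their points belong to the set of good points.
good_points = {1, 3, 4, 6}
nodes = list(combinations(range(1, 7), 2))  # every pair of points 1..6
clusters = cluster_pairs(nodes, lambda p, q: set(p) | set(q) <= good_points)
largest = max(clusters, key=len)
```

With this toy predicate the largest cluster contains exactly the six pairs formed from points (1 3 4 6), and every pair touching a spurious point ends up alone.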


After clustering, we get several fully connected subgraphs.
Of the subgraphs with connectivity above a
threshold, the set with the largest number of points is
chosen for the input to the motion estimation in the next
iteration. Connectivity is defined as:

$$\text{connectivity} = \frac{\text{number of edges}}{\text{number of points}}$$

Figure 4: Selection of cluster

Figure  4  shows  the  clusters  from  the  trajectory  in  fig¬ 
ure  3.  All  pairs  of  points  (1  3  4  6)  are  compatible.  The 
pairs  (1  3)  and  (4  5)  are  compatible  even  though  point 
5  is  a  spurious  point.  Both  subgraphs  include  the  same 
number  of  points.  The  connectivity  of  the  graph  com¬ 
posed  of  points  (1  3  4  6)  is  3.75  and  the  connectivity  of 
that  including  the  spurious  points  is  0.25. 
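
The two connectivity values quoted above can be reproduced with a short calculation. The sketch assumes that each node is a point pair and that the cluster on points (1 3 4 6) is fully connected, so its 6 nodes share 15 edges; reading the value 0.25 as a single edge over four points is an interpretation, not something the text states.

```python
def complete_graph_edges(num_nodes):
    """Number of edges in a fully connected graph with num_nodes nodes."""
    return num_nodes * (num_nodes - 1) // 2


def connectivity(num_edges, num_points):
    """Connectivity as defined in the text: edges divided by points."""
    return num_edges / num_points


# Points (1 3 4 6): 4 points give 6 point pairs (nodes), all mutually compatible.
good_cluster = connectivity(complete_graph_edges(6), 4)

# Assumed reading of the second subgraph: one edge spread over four points.
spurious_cluster = connectivity(1, 4)
```

Under these assumptions `good_cluster` is 15 / 4 = 3.75 and `spurious_cluster` is 1 / 4 = 0.25, matching the figures in the text.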

3.1.2  Extrapolation  from  a  Point  Pair 

In order to extrapolate a point pair in the 3-D space,
the depths for the points are necessary. Chronogeneous
motion analysis used in this research produces the depth
of each point feature. This estimated depth is not used
in the extrapolation since it is obtained for the entire
(possibly noisy) trajectory. When the positions in the
image plane are available for two frames, the depth of a
point at each frame can be computed as follows:

Given 2-dimensional points $P_i, P_j$ in the image plane
and an estimate of the motion parameters,
compute the three-dimensional points $Q_i, Q_j$ which
are projected onto $P_i, P_j$ in the image plane.

We define

$T_{i,j}$: translation vector (from frame i to j)
$R_{i,j}$: rotation matrix (from frame i to j)

We let $Q_i = K_i P_i$ and $Q_j = K_j P_j$ as shown in figure 5.
Then

$$R_{i,j} Q_i + T_{i,j} = Q_j$$

Hence

$$K_i R_{i,j} P_i + T_{i,j} = K_j P_j$$

Then

$$\begin{bmatrix} R_{i,j} P_i & -P_j \end{bmatrix} \begin{bmatrix} K_i \\ K_j \end{bmatrix} = -T_{i,j}$$

Let $A = [\, R_{i,j} P_i \mid -P_j \,]$.

Then, in the least-squares sense,

$$\begin{bmatrix} K_i \\ K_j \end{bmatrix} = -(A^T A)^{-1} A^T T_{i,j}$$

Finally, $Q_i = K_i P_i$ and $Q_j = K_j P_j$.

If either $K_i$ or $K_j$ is negative, the point pair $(P_i, P_j)$
is not used in the clustering since it is on the wrong side
of the image plane and in conflict with the estimated
motion parameters.
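
The depth computation above amounts to solving a small overdetermined linear system. The following sketch solves it through the normal equations in plain Python; the function names are hypothetical, image points are assumed to be 3-vectors such as (x, y, 1), and the use of a least-squares solution is an assumption about how the 3-by-2 system is resolved.

```python
def mat_vec(matrix, vector):
    """Multiply a 3x3 matrix (nested lists) by a 3-vector."""
    return [sum(matrix[r][c] * vector[c] for c in range(3)) for r in range(3)]


def dot(u, v):
    return sum(a * b for a, b in zip(u, v))


def recover_depths(p_i, p_j, rotation, translation):
    """Solve [R P_i | -P_j] [K_i, K_j]^T = -T in the least-squares sense.

    p_i, p_j: image points as 3-vectors (assumed form, e.g. (x, y, 1)).
    rotation, translation: estimated motion from frame i to frame j.
    Returns the scale factors (K_i, K_j).
    """
    a1 = mat_vec(rotation, p_i)        # first column of A
    a2 = [-x for x in p_j]             # second column of A
    b = [-t for t in translation]      # right-hand side
    # Normal equations (A^T A) k = A^T b for the 3x2 matrix A.
    g11, g12, g22 = dot(a1, a1), dot(a1, a2), dot(a2, a2)
    h1, h2 = dot(a1, b), dot(a2, b)
    det = g11 * g22 - g12 * g12
    if abs(det) < 1e-12:
        raise ValueError("rays are parallel; depths are not determined")
    k_i = (g22 * h1 - g12 * h2) / det
    k_j = (g11 * h2 - g12 * h1) / det
    return k_i, k_j


def pair_is_usable(k_i, k_j):
    """A pair is rejected when either scale factor is negative."""
    return k_i > 0 and k_j > 0
```

The 3-D points then follow as Q_i = K_i P_i and Q_j = K_j P_j, and `pair_is_usable` mirrors the sign test described in the text.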

The 3-D position of the point, $Q_k$, at an arbitrary
frame k is computed as either

$$Q_k^i = R_{i,k} Q_i + T_{i,k}$$

or

$$Q_k^j = R_{j,k} Q_j + T_{j,k}$$

where $Q_k^j$ is an estimate of $Q_k$ with reference to $Q_j$.
Of the two estimates $P_k^i$ and $P_k^j$, which are the
projections of $Q_k^i$ and $Q_k^j$ on the image plane, the better
approximation of the point $P_k$ is used in the compatibility
computation.
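
The extrapolation to frame k and the choice between the two predictions can be sketched as follows. The names are hypothetical, a unit focal length is assumed for the projection, and "better approximation" is taken to mean the prediction closer to the observed image point.

```python
import math


def transform(rotation, translation, q):
    """Apply Q' = R Q + T to a 3-D point."""
    return [sum(rotation[r][c] * q[c] for c in range(3)) + translation[r]
            for r in range(3)]


def project(q):
    """Perspective projection onto the image plane (unit focal length assumed)."""
    return (q[0] / q[2], q[1] / q[2])


def best_prediction(q_i, q_j, motion_i_to_k, motion_j_to_k, observed_p_k):
    """Predict the image position at frame k from both Q_i and Q_j and
    return the prediction closer to the observed point.

    motion_i_to_k, motion_j_to_k: (rotation, translation) tuples.
    """
    candidates = [
        project(transform(motion_i_to_k[0], motion_i_to_k[1], q_i)),
        project(transform(motion_j_to_k[0], motion_j_to_k[1], q_j)),
    ]
    return min(candidates, key=lambda p: math.dist(p, observed_p_k))
```

The returned position is what would then be compared against the observed point in the tolerance test.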


Figure 5: $Q_i, Q_j$ are at the extension of $P_i, P_j$.


3.2  Linking  of  Correspondence 

After  the  refinement  of  initial  correspondence  data,  we 
get  a  set  of  refined  point  trajectories.  Each  point  in  a 
trajectory  is  marked  as  S  (Selected)  or  D  (Discarded). 
We  define  the  effective  length  of  a  trajectory  as  the  num¬ 
ber  of  points  marked  S  in  it. 

If  linking  is  done  only  among  already  established  tra¬ 
jectories,  then  the  scope  of  possible  matches  is  limited 
and  an  isolated  corner  does  not  have  a  chance  of  being 
linked  with  a  trajectory.  Hence  we  extend  a  trajectory 
whose  effective  length  is  longer  than  3  and  look  for  the 
best  matching  corner  in  each  frame  along  the  predicted 
extension.  The  predicted  path  for  a  point  trajectory  is 
obtained  from  the  motion  parameter  estimation. 

A  window  centered  at  the  predicted  position  is  set 
for  each  frame.  Then,  the  match  strengths  of  all  the 
corners  in  the  window  are  computed  with  respect  to  all 
the  S  corners  in  the  trajectory.  The  match  strengths  are 
averaged  and  the  best  corner  is  selected  as  the  one  that 
is to be linked to the trajectory. Slight deviations of the
properties of the S corners are smoothed in the averaging
and thus the chance of a wrong match decreases.
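
The window search described above can be sketched as follows. Everything named here is hypothetical: corners are dictionaries with a `position` entry, the window is assumed to be square, and the match strength is supplied by the caller because the simplified matching function is not given in closed form.

```python
def best_corner_in_window(predicted, candidates, selected_corners,
                          match_strength, half_width):
    """Return the candidate inside the window with the highest average
    match strength against all S corners, together with that score.

    predicted: (x, y) position predicted from the estimated motion.
    candidates: corners detected in the frame, each a dict with "position".
    selected_corners: the S corners of the trajectory being extended.
    match_strength: caller-supplied function (corner, s_corner) -> float.
    """
    if not selected_corners:
        raise ValueError("the trajectory has no selected corners")
    best, best_score = None, float("-inf")
    for corner in candidates:
        x, y = corner["position"]
        if abs(x - predicted[0]) > half_width or abs(y - predicted[1]) > half_width:
            continue  # outside the search window
        score = sum(match_strength(corner, s) for s in selected_corners)
        score /= len(selected_corners)
        if score > best_score:
            best, best_score = corner, score
    return best, best_score
```

Averaging over all S corners rather than comparing against a single one is what damps slight deviations in the corner properties.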

A simplified version of the matching function used
in the initial corner match is used for the computation
of the match strength. Only the internal properties
of corners are used in the computation and the
relations among corners are not considered. Local properties
such as INSIDE-ANGLE, IN-DIRECTION and
OUT-DIRECTION are given high weighting since the
consistency of global properties is usually broken when a region
match gets fragmented.

3.2.1  Properties  of  Linking 

Let $\Pi = \{P_1, P_2, P_3, \ldots, P_m\}$ be the set of point
trajectories. Under ideal conditions, $\Pi$ is partitioned into
$\Theta = \{T_1, T_2, T_3, \ldots, T_n\}$, where each $T_i$, $1 \le i \le n$, satisfies
the following properties.

Property 1. Transitivity: If trajectory A is linked
with  trajectory  B  and  B  is  linked  with  trajectory  C,  then 
A  is  linked  with  C.  (A  and  C  should  not  share  common 
frames.) 

Property  2.  Symmetry:  If  the  extension  of  trajec¬ 
tory  A  is  linked  with  trajectory  B,  then  the  extension  of 
B  also  is  linked  with  A. 

In most cases, the extensions of disconnected corner
matches meet both properties. However, when there are
nearby objects with similar patterns, complicated links
occur which do not satisfy the properties. We do
not use these links in motion estimation.

3.2.2  Refinement  of  Linked  Matches 

Linked correspondences consist of trajectories composed
of S points and linked corners. Linked correspondences
are refined in the same way as the initial
correspondence to delete spurious points in the linking. Usually
one refinement process is enough since the quality
of linked data is much better than the initial data.


4  Results 

The algorithm has been tested on standard sets of real
image sequences. We present the result for the Rocket
Field sequence provided by Dutta at UMASS [Dutta et
al., 1989]. Of the 30 frames of the sequence, we used the
first 15 frames.

Typical  region  and  corner  matches  are  shown  in  fig¬ 
ure  6.  The  dots  mark  the  matched  corner.  In  figure  6 
(a),  the  object  is  a  muddy  area  on  terrain.  The  shapes 
for  some  of  the  frames  are  quite  different  from  the  rest 
and  the  positions  of  the  corners  are  rather  random.  Still, 
part  of  the  sequence  of  corner  matching  is  usable,  which 
forms  the  best  cluster  in  the  refinement  process.  In  fig¬ 
ure  6  (b),  the  corner  is  on  the  lower  right  of  the  build¬ 
ing.  Corner  location  error  is  hardly  expected  on  this 
very  clear  boundary.  However,  the  irregular  motion  of 
the  vehicle  caused  the  perturbation  of  the  trajectory. 

The  initial  and  improved  corner  correspondence  data 
used  in  motion  estimation,  superimposed  on  the  image  of 
the  first  frame  of  the  sequence,  are  shown  in  figure  7.  The 
start  frame  and  end  frame  for  those  corner  matches  are 
arbitrary  and  so  the  end  points  of  each  trajectory  may 
not  point  to  any  physical  object  on  the  underlying  image. 
Figure  7  (a)  represents  the  initial  noisy  correspondence. 
The  black  dot  of  a  trajectory  indicates  the  start  position. 
The trajectories are noisy and some of them move in
random directions. Their refinement is shown in figure 7 (b),
where  most  of  the  noisy  parts  are  deleted.  The  refined 
data  are  linked  in  figure  7  (c)  and  refined  again  as  in 
figure  7  (d). 

Figure  8  shows  the  reconstructed  trajectories  and  the 
top  view  of  the  objects  for  which  the  ground  truth  is 
provided.  The  initial  data  are  so  noisy  that  some  of  the 
reconstructed  trajectories  are  far  from  the  real  ones  as 
shown  in  figure  8  (a),  and  the  estimated  depths  of  objects 
are  unreliable  as  shown  in  figure  8  (b).  The  results  from 
final  linked  and  refined  correspondence  data  are  shown 
in figures 8 (c) and (d). The reconstructed trajectories
are very close to the real ones and the reliability of the
estimated depths is enhanced.

Table  1  is  a  list  of  the  estimation  error  with  reference 
to  the  ground  truths  when  the  final  correspondence  data 
are  used.  Most  of  them  lie  within  20  %.  The  estimated 
motion  parameters  for  the  final  correspondence  data  are 
also  shown  in  table  2. 

5  Conclusion 

In  this  paper,  we  presented  a  feedback  approach  to  im¬ 
prove  the  matching  of  features  in  the  domain  of  feature- 
based  motion  analysis  for  multiple  frames.  We  applied 
this  approach  in  a  motion  analysis  system  that  is  built 
on  hierarchical  feature  extraction  and  matching.  We  also 
described  the  sources  and  effects  of  errors  in  correspon¬ 
dence  data  found  in  the  real  image  data.  We  showed 
that  initial  noisy  correspondence  data  are  gradually  re¬ 
fined  and  fragments  of  a  single  feature  are  linked  into 
one  trajectory  in  a  feedback  loop  where  the  feedback 
from 3-D motion estimation guides the prior stages.

We have applied this method to real image sequences
from a test data set and showed that the motion and structure
estimation is improved due to the improved correspondence.

Figure 6: Typical region and corner matches. The numbers
near the dots in the right figures are the frame numbers.
Point   Depth (true)   Depth (estimated)   % Error
    1       43.91            38.16          -13.08
   14       49.32            48.92           -0.81
   54       28.58            28.10           -1.69
   91       35.50            43.16           21.56
  130       27.11            24.84           -8.39
  153       19.67            19.54           -0.70
  157       56.42            68.78           21.91

Table 1: Reconstructed Structure Compared to Ground
Truth Values for the Rocket Field Sequence
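The error column of Table 1 can be recomputed from the two depth columns. The sketch below assumes that the first depth value in each row is the ground truth and the second is the estimate; this reading is an inference from the arithmetic, not something the table header states. The function name `percent_error` is illustrative.

```python
# Rows of Table 1: (point id, assumed true depth, assumed estimated depth,
# tabulated % error).
rows = [
    (1, 43.91, 38.16, -13.08),
    (14, 49.32, 48.92, -0.81),
    (54, 28.58, 28.10, -1.69),
    (91, 35.50, 43.16, 21.56),
    (130, 27.11, 24.84, -8.39),
    (153, 19.67, 19.54, -0.70),
    (157, 56.42, 68.78, 21.91),
]


def percent_error(reference, estimate):
    """Signed error of the estimate relative to the reference, in percent."""
    return 100.0 * (estimate - reference) / reference


# Largest disagreement between the recomputed and the tabulated error.
max_gap = max(abs(percent_error(t, e) - tab) for _, t, e, tab in rows)

# Number of points whose recomputed error magnitude is within 20%.
within_20 = sum(1 for _, t, e, _ in rows if abs(percent_error(t, e)) <= 20.0)

for point, t, e, tab in rows:
    print(point, round(percent_error(t, e), 2), tab)
print(max_gap, within_20)
```

Under this reading the recomputed errors agree with the tabulated ones to within rounding of the printed depths, and five of the seven points lie within 20%.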


Parameter                      x            y            z
Translation               3.7858      -3.3910       4.0264
Acceleration              0.0171       0.0130
Rotation (in degrees)     0.2841       0.2170
Rot-Center             -954.6281    -409.9108     511.4143

Table 2: Estimate of Motion Parameters for the Rocket
Field Sequence at the Final Frame
Abstract

We  summarize  the  work  presented  in  our  dis¬ 
sertation.  This  includes  contributions  to  the 
following  areas  of  time  varying  image  anal¬ 
ysis:  motion  representation,  structure  from 
known  motion  (SFKM),  structure  from  mo¬ 
tion  (SFM),  and  motion  parameter  estima¬ 
tion  (MPE).  Our  research  deals  with  multi¬ 
frame  monocular  image  sequences  under  per¬ 
spective  projection,  and  is  feature  based.  We 
present  a  general  methodology  for  develop¬ 
ing  efficient  algorithms  for  the  SFKM  and 
SFM/MPE  problems  when  motion  is  affine. 

The  resulting  algorithms  treat  points,  lines, 
and  occluded  features  in  a  uniform  manner. 
Straightforward  parallel  versions  of  these  al¬ 
gorithms  may  be  implemented. 

We  introduce  a  class  of  motion  that  we  call 
chronogeneous  motion,  and  its  associated  ma¬ 
trix  representation.  Serial  SFM/MPE  algo¬ 
rithms  have  been  implemented  for  several  spe¬ 
cific  subclasses  of  affine  motion.  Among  these 
is  a  very  robust  algorithm  for  the  class  of  rigid 
chronogeneous  motion. 

1  Introduction 

Motion  analysis  is  an  important  area  of  research  within 
the field of computer vision that deals with sequences of
two  or  more  images.  Some  of  the  goals  of  motion  anal¬ 
ysis  processing  are  to  determine  the  coherently  moving 
objects  in  the  environment,  to  determine  the  shape  of 
each  object,  and  to  determine  the  motion  of  the  camera 
and/or  each  object.  Our  work  deals  specifically  with  the 
following  motion  analysis  problems: 

•  Motion  Parameter  Estimation  (MPE)  Problem: 
Given  a  model  of  permissible  motion,  where  per¬ 
ceived  motion  is  modeled  as  being  a  projection  of 
motion  in  3-space,  estimate  the  motion  parameters 
for  a  camera  or  for  a  set  of  tokens  that  share  a 
common  motion. 

*This research was supported by the Advanced Research
Projects  Agency  of  the  Department  of  Defense  and  was  mon¬ 
itored  by  the  Air  Force  Office  of  Scientific  Research  under 
Contract  No.  F49620-90-C-0078.  The  United  States  Govern¬ 
ment  is  authorized  to  reproduce  and  distribute  reprints  for 
governmental  purposes  notwithstanding  any  copyright  nota¬ 
tion  hereon. 


•  Structure  from  Known  Motion  (SFKM)  Problem: 
Given  a  sequence  of  two  or  more  images,  and  sets 
of  corresponding  tokens  that  share  a  common  mo¬ 
tion,  where  the  motion  of  these  tokens  is  known, 
determine  the  geometric  relationships  between  to¬ 
kens. 

•  Structure  from  Motion  (SFM)  Problem:  Given  a 
sequence  of  two  or  more  images,  and  sets  of  corre¬ 
sponding  tokens  that  share  a  common  motion,  de¬ 
termine  the  geometric  relationships  between  tokens. 

•  Depth  Determination  Problem:  Given  a  sequence 
of  two  or  more  images,  and  sets  of  corresponding 
tokens  that  share  a  common  motion,  determine  the 
3D  position  of  each  token  relative  to  the  viewer,  for 
each  frame. 

The  depth  determination  and  SFM  problems  are 
closely  related,  and  often  authors  do  not  distinguish  be¬ 
tween  the  two.  We  have  actually  solved  the  depth  de¬ 
termination  problem  although  we  loosely  claim  to  have 
solved  the  SFM  problem. 

The SFM and MPE problems are very important research
topics in the area of time-varying image analysis.
Efficient  and  robust  solutions  to  these  two  problems  have 
important  robotics  and  automated  vehicle  applications. 
A  solution  to  these  problems  allows  the  (scaled)  struc¬ 
ture  and  motion  of  moving  objects  to  be  recovered.  Al¬ 
ternatively,  the  egomotion  of  a  camera  (and  hence  a  vehi¬ 
cle)  can  be  determined.  Furthermore,  a  (sparse)  environ¬ 
mental  depth  (or  time  to  collision)  map  can  be  recovered. 
In  addition,  future  feature  image  plane  positions  may  be 
predicted,  possibly  aiding  the  correspondence  process. 

The  SFKM  problem  may  be  viewed  as  a  generaliza¬ 
tion  of  the  stereo  problem  to  multiple  frames,  where  the 
interframe  transformations  are  known.  A  solution  to  this 
problem  is  relevant  to  trinocular  and  slider  stereo,  and  to 
attempts  to  determine  object  structure  by  moving  a  cam¬ 
era  (or  object)  through  a  controlled  sequence  of  poses. 
It  is  also  relevant  to  passive  ranging.  In  some  cases  cam¬ 
era  motion  can  be  determined  quite  accurately  by  means 
of  inertial  navigation  sensors.  When  camera  motion  is 
accurately  known,  a  solution  to  the  SFKM  problem  al¬ 
lows  better  environmental  depth  estimates  for  stationary 
objects  than  when  motion  parameters  and  structure  are 
estimated  jointly. 

The  use  of  multiple  (as  opposed  to  two)  frames  is  im¬ 
portant  for  several  reasons: 

•  the  solution  process  tends  to  be  more  robust, 
•  structure/motion can be recovered with fewer features
being tracked, and

•  higher  order  derivatives  of  motion  can  be  estimated. 

For these reasons, the extensive work on two-frame structure
from motion is not directly applicable. We want to
develop algorithms that can recover motion when only a
very small number of features are visible, whereas two-frame
linear methods (such as [Weng et al., 1989]) require at
least 8 correspondences, and preferably more.

2  Assumptions 

We  present  solutions  to  the  SFKM  and  SFM/MPE  prob¬ 
lems  under  a  multiframe  feature  based  paradigm.  Our 
work  deals  specifically  with  monocular  image  sequences. 
We  have  solved  these  problems  under  the  following  con¬ 
straints  and/or  assumptions: 

•  The  input  consists  of  a  set  of  trajectories,  where 
each  trajectory  gives  the  image  plane  position  of  a 
feature  in  each  of  several  frames. 

•  Associated  with  each  feature  in  each  frame  is  an 
(externally  supplied)  inverse  covariance  matrix  that 
determines  the  uncertainty  in  the  image  plane  po¬ 
sition  of  the  corresponding  feature.  If  this  matrix 
is  the  zero  matrix,  then  the  feature  is  “occluded”  in 
the  corresponding  frame.  If  this  matrix  is  singular, 
then  the  feature  is  a  line  feature. 

•  A  central  projection  (perspective)  imaging  model  is 
assumed. 

•  The focal length and other camera calibration parameters
are assumed to be known constants.^1

•  It  is  assumed  that  a  stationary  camera  is  imaging 
a  moving  object.  However,  it  is  shown  how  the 
results  also  apply  for  a  camera  moving  through  a 
static  environment. 

•  All  features  are  assumed  to  be  undergoing  the  same 
motion. 

•  For  our  methodology  to  be  applicable,  3D  feature 
position  must  be  transformed  in  an  affine  manner. 
However,  the  assumed  class  of  motion  may  be  any 
subclass  of  affine  motion. 

Specifically,  object  structure  is  not  restricted  in  any  way. 
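The convention above, in which the inverse covariance matrix itself signals whether a feature is a point, a line, or occluded, can be illustrated with a short sketch. The function name `classify_feature`, the tolerance, and the way the example line weight is built from a unit normal are assumptions made for illustration; they are not taken from the implementation described here.

```python
import math


def classify_feature(inv_cov, rel_tol=1e-9):
    """Classify a feature observation from its 2x2 inverse covariance.

    zero matrix      -> "occluded" (the observation carries no weight)
    singular matrix  -> "line"     (position constrained in one direction)
    otherwise        -> "point"
    """
    (a, b), (c, d) = inv_cov
    if a == 0 and b == 0 and c == 0 and d == 0:
        return "occluded"
    det = a * d - b * c
    trace = a + d
    # Compare the determinant to the squared trace so that the test does
    # not depend on the overall scale of the weights.
    if abs(det) <= rel_tol * trace * trace:
        return "line"
    return "point"


sigma = 0.5  # hypothetical standard deviation, in pixels

# Point feature: isotropic uncertainty.
point_weight = [[1 / sigma**2, 0.0], [0.0, 1 / sigma**2]]

# Line feature: only the component along the line normal n is measured,
# so the weight is n n^T / sigma^2, which has rank one.
theta = math.radians(30.0)
n = (math.cos(theta), math.sin(theta))
line_weight = [[n[0] * n[0] / sigma**2, n[0] * n[1] / sigma**2],
               [n[1] * n[0] / sigma**2, n[1] * n[1] / sigma**2]]

occluded_weight = [[0.0, 0.0], [0.0, 0.0]]

for w in (point_weight, line_weight, occluded_weight):
    print(classify_feature(w))
```

One apparent benefit of this encoding is that all three cases can flow through the same weighted error terms, with occluded observations contributing nothing.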

3  The  Methodology 

Our general approach is as follows. The SFKM and
SFM/MPE problems are formulated as function minimization
problems. Through proper selection of the
function to be minimized, it is possible to determine the
structural unknowns in closed form in terms of the motion
parameters. This results in a large reduction in the
number of unknowns, and leads to a class of algorithms
with very low time complexity. Furthermore, the computations
for different features are similar, largely independent
of one another, and interact mainly through the motion
parameters. Therefore, potentially large speedups
can be achieved through parallel processing.

^1 In the SFM/MPE case, the focal length need not be
known accurately, as in the absence of certain additional
information, the absolute scale cannot be determined anyway.
A rough estimate of the focal length is satisfactory, provided
the focal length is constant. In addition, our methodology
could easily be extended to the case where the focal length
varies, provided the ratios of the focal lengths for the various
frames were known accurately.

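We now discuss our methodology in more detail. First,
let us introduce some notation.^2 Let

$n_t$ be the total number of features being tracked (indexed
$0$ through $n_t - 1$),

$n_f$ be the number of frames (indexed $-(n_f - 1)$ through
$0$),^3

$Q^i_j = (X^i_j, Y^i_j, Z^i_j)^T$ be the spatial (3D) position of
the $j$th feature in the $i$th frame,

$q^i_j = (x^i_j, y^i_j)^T = (X^i_j / Z^i_j,\ Y^i_j / Z^i_j)^T$ be the image plane
coordinates corresponding to $Q^i_j$,

$\hat{q}^i_j = (\hat{x}^i_j, \hat{y}^i_j)^T$ be the actually measured image
plane coordinates of the $j$th feature in the $i$th frame,

$S^i_j$ be the (symmetric) $(2 \times 2)$ covariance matrix associated
with the image plane location uncertainty
of the $j$th feature in the $i$th frame, and let $\Lambda^i_j = (S^i_j)^{-1}$
be its inverse, and let

$F^i_j$ be the symmetric $(3 \times 3)$ matrix

$$F^i_j = \begin{pmatrix} 1 & 0 & -\hat{x}^i_j \\ 0 & 1 & -\hat{y}^i_j \end{pmatrix}^T \Lambda^i_j \begin{pmatrix} 1 & 0 & -\hat{x}^i_j \\ 0 & 1 & -\hat{y}^i_j \end{pmatrix}.$$

Using the preceding notation, the image plane error
norm is defined as follows:

$$f_{opt} = \sum_{i=-(n_f - 1)}^{0} \sum_{j=0}^{n_t - 1} (q^i_j - \hat{q}^i_j)^T \Lambda^i_j (q^i_j - \hat{q}^i_j) \qquad (1)$$

This norm may also be written as

$$f_{opt} = \sum_{i=-(n_f - 1)}^{0} \sum_{j=0}^{n_t - 1} \frac{1}{(Z^i_j)^2} (Q^i_j)^T F^i_j Q^i_j \qquad (2)$$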
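The equivalence of the two forms of the image plane error norm can be checked numerically for a single observation. The sketch below assumes unit focal length and reads the norm as the inverse-covariance-weighted squared distance between the projected and the measured image position; the function names and the sample values are illustrative assumptions, not the paper's code.

```python
def quad_form(v, m):
    """Return v^T m v for a vector v and a square matrix m."""
    n = len(v)
    return sum(v[r] * m[r][c] * v[c] for r in range(n) for c in range(n))


def f_matrix(measured, weight):
    """3x3 matrix P^T W P with P = [[1, 0, -x], [0, 1, -y]].

    (x, y) is the measured image position and W the 2x2 inverse
    covariance of that measurement.
    """
    x, y = measured
    p = [[1.0, 0.0, -x], [0.0, 1.0, -y]]
    return [[sum(p[k][r] * weight[k][l] * p[l][c]
                 for k in range(2) for l in range(2))
             for c in range(3)] for r in range(3)]


def error_direct(q3d, measured, weight):
    """Weighted squared distance between projection and measurement."""
    x = q3d[0] / q3d[2]
    y = q3d[1] / q3d[2]
    d = (x - measured[0], y - measured[1])
    return quad_form(d, weight)


def error_via_f(q3d, measured, weight):
    """Same error written as Q^T F Q divided by the squared depth."""
    return quad_form(q3d, f_matrix(measured, weight)) / q3d[2] ** 2


q3d = (1.0, -2.0, 4.0)              # hypothetical 3D position
measured = (0.3, -0.45)             # hypothetical measured image position
weight = [[2.0, 0.5], [0.5, 1.0]]   # hypothetical inverse covariance

print(error_direct(q3d, measured, weight), error_via_f(q3d, measured, weight))
```

The division by the squared depth is what makes the exact norm non-quadratic in the 3D position, which is the motivation for the approximations that follow.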


Ideally,  the  image  plane  error  norm  is  the  function 
that  should  be  minimized.  However,  this  leads  to  non¬ 
linear  optimization  problems  in  large  numbers  of  un¬ 
knowns,  with  certain  attendant  difficulties.  Ando  [Ando, 
1991]  has  recently  pointed  out  that  this  norm  is 
quadratic  in  the  x  and  y  components  of  3D  position,  and 
therefore  these  components  can  be  computed  in  closed 
form  in  terms  of  the  motion  parameters  and  unknown 
z-components.  This  still  leaves  considerable  nonlinear¬ 
ity,  a  potentially  large  (although  significantly  reduced) 
number  of  unknowns,  and  the  problem  of  providing  good 
initial  guesses  for  all  the  parameters. 

^2 In the following, for simplicity, the focal length is taken
to be unity.

^3 Under this indexing scheme, the final frame of the image
sequence has an index of 0, and all preceding frames have a
negative index.

We have tried an alternate approach. The image
plane error norm is replaced by one of the following two
approximations.^4 We refer to the first of these approximations
as the pseudo-perspective norm^5:

$$f_{pseudo} = \sum_{i=-(n_f - 1)}^{0} \sum_{j=0}^{n_t - 1} \frac{1}{(Z^0_j)^2} (Q^i_j)^T F^i_j Q^i_j = \sum_{j=0}^{n_t - 1} \frac{1}{(Z^0_j)^2} \sum_{i=-(n_f - 1)}^{0} (Q^i_j)^T F^i_j Q^i_j \qquad (3)$$

We refer to the second as the quadratic norm:

$$f_{quad} = \sum_{i=-(n_f - 1)}^{0} \sum_{j=0}^{n_t - 1} (Q^i_j)^T F^i_j Q^i_j \qquad (4)$$

Both  these  norms  have  the  property  that  they  are 
quadratic  functionals  of  the  structural  unknowns.  For 
the  pseudo-perspective  norm,  the  structural  unknowns 
are  the  estimates  of  the  refined  image  plane  location  and 
inverse depth (inverse z-component) of each feature in
the  zeroth  frame.  For  the  quadratic  norm,  the  structural 
unknowns  are  merely  the  estimates  of  the  3D  position  of 
each  feature  in  the  zeroth  frame.  In  addition,  there  are 
certain  linear  constraints,  that  we  do  not  discuss  further 
at  this  point  (see  [Franzen,  1991b]). 

Therefore,  for  either  of  these  norms,  the  SFKM  prob¬ 
lem  may  be  solved  in  closed  form,  as  the  problem  to 
be  solved  is  a  quadratic  programming  problem  subject 
to linear constraints.^6 Using this result, it is possible
to  develop  iterative  algorithms  to  solve  the  SFM/MPE 
problem  for  any  given  subclass  of  affine  motion,  with 
only  the  motion  parameters  as  unknowns.  This  decou¬ 
pling  of  the  estimation  of  the  motion  parameters  and  the 
structure  results  in  a  large  reduction  in  the  number  of 
unknowns.  Cui,  Weng,  and  Cohen  [Cui  et  al.,  1990],  who 
approach  the  problem  differently,  achieve  a  similar  de¬ 
coupling.  Although,  in  general,  the  resulting  SFM/MPE 
problem  is  nonlinear  in  the  motion  parameters,  a  closed 
form  solution  is  possible  for  certain  specific  classes  of 
motion. 
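To make the closed-form SFKM step concrete, here is a minimal sketch for a single point feature under the quadratic norm. It assumes the known motion is supplied per frame as an affine map Q_i = A_i Q + b_i from a reference frame, unit focal length, and a cost of the form sum_i Q_i^T F_i Q_i with F_i = P_i^T W_i P_i built from the measured image position. Setting the gradient to zero gives the 3x3 linear system (sum_i A_i^T F_i A_i) Q = -(sum_i A_i^T F_i b_i). The bias correction and the linear constraints mentioned in the text are omitted, and all function names are hypothetical.

```python
import math


def transpose(m):
    return [list(col) for col in zip(*m)]


def mat_mul(a, b):
    return [[sum(a[r][k] * b[k][c] for k in range(len(b)))
             for c in range(len(b[0]))] for r in range(len(a))]


def mat_vec(m, v):
    return [sum(m[r][c] * v[c] for c in range(len(v))) for r in range(len(m))]


def f_matrix(measured, weight):
    """3x3 matrix P^T W P with P = [[1, 0, -x], [0, 1, -y]]."""
    x, y = measured
    p = [[1.0, 0.0, -x], [0.0, 1.0, -y]]
    return mat_mul(mat_mul(transpose(p), weight), p)


def solve3(h, rhs):
    """Solve a 3x3 linear system by Gaussian elimination with pivoting."""
    m = [row[:] + [value] for row, value in zip(h, rhs)]
    for col in range(3):
        pivot = max(range(col, 3), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, 3):
            factor = m[r][col] / m[col][col]
            for c in range(col, 4):
                m[r][c] -= factor * m[col][c]
    x = [0.0, 0.0, 0.0]
    for r in reversed(range(3)):
        x[r] = (m[r][3] - sum(m[r][c] * x[c] for c in range(r + 1, 3))) / m[r][r]
    return x


def sfkm_point(frames):
    """Closed-form reference-frame position from known affine motion.

    frames: list of (A, b, measured, weight) tuples, one per frame.
    """
    h = [[0.0] * 3 for _ in range(3)]
    g = [0.0] * 3
    for a, b, measured, weight in frames:
        at_f = mat_mul(transpose(a), f_matrix(measured, weight))
        h_i = mat_mul(at_f, a)
        g_i = mat_vec(at_f, b)
        for r in range(3):
            g[r] += g_i[r]
            for c in range(3):
                h[r][c] += h_i[r][c]
    return solve3(h, [-value for value in g])


def rotation_y(angle):
    c, s = math.cos(angle), math.sin(angle)
    return [[c, 0.0, s], [0.0, 1.0, 0.0], [-s, 0.0, c]]


# Noise-free synthetic test: project a known point under a known motion.
true_q = [0.5, -0.2, 5.0]
identity_weight = [[1.0, 0.0], [0.0, 1.0]]
frames = []
for i in range(4):
    a = rotation_y(0.05 * i)
    b = [0.4 * i, 0.1 * i, -0.2 * i]
    q_i = [p + t for p, t in zip(mat_vec(a, true_q), b)]
    frames.append((a, b, (q_i[0] / q_i[2], q_i[1] / q_i[2]), identity_weight))

estimate = sfkm_point(frames)
print(estimate)
```

The work per feature is one pass over its observations followed by a fixed-size solve, which matches the claim that the SFKM cost grows in proportion to the number of correspondences, and the features can be processed independently.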

Algorithms  developed  using  this  methodology  have  a 
very  low  time  complexity.  The  time  to  solve  the  SFKM 
problem  is  proportional  to  the  number  of  correspon¬ 
dences.  For  cases  where  the  SFM/MPE  problem  can  be 
solved  in  closed  form,  the  computational  complexity  is  of 
the  same  order.  For  iterative  solutions  to  the  SFM/MPE 
problem,  the  time  for  each  iteration  is  proportional  to 
the  number  of  correspondences.  When  the  number  of 
motion  parameters  is  fixed,  independent  of  the  number 
of  frames,  the  number  of  iterations  is  bounded  in  prac¬ 
tice,  and  therefore  the  total  time  is  again  proportional 
to  the  number  of  correspondences. 


^4 In [Franzen, 1991b, section 5.1.3], two pairs of approximations
are introduced, where the members of each pair are
intimately related.

^5 One may choose any particular frame as the one from
which the replicated (inverse) z weights are used. In practice,
we choose the middle frame of the image sequence rather than
the zeroth.

^6 The motion must be affine in order for this result to go
through.


4  Are  Solutions  Significantly  Biased? 

A  valid  concern  is  that  significantly  biased  solutions 
might  occur  as  a  result  of  minimizing  these  alternate 
norms.  Based  on  real  data  and  extensive  simulation  re¬ 
sults,  the  following  claims/observations  may  be  made. 

First, if the quadratic error norm is minimized directly,
the resulting solutions are, indeed, generally unsatisfactory
and may have significant bias. This was
the case for motion in the class of uniform 3D acceleration.
In order to improve the quality of the solutions,
a principled bias correction scheme was devised
for both the quadratic and the pseudo-perspective norms
(see [Franzen, 1991b, sections 4.4 and 5.1.2] for details).
For motion in the class of uniform relative 3D acceleration,
the resulting improvement was quite substantial.
The bias correction consists of adding some terms that
are quadratic functions of the structural unknowns. In
the presence of noise, these added terms generally cause
the minima of these alternate norms to more closely coincide
with those of the image plane error norm.

Second,  the  bias-corrected  pseudo-perspective  norm 
performs  optimally  and  no  significant  improvement 
could  be  achieved  by  minimizing  the  image  plane  error 
norm  instead.  This  claim  is  based  on  a  statistical  anal¬ 
ysis  of  data  from  synthetic  test  cases  for  chronogeneous 
motion  (see  below).  Actual  residuals  were  in  excellent 
agreement  with  theoretically  predicted  residuals. 

Third,  some  of  our  undocumented  synthetic  experi¬ 
ments  indicate  that  when  the  image  plane  error  norm 
is  minimized,  there  is  a  tendency  to  get  trapped  in  lo¬ 
cal  minima.  Although  anecdotal,  our  experience  seems 
to  indicate  that  energy  surfaces  for  this  norm  tend  to 
be  somewhat  bumpy  near  global  minima.  This  prob¬ 
lem  does  not  seem  to  be  shared  by  the  alternate  norms. 
Because these alternate error norms are quadratic functionals
in the structural unknowns, the resulting error
surfaces seem to be better behaved.

5  Chronogeneous  Motion 

Chronogeneous motion was introduced in [Franzen,
1988; Franzen, 1989; Franzen, 1991b]. This class of motion
has 15 degrees of freedom in the general case, and

degrees of freedom in the rigid case. Rigid chronogeneous
motion includes uniform acceleration and constant
angular velocity rotation and translation as special
cases.

Chronogeneous motion has the following important
properties:

•  affine transformation of coordinate space - this
permits application of our more general results
to derive efficient algorithms for the SFKM and
SFM/MPE problems for this class of motion.

•  matrix representation - this allows computations to
be expressed in a succinct manner, namely using a
(5 x 5) transformation matrix.

•  unique representation - corresponding to each
chronogeneous motion there is exactly one chronogeneous
transformation matrix and vice versa.

•  fixed number of motion parameters independent of
the number of frames - this enhances the stability
of the recovered solutions.

•  interesting class of motion - chronogeneous motion
is sufficiently general to model commonly occurring
types of motion, yet not so general as to require an
inordinate number of parameters to be determined.

As  homogeneous  coordinates  are  transformed  by  ho¬ 
mogeneous  matrices,  so  also  are  chronogeneous  coordi¬ 
nates  transformed  by  chronogeneous  matrices.  Time  is 
the  additional  (fourth)  component  of  these  5  dimensional 
vectors. 
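To illustrate why carrying time as an extra vector component is useful, the sketch below builds a (5 x 5) matrix acting on vectors of the form (x, y, z, t, 1) that advances a uniformly accelerating point by one frame. This is only an illustration of the idea under the stated assumptions; the exact chronogeneous parameterization is defined in the cited references and may differ. The function names `uniform_acceleration_step` and `apply` are hypothetical.

```python
def uniform_acceleration_step(v0, a):
    """5x5 matrix advancing (x, y, z, t, 1) by one frame.

    For x(t) = x0 + v0 t + a t^2 / 2 we have
    x(t + 1) = x(t) + a t + (v0 + a / 2), which is affine in (x, t, 1).
    """
    m = [[1.0 if r == c else 0.0 for c in range(5)] for r in range(5)]
    for k in range(3):
        m[k][3] = a[k]                  # term a * t
        m[k][4] = v0[k] + 0.5 * a[k]    # term v0 + a / 2
    m[3][4] = 1.0                       # t -> t + 1
    return m


def apply(m, state):
    return [sum(m[r][c] * state[c] for c in range(5)) for r in range(5)]


x0 = (1.0, 2.0, 10.0)      # hypothetical initial position
v0 = (0.5, 0.0, -0.3)      # hypothetical initial velocity per frame
a = (0.02, -0.01, 0.0)     # hypothetical acceleration per frame squared

step = uniform_acceleration_step(v0, a)
state = [*x0, 0.0, 1.0]    # start at t = 0
for _ in range(6):
    state = apply(step, state)

# Closed-form position after 6 frames for comparison.
expected = [x0[k] + v0[k] * 6 + 0.5 * a[k] * 36 for k in range(3)]
print(state[:3], expected)
```

In this construction the same fixed matrix is applied at every frame, so the number of motion parameters does not grow with the length of the sequence.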

In [Franzen, 1988; Franzen, 1989; Franzen, 1991b],
the properties of the chronogeneous matrix representation
are investigated at some length. For instance, it
is shown how to straightforwardly compute motion parameters
such as axis and rate of rotation, center of
rotation, velocity, and acceleration magnitude^7 given a
(rigid) chronogeneous transformation matrix, and vice
versa.

6  Particular  Implementations 

We  have  devised  solutions  to  the  SFM/MPE  problem 
for  several  different  subclasses  of  affine  motion,  namely 
uniform  rotation  about  the  optical  center  of  the  cam¬ 
era,  uniform  relative  3D  acceleration,  and  rigid  chrono¬ 
geneous  motion.  A  computer  program  was  written  that 
implements  these  methods  in  their  full  generality,  except 
that  (true)  line  features  are  not  currently  supported. 

6.1  Chronogeneous motion

We  have  developed  an  iterative  algorithm  to  recover 
structure  and  motion  parameters,  when  the  motion  is 
rigid  chronogeneous.  The  rigid  chronogeneous  SFM  al¬ 
gorithm  is  extremely  robust,  and  almost  always  finds  the 
“correct”  solution,  or  one  with  a  residual  that  is  just  as 
good.  This  is  in  large  part  due  to  our  scheme  for  generat¬ 
ing  multiple  (eight)  initial  guesses,  which  almost  always 
results  in  the  global  minimum  of  the  objective  function 
being  found. 
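The role of multiple initial guesses can be illustrated on a toy problem. The sketch below is not the rigid chronogeneous SFM algorithm: it uses a hypothetical one-dimensional objective with two minima and plain gradient descent, runs the descent from several starting points, and keeps the solution with the smallest residual.

```python
def objective(x):
    """Hypothetical residual with a global and a nonglobal minimum."""
    return (x * x - 1.0) ** 2 + 0.3 * x


def gradient(x):
    return 4.0 * x * (x * x - 1.0) + 0.3


def descend(x, step=0.01, iterations=5000):
    """Fixed-step gradient descent from a single initial guess."""
    for _ in range(iterations):
        x -= step * gradient(x)
    return x


def multi_start(initial_guesses):
    """Run the local search from every guess and keep the best result."""
    solutions = [descend(x0) for x0 in initial_guesses]
    return min(solutions, key=objective), solutions


best, solutions = multi_start([-2.0, -0.5, 0.5, 2.0])
print(best, objective(best))
print(sorted(round(s, 4) for s in solutions))
```

Starting points on the positive side settle in the nonglobal minimum near x = 1, while those on the negative side reach the global minimum near x = -1; comparing residuals across the runs selects the latter.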

Extensive  simulations  were  performed  on  the  rigid 
chronogeneous  SFM  algorithm,  as  well  as  some  tests  on 
real  data.  Synthetic  tests  investigated  algorithm  perfor¬ 
mance  as  a  function  of  number  of  frames,  structure  and 
type  of  object  being  imaged,  object  size,  and  to  a  lesser 
extent  resolution  and  noise.  We  can  only  give  a  very 
brief  summary  of  these  results  here.  For  the  full  test 
results  see  [Franzen,  1991b,  Chapter  6]. 

6.1.1  Synthetic  test  results 

For  the  experiments  in  which  the  number  of  frames 
was  varied,  the  actual  extent  of  motion  was  held  con¬ 
stant  in  each  test  case,  but  sampling  occurred  at  a  cor¬ 
respondingly  higher  rate  when  the  number  of  frames  was 
increased.  There  was  a  very  significant  improvement  in 
the  accuracy  of  solutions  when  the  number  of  frames  was 
increased  from  3  to  7,  a  small  but  significant  improve¬ 
ment  from  7  to  11  frames,  and  very  little  improvement 
from  11  to  15  frames. 

Certain experiments showed that when the imaged
structure was planar there were alternate valid interpretations.
This result was not unexpected. Variations in
environmental depth of roughly 10% to 15% were required
to eliminate the ambiguity. In general, it was
found that in order to obtain good and unambiguous
reconstructions, a large field of view and adequate deviation
from planarity were much more important than
imaging large numbers of points per se. In addition,
even for certain nonplanar configurations, if the motion
in depth was less than 10% to 15% over the sequence,
there was some possibility for ambiguity.

^7 We refer to these as the external motion parameters.

External motion parameters for final frame:
spin (deg/frame): ( -9.857159, 2.648302, 10.779173)
center of rot: ( 1.288754, -0.750319, 1.362862)
velocity: ( -0.192815, -0.042104, -0.087495)
acceleration: ( -0.001222, 0.000328, 0.001337)
signed acceleration magnitude: 0.001841

Table 1: Computed External Motion Parameters for the
Rolling Tire Sequence
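Two of the figures quoted for the Rolling Tire sequence in section 6.1.2 (an angle of 74.7 degrees between the rotation axis and the velocity vector, and a total rotation of 3.89 radians) can be recomputed from the external motion parameters in Table 1. The sketch below assumes that the spin vector points along the rotation axis with magnitude equal to the rotation rate in degrees per frame, and that the rotation accumulates over the 15 frame intervals between frames 0 and 15; both are assumptions about how the table should be read.

```python
import math

spin = (-9.857159, 2.648302, 10.779173)        # deg/frame, from Table 1
velocity = (-0.192815, -0.042104, -0.087495)   # from Table 1


def norm(v):
    return math.sqrt(sum(c * c for c in v))


def angle_between_deg(u, v):
    """Angle between two 3-vectors, in degrees."""
    cosine = sum(a * b for a, b in zip(u, v)) / (norm(u) * norm(v))
    return math.degrees(math.acos(cosine))


axis_velocity_angle = angle_between_deg(spin, velocity)

frame_intervals = 15  # assumption: frames 0 through 15
total_rotation_rad = math.radians(norm(spin) * frame_intervals)

print(round(axis_velocity_angle, 1), round(total_rotation_rad, 2))
```

Under these assumptions the recomputed values agree with the quoted angle and total rotation to the printed precision.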

It was found that for a cube, a Necker reversed interpretation
always exists as a local minimum of the
objective function. For a cube subtending a small angle,
or whose motion in depth is small compared to the
dimensions of an edge, this alternate interpretation occasionally
has a lower residual than the “correct” interpretation
due to noise.

In  general,  in  terms  of  being  able  to  recover  egomotion 
and  being  able  to  estimate  times  to  collision,  the  algo¬ 
rithm  performs  quite  satisfactorily  (except  near  the  focus 
of  expansion,  or  for  extremely  distant  features)  if  errors 
on  the  order  of  5%  to  10%  are  deemed  acceptable.  For 
this  task,  an  image  resolution  of  512  by  512  is  marginal. 
Correspondingly better results could be achieved at a 1K
by 1K, or 2K by 2K resolution.

6.1.2  A  real  test  case 

The  Rolling  Tire  sequence  was  first  analyzed  by 
Broida  in  section  8.3  of  his  dissertation  [Broida,  1987], 
where  he  referred  to  it  as  the  Car  sequence.  Eight  fea¬ 
tures  (adhesive  dots)  were  placed  on  the  right  front  tire 
of  a  car.  The  car  was  translated  (roughly)  uniformly 
from  left  to  right,  while  also  approaching  the  camera. 
Therefore, the features were undergoing a cycloidal type
motion  in  the  image  plane.  Figure  1  shows  the  first  and 
last  frames  of  this  image  sequence. 

The  rigid  chronogeneous  SFM  algorithm  finds  three 
different  clusters  of  solutions.  The  worst  of  these  solu¬ 
tions  has  an  optimal  residual  well  over  sixty  thousand 
times  the  best,  and  is  a  totally  spurious  solution.  The 
second  best  solution  still  has  a  residual  more  than  fif¬ 
teen  times  the  best,  and  is  not  a  serious  alternative  to 
the  best  solution. 

Tables  1  and  2  display  respectively  the  computed  ex¬ 
ternal  motion  parameters  and  computed  structure  for 
the  best  solution.  For  this  solution,  a  small  negative 
vertical  velocity  component  is  found,  which  is  consistent 
with  the  fact  that  the  camera  was  pointed  slightly  down¬ 
ward.  Figure  2  shows  the  approximate  reconstructed 
image plane trajectories^8 for the best solution found by
the  rigid  chronogeneous  SFM  algorithm,  when  overlaid 
on  the  initial  frame  of  the  image  sequence. 
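Because the motion model is rigid, the structure reported in Table 2 for the initial and final frames should have the same interpoint distances. The sketch below checks this using the coordinates printed in Table 2; the helper name `interpoint_distances` is illustrative, and `math.dist` requires Python 3.8 or later.

```python
import math
from itertools import combinations

# Computed structure from Table 2 (points 0 through 7).
initial = [
    (1.5251, -0.0786, 5.8790), (1.6455, -0.0243, 5.9683),
    (1.9061, 0.3491, 6.1166), (1.9349, 1.0020, 6.0065),
    (1.5632, 1.3655, 5.6021), (1.2515, 1.2271, 5.3752),
    (0.9885, 0.9295, 5.2138), (0.9841, -0.0353, 5.4382),
]
final = [
    (-1.7093, 0.3168, 3.7162), (-1.7757, 0.1757, 3.6827),
    (-1.8029, -0.2935, 3.7747), (-1.5095, -0.7328, 4.1750),
    (-1.0292, -0.6294, 4.6137), (-0.8608, -0.2668, 4.7026),
    (-0.7994, 0.1555, 4.6610), (-1.2910, 0.7641, 4.0534),
]


def interpoint_distances(points):
    """Distances between all unordered pairs of points, in a fixed order."""
    return [math.dist(p, q) for p, q in combinations(points, 2)]


d_initial = interpoint_distances(initial)
d_final = interpoint_distances(final)

# Largest change in any interpoint distance between the two frames.
max_change = max(abs(a - b) for a, b in zip(d_initial, d_final))
print(len(d_initial), max_change)
```

The 28 interpoint distances agree between the two frames to within the rounding of the printed coordinates, as expected for a rigid reconstruction.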

We now compare our results to those computed by
Broida’s batch algorithm. Table 3 compares the interpoint
distances computed by our algorithm to those computed
by Broida’s algorithm, and to the ground truth.^9
Our solution has been rescaled so that the average computed
interpoint distance is equal to the average interpoint
distance for the ground truth. Our results are
much better than Broida’s. Our largest interpoint distance
error is less than 2.4%. Many of Broida’s distances
have errors of roughly 50%, and the largest error is nearly
74%.

Since the tires were (roughly) pointed in the direction
of travel, the angle between the axis of rotation and the
velocity vector should be 90 degrees. The angle computed
by our algorithm was 74.7 degrees. Broida’s algorithm
computed an angle of roughly 1.50 degrees. The
total rotational angle was measured to be 3.85 radians.
This compares to our result of 3.89 radians, and Broida’s
result of 3.96 radians. The total translational distance
was measured to be 45.0 inches. Broida's result was 43.5
inches, and ours was 39.0 inches, which was reasonable
considering the CRLB estimated standard deviation was
5.5 inches.

The only plausible explanation for our algorithm performing
so much better than Broida’s is that his algorithm
converged to a local minimum in this particular
case. However, the solution found by his algorithm does
not correspond to any of the solutions found by our algorithm.
Perhaps the image plane norm (which Broida’s
algorithm was minimizing) and the pseudo-perspective
norm have different nonglobal minima, or perhaps our
algorithm did not find that particular local minimum.

^8 The input data was preprocessed and there was some
difficulty getting the scaling and the offset correct.

^9 The ground truth and the estimates generated by
Broida’s algorithm were extracted from Table 8.5 on page 183
of his dissertation [Broida, 1987]. We have corrected a couple
of typographical errors that appeared in the original (private
communication with the author).

Figure 1: First and Last Frames (Frames 0 and 15) of
Rolling Tire Sequence

Figure 2: Reconstructed Trajectories for the Rolling Tire
Sequence

Structure for initial frame:
POINT 0: (  1.5251, -0.0786,  5.8790)
POINT 1: (  1.6455, -0.0243,  5.9683)
POINT 2: (  1.9061,  0.3491,  6.1166)
POINT 3: (  1.9349,  1.0020,  6.0065)
POINT 4: (  1.5632,  1.3655,  5.6021)
POINT 5: (  1.2515,  1.2271,  5.3752)
POINT 6: (  0.9885,  0.9295,  5.2138)
POINT 7: (  0.9841, -0.0353,  5.4382)

Structure for final frame:
POINT 0: ( -1.7093,  0.3168,  3.7162)
POINT 1: ( -1.7757,  0.1757,  3.6827)
POINT 2: ( -1.8029, -0.2935,  3.7747)
POINT 3: ( -1.5095, -0.7328,  4.1750)
POINT 4: ( -1.0292, -0.6294,  4.6137)
POINT 5: ( -0.8608, -0.2668,  4.7026)
POINT 6: ( -0.7994,  0.1555,  4.6610)
POINT 7: ( -1.2910,  0.7641,  4.0534)

Table 2: Computed Structure for the Rolling Tire Sequence

Table 3: Interpoint Distances for the Rolling Tire Sequence


6.2  Uniform 3D acceleration

When motion lies in the class of uniform 3D acceleration,
a closed form algorithm results when the quadratic
norm is minimized. The algorithm involves finding the
eigenvector corresponding to the minimum eigenvalue of
a certain (6 x 6) matrix. The motion parameters and
structure are recovered by applying appropriate linear
transformations to this eigenvector. The six eigenvectors
found by the algorithm are used to generate six of
the initial guesses for the rigid chronogeneous SFM algorithm.

This algorithm is described in detail in [Franzen,
1991b, Chapter 4] and in [Franzen, 1991a]. This algorithm
is quite fast and could run in real time on current
generation hardware, if a processor per feature were
available.


6.3  Pure  rotation 

We  have  developed  an  iterative  algorithm  that  recovers 
refined  image  plane  position  and  rotational  motion  pa¬ 
rameters  when  the  motion  consists  of  a  uniform  rotation 
about  the  optical  center  of  the  camera.  This  algorithm 
minimizes  the  pseudo-perspective  norm.  It  is  used  to 
generate one of the initial guesses for the rigid
chronogeneous SFM algorithm.

The  pure  rotation  algorithm  converges  quite  fast,  but 
there  is  room  for  further  improvement,  as  it  does  not 
always  find  the  global  minimum.  By  imposing  certain 
restrictions,  a  closed  form  quaternion  method  could  be 
used  to  generate  a  better  initial  guess  than  the  all  zero 
guess that we have used. See [Franzen, 1991b,
sections 5.3.3 and 7.3.6] for further details.


7 “Danger - Will Robinson”^10

It is the nature of published research that what manifestly
does not work generally doesn't get published. I
would like to take this opportunity, however, to indicate
some things that caused me a great deal of wasted effort,
and perhaps are not generally appreciated, so that
other researchers may benefit.

At a certain point in our research, we tried to develop
an iterative SFM/MPE algorithm that could operate
when arbitrary interframe rigid transformations were
allowed. The algorithm attempted to minimize the image
plane error norm. This algorithm had tremendous
difficulty locating the global minimum when unintelligent
initial guesses were used. In addition, in synthetic
experiments when known original (noise free) configurations
were used as initial guesses, during the process of
minimization the solutions tended to diverge from what was
expected (although the residual was indeed reduced).
Admittedly, small numbers of points (say 8) were being
used. However, this has a bearing on recursive methods
such as [Cui et al., 1990] that compute a different rigid
transformation for each frame. Such methods may be
unstable/unreliable unless many features are tracked.

When the rigid chronogeneous SFM algorithm was
first being tested, there was only one initial guess,
namely the solution found by the closed form uniform
3D acceleration algorithm. In simulations, sometimes
the original chronogeneous algorithm gave good results,
and sometimes poor results. Since theoretical expected
values of the residuals had been derived, and the computed
residuals were on average too high (and tended to
be distributed in clusters), something was clearly amiss.
We determined that the algorithm was finding alternate
local minima. For certain test configurations, the
algorithm would converge to one of several different minima
as the added image plane noise was altered from one
test case to the next.^11 We hoped that by devising a
more sophisticated initial guessing scheme, or by adding
appropriate biasing terms, or by tweaking some of the
internal parameters of the numerical optimization routine
this problem could be eliminated. Way too much time
was spent on this endeavor. Although certain modifications
improved the results, try as we might, we couldn't
hit upon any scheme of generating single initial guesses
that converged to the global minimum over a wide range
of motions and structures more than roughly 90% of the
time. It was only after an intelligent scheme for
generating multiple initial guesses was devised that robust
behavior was achieved.
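The benefit of multiple initial guesses can be illustrated with a toy
problem. The sketch below is not the chronogeneous SFM algorithm: the
objective `f`, the step size, and the list of starting points are all
hypothetical choices made only to show a local minimizer landing in
different minima depending on where it starts.

```python
import math

def f(x):
    # Hypothetical 1D objective with several local minima (not from the paper).
    return 0.1 * x * x + math.sin(3.0 * x)

def df(x):
    # Analytic derivative of f.
    return 0.2 * x + 3.0 * math.cos(3.0 * x)

def descend(x, step=0.01, iters=5000):
    # Plain gradient descent: settles into the local minimum nearest the start.
    for _ in range(iters):
        x -= step * df(x)
    return x

def multi_start(guesses):
    # Run the local minimizer from every initial guess, keep the lowest result.
    finals = [descend(g) for g in guesses]
    return min(finals, key=f)

single = descend(4.0)                              # one guess: a poor local minimum
best = multi_start([-4.0, -2.0, 0.0, 2.0, 4.0])    # several guesses: a much lower one
```

Here the single guess converges to a shallow minimum near x = 3.6,
while the multi-start run finds the deepest minimum near x = -0.5.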

Many iterative computer vision algorithms have been
developed that minimize nonlinear functions. We
don't think that researchers have adequately addressed
whether, with what frequency, and under what conditions
global minima are actually found. Unless a very
good initial guess is available by some other mechanism,
or one has taken particular care to devise a good system
of initial guesses, one should realize that nonglobal

^10 Some of you may remember the television series Lost in
Space. Among other things, it featured a robot that often
issued dire warnings - unfortunately usually too late to be of
any real assistance.

^11 Commonly, two or three clusters of solutions were found.
For one early test configuration, there seemed to be four
clusters.


minima  may  be  found  a  nonnegligible  fraction  of  the 
time.  When  they  are  applicable,  it  is  for  this  reason 
that  (largely)  closed  form  algorithms  such  as  [Spetsakis, 
1991]  (which  handles  both  point  and  line  features  over 
exactly  3  frames)  and  [Tomasi  and  Kanade,  1991]  (which 
recovers  arbitrary  rigid  transformations  and  shape  over 
multiple  frames,  but  assumes  orthographic  projection) 
have  great  utility. 

8  Summary 

We have introduced a methodology for solving the
SFM and SFM/MPE problems. The approach is
one of function minimization, where the function to be
minimized is either the pseudo-perspective norm or the
quadratic norm. The resulting algorithms have the
following desirable properties:

• An arbitrary number of features and frames are
permitted (provided only that sufficient information is
supplied to properly constrain the solutions).

•  Both  point  and  line  features  are  handled  within  the 
same  framework. 

•  A  given  feature  need  not  be  visible  in  every  frame 
and  so  the  method  works  in  the  presence  of  occlu¬ 
sion  or  correspondence  “drop  outs.” 

•  It  is  possible  to  model  the  uncertainty  in  feature 
position  due  to  motion  blur. 

•  Serial  computation  times  are  generally  proportional 
to  the  number  of  correspondences. 

•  Efficient  parallel  implementations  can  be  developed 
in  a  straightforward  manner. 

Researchers  should  consider  minimizing  either  of  the 
aforementioned  norms  in  other  contexts,  especially  the 
(bias-corrected)  pseudo-perspective  norm.  These  norms 
seem  to  have  smoother  and  more  well-behaved  energy 
surfaces  than  the  image  plane  error  norm.  The  quadratic 
norm  should  not  be  minimized  without  bias  correction. 

Chronogeneous motion, chronogeneous transformation
matrices, and also chronogeneous coordinates were
introduced. The chronogeneous matrix representation
has many interesting properties which are discussed
further in [Franzen, 1988], [Franzen, 1989], and [Franzen,
1991b, Chapter 3].

SFM/MPE  algorithms  were  implemented  for  the  fol¬ 
lowing  classes  of  motion:  uniform  rotation  about  the  op¬ 
tical  center  of  the  camera,  uniform  3D  acceleration,  and 
rigid  chronogeneous  motion.  The  rigid  chronogeneous 
SFM  algorithm  is  very  robust. 

We  note  that  several  solution  error  metrics  not  oth¬ 
erwise  discussed  here  were  introduced  in  our  disserta¬ 
tion.  These  were  arrived  at  after  much  consideration, 
and  dissatisfaction  with  commonly  used  error  metrics 
such  as  mean  squared  3D  error.  Researchers  may  find 
them  useful  in  evaluating  the  performance  of  algorithms. 
See [Franzen, 1991b, section 6.1] for details.
Abstract

This  paper  addresses  the  problem  of  motion  seg¬ 
mentation  using  the  Singular  Value  Decomposition 
of  a  feature  track  matrix.  It  is  shown  that,  under 
general  assumptions,  the  number  of  numerically 
nonzero  singular  values  can  be  used  to  determine 
the  number  of  motions.  Furthermore,  motions  can 
be  separated  using  the  right  singular  vectors  asso¬ 
ciated  with  the  nonzero  singular  values.  A  relation¬ 
ship  is  derived  between  a  good  segmentation,  the 
number  of  nonzero  singular  values  in  the  input  and 
the  sum  of  the  number  of  nonzero  singular  values 
in  the  segments.  The  approach  is  demonstrated  on 
real  and  synthetic  examples.  The  paper  ends  with 
a  critical  analysis  of  the  approach. 

1  Introduction  and  Previous  Work 

The use of “motion” information has been around
for many years, and considerable research progress
has been made. Little of this work on motion,
however, addresses the issue of segmenting the motions
in a scene into different components but rather
assumes, often implicitly, that the images have
been previously segmented into their constituent
motions. Previous work on motion segmentation
includes:

• intensity based image segmentation using known
depth [Peleg and Rom, 1990], [Yamamoto, 1990],

• looking for “boundaries” in “optic-flow fields”
(i.e. assuming locally uniform motion, but not
knowledge of depth), [Shizawa and Mase, 1990],
[Adiv, 1985], [Murray and Buxton, 1987],

• clustering in some type of a priori defined
parametric motion space [Fennema and Thompson,
1979], [Dickmanns, 1989],

• techniques using Markov Random Fields to find
both a segmentation and description of a motion
sequence [Heitz and Bouthemy, 1990],
[Subrahmonia et al., 1990],

• and a technique which looks for 2 motion components
in an image sequence [Bergen et al.,
1990b], [Bergen et al., 1990a].

Much  of  this  work  has  made  restrictive  assump¬ 
tions  on  the  scene/motion  to  allow  segmentation 
or  assumes  considerable  a  priori  information  (e.g. 
a  depth  map). 

The approach we introduce in this paper segments
the motions in a scene into their different
rigid body motions without knowledge of camera
or object motion, shape, or depth. We assume
an input of tracked features and, except for the
smoothness necessary for that tracking, the approach
to segmentation and motion/shape recovery
does not require smoothness assumptions on
either the objects or the motion. To obtain these
desirable features we use the elegant “factorization
approach” to shape/motion recovery. The
idea of using SVD factorization of motion tracks
was recently introduced (see [Tomasi and Kanade,
1990a], [Tomasi and Kanade, 1990b]) and is
discussed in more detail elsewhere in these proceedings.

The  paper  is  organized  as  follows:  we  first  dis¬ 
cuss  the  SVD  in  a  little  more  detail.  We  then 
present  a  short  review  of  Tomasi  and  Kanade’s 
ground  breaking  work  on  SVD  factorization  and 
motion  recovery.  This  material  in  hand,  we  de¬ 
scribe  our  approach  in  Section  3.  Section  4  presents 
some  experimental  testing  of  the  approach,  fol¬ 
lowed  by  a  critical  analysis  in  Section  5.  We  end 
with  some  conclusions  and  a  discussion  of  future 
work. 
2 Background

In this section we present background material
on the singular value decomposition and on the
motion factorization technique.

2.1  Some  properties  of  the  SVD 

Assume, without loss of generality, $I$ is an $m \times p$
matrix, $m \ge p$. By computing the Singular Value
Decomposition (SVD) of the input $I$ we can obtain
3 matrices $L$, $\Sigma$, $R$ such that $I = L \Sigma R^T$.
Furthermore, $L = [l_1, l_2, \ldots, l_p]$ is an $m \times p$
matrix with orthonormal columns, $R = [r_1, r_2, \ldots, r_p]$
is a $p \times p$ orthogonal matrix, and $\Sigma$ is a diagonal
$p \times p$ matrix $\mathrm{diag}(\sigma_1, \sigma_2, \ldots, \sigma_p)$
with $\sigma_1 \ge \sigma_2 \ge \ldots \ge \sigma_p \ge 0$.
The value $\sigma_i$ is referred to as the $i$th singular value,
and the column vectors $l_i$ and $r_i$ are, respectively, the
$i$th left singular vector and the $i$th right singular
vector.

We now state some properties of the SVD which
will be useful in the remainder of the paper. There
is a strong relationship between SVD and
eigenvalues / eigenvectors. In particular, the squares of the
singular values are the eigenvalues of $I^T I$ and the right
singular vectors are its eigenvectors. Since $I^T I$ is a
symmetric positive semidefinite matrix, these eigenvalues
are real and nonnegative, so the singular values are their
square roots. To help in visualizing
how the SVD extracts much of the structure
inherent in a matrix, it is also helpful to know that
the singular values of the matrix $I$ are precisely the
lengths of the semi-axes of the hyperellipsoid $E$ defined
by $E = \{ y \mid y = Ix, \|x\|_2 = 1 \}$. Furthermore,
the directions of the axes of this ellipsoid are given
by the columns of $L$. Thus the SVD is strongly
related to the principal component analysis of the
data in $I$.
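The link between the SVD and the eigen-decomposition of $I^T I$ can
be checked in one line. The calculation below assumes only the
decomposition $I = L \Sigma R^T$ with the columns of $L$ and of $R$
orthonormal, so that $L^T L$ and $R^T R$ are identity matrices:

```latex
I^T I = (L \Sigma R^T)^T (L \Sigma R^T)
      = R \Sigma L^T L \Sigma R^T
      = R \Sigma^2 R^T ,
\qquad\text{hence}\qquad
(I^T I)\, r_i = \sigma_i^2 \, r_i .
```

Each right singular vector $r_i$ is therefore an eigenvector of
$I^T I$, with eigenvalue $\sigma_i^2$.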

The relationship between SVD and principal components
explains why we are able to perform motion
segmentation using clustering of the right singular
vectors. Principal component analysis is generally
performed to reduce the dimensionality of
a data set with many interrelated variables to a
much smaller set, the principal components, while
retaining as much of the original variation as possible.
Each principal component is composed of
a linear function $a$ which operates on the vector
$x$ of random variables and maximizes the variance
(and is uncorrelated with previously found components).
The $k$th principal component is given by
$a_k^T x$ where $a_k$ is an eigenvector of the covariance
matrix of $x$ corresponding to its $k$th eigenvalue.
Since the right singular vectors of the SVD are the
eigenvectors of $I^T I$, they are precisely the principal
components. In our case, each random variable
is the location of a feature point given at different
times. Random variables are related in a definite
way according to whether they are part of the same
motion. This is ultimately expressed in the principal
components.

If the SVD of $I$ is such that
$\sigma_1 \ge \ldots \ge \sigma_k > \sigma_{k+1} = \cdots = \sigma_p = 0$,
then we know $\mathrm{Rank}(I) = k$
and $I = \sum_{i=1}^{k} \sigma_i l_i r_i^T = L_k \Sigma_k R_k^T$. When the
matrix $I$ is noisy, the issue of determining “numerical”
rank, call it $\mathcal{R}(I)$, is more difficult and will
be discussed in section 3.5.

Finally, we comment on the computational complexity
of the SVD algorithm. For our needs, the
straightforward approach to the SVD costs around
$7mp^2 + 4p^3 + O(p^2 + mp)$ FLOPS for a matrix with
$m$ rows and $p$ columns. Code for computing the
SVD can be found in any good numerical package
(LINPACK, EISPACK, NAG, IMSL). We use a locally
modified version of the code from Numerical
Recipes in C [Press et al., 1988].

2.2 Background on the shape/motion factorization technique

Our  discussion  of  the  factorization  technique  has 
been  broken  into  four  terse  subsections.  We 
will  treat  both  the  2D  case  (motion  restricted  to  a 
plane)  and  the  3D  case.  The  rationale  for  this  is 
that  the  2D  case  is  significantly  easier  to  visualize 
and  present  while  the  methods  are  very  similar  in 
practice. 

All of the conceptual content of this section follows
from the work of Tomasi and Kanade. For
the 2D problem we follow [Tomasi and Kanade,
1990c] and for the 3D problem we follow [Tomasi
and Kanade, 1991]. This section is intended to
introduce notation for our development, and does
not thoroughly explore the factorization-based approach.

The  following  assumptions  are  needed  to  make 
the  approach  feasible: 

• the imaging system is orthographic,

• there are at least 3 frames,

• there exists a feature tracker which can, given
the image sequence, solve the frame to frame
correspondence problem and track points over
extended periods of time,

• each object in motion yields ≥ 3 points which
are tracked over the entire image sequence,

• a motion must have a nonzero rotational component.

Again,  these  are  minimal  assumptions.  In  general 
performance  will  be  better  with  more  frames  and 
more  points  taken  over  a  wide  change  in  rotational 
angles. 

2.2.1  Input  representation:  tracks  of  shapes 
in  motion 

The input to the factorization procedure is a matrix
$I$, representing tracks, i.e. image positions of
feature points over time. We assume there are $p$
feature points over $f$ frames, and that in frame $i$
point $j$ is at pixel location $(U_{ij}, V_{ij})$ in the image
plane. These can be interpreted as a pair of
matrices $U$ and $V$ giving the horizontal and vertical
component of the point's location respectively.
Let $U'$ and $V'$ represent the same position matrices
where each row has been shifted so as to have
mean 0 (i.e. independently subtract the centroid
of each frame from every element of that row). It
is not necessary that every frame of the motion sequence
be in the input matrix. In particular, while
dense sampling may be necessary for tracking, the
input matrix can be made from significantly fewer
frames.

In the 2D case, our input matrix is $f \times p$ and
is simply $U$. For the 3D case we assume an input
matrix $W$ which is $(2f) \times p$ with
$W = \begin{bmatrix} U' \\ V' \end{bmatrix}$. If the
dimensionality of the input does not matter, we
will use $I$ to represent the input track matrix of
size $m \times p$.

2.2.2  Representing  Shape 

We represent a point in the world coordinate
system as $p_i$. For the projection of this point in the
2D case, we adopt a homogeneous representation
$(x_i, z_i, 1)$, so we can represent translation. In 3D,
where translation is removed by subtracting the
centroid of each frame, we use the representation
$(x_i, y_i, z_i)$. We can then collect $p$ of these points into
a shape matrix, say $S$ of size $3 \times p$, by considering
each point as a column in the matrix.


2.2.3  Representing  Motion 

In 2D let $\alpha_i$ be the angle between the $X$ axis and
the camera in frame $i$. The projection of point $p_j$
into the image is given by
$U_{ij} = [\cos(\alpha_i), \sin(\alpha_i), t_i] \, p_j$
where $t_i$ is the projection, onto the
frame $f_i$, of the translation vector (as measured
from the first frame). We can collect this into a
motion matrix, $M_{2d}$.

In the 3D representation, we assume a fixed world
coordinate system. In this coordinate system let
$r_i$ be a unit vector (represented by its endpoint)
which is aligned with the image rows in frame $f_i$
and let $c_i$ be the unit vector aligned with the image
columns in frame $f_i$. Given this, we see that
$U_{ij} = r_i \cdot p_j$ and $V_{ij} = c_i \cdot p_j$. Thus we can build
a $(2f) \times 3$ motion matrix taking $r_i$, $i = 1..f$ as the
first $f$ rows and $c_i$, $i = 1..f$ as rows $(f + 1) \ldots 2f$.

With this notation we have $U = M_{2d} S$ and
$W = M_{3d} S$, or letting the dimension be implicit
$I = MS$.
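The 3D relation $W = M_{3d} S$ can be illustrated numerically. The
sketch below rests on several assumptions not in the text: the camera
rotates about the vertical axis only, the point set and per-frame
image shifts are arbitrary, and the helper names (`rank`,
`track_matrix`, `center_rows`) are hypothetical. Rank is computed by
Gaussian elimination as a simple stand-in for counting nonzero
singular values.

```python
import math

def rank(rows, tol=1e-9):
    """Numerical rank via Gaussian elimination with partial pivoting."""
    a = [list(map(float, r)) for r in rows]
    n_rows, n_cols = len(a), len(a[0])
    rk = 0
    for col in range(n_cols):
        if rk == n_rows:
            break
        piv = max(range(rk, n_rows), key=lambda r: abs(a[r][col]))
        if abs(a[piv][col]) <= tol:
            continue  # no usable pivot: this column adds no new direction
        a[rk], a[piv] = a[piv], a[rk]
        for r in range(rk + 1, n_rows):
            factor = a[r][col] / a[rk][col]
            for c in range(col, n_cols):
                a[r][c] -= factor * a[rk][c]
        rk += 1
    return rk

def track_matrix(points, angles, shifts):
    """Orthographic tracks: the f rows of U stacked on the f rows of V."""
    U, V = [], []
    for theta, (tx, ty) in zip(angles, shifts):
        r = (math.cos(theta), 0.0, -math.sin(theta))  # image row direction r_i
        c = (0.0, 1.0, 0.0)                           # image column direction c_i
        U.append([sum(a * b for a, b in zip(r, p)) + tx for p in points])
        V.append([sum(a * b for a, b in zip(c, p)) + ty for p in points])
    return U + V

def center_rows(m):
    """Subtract each row's mean (the per-frame centroid) from that row."""
    return [[v - sum(row) / len(row) for v in row] for row in m]

points = [(0, 0, 0), (1, 0, 0), (0, 1, 0), (0, 0, 1), (1, 1, 1)]
angles = [0.0, 0.3, 0.6, 0.9]
shifts = [(0.0, 0.0), (1.0, 2.0), (2.0, -1.0), (3.0, 0.5)]

W_raw = track_matrix(points, angles, shifts)   # (2f) x p, translation included
W = center_rows(W_raw)                         # registered track matrix
```

With these inputs the uncentered matrix has rank 4, while the centered
matrix has rank 3, matching a single nondegenerate rigid motion.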

2.2.4  Factoring  the  input  matrix 

Let us consider the SVD of $I$, and let $\hat{\Sigma}$ be the
submatrix of $\Sigma$ containing the numerically nonzero
singular values. Let $\hat{L}$ ($\hat{R}$) be the associated
columns of $L$ ($R$ respectively). Then, as is always
true for the SVD, $I = \hat{L} \hat{\Sigma} \hat{R}^T$. We drop
the hat for simplicity and henceforth the exact interpretation
of $L$, $\Sigma$ or $R$ will be given by the context.

Note that the dimensions of the $L$ and $R$
matrices are exactly the same as those of the
shape and motion matrices because, if the
motion/shape is not degenerate, then $\mathcal{R}(I) = 3$. If
we let $M' = L \cdot \Sigma^{1/2}$ and $S' = \Sigma^{1/2} \cdot R^T$,
then we even have the same form $I = M' \cdot S'$. In fact, these
matrices are, up to an affine transformation, exactly
the shape and motion matrices, i.e. there exists a
nonsingular $A$ such that $M = M' \cdot A$, $S = A^{-1} S'$.
The exact procedure for obtaining $A$ depends on
the dimension as detailed in [Tomasi and Kanade,
1990c] and [Tomasi and Kanade, 1991]. Intuitively,
$A$ is obtained by requiring the resulting motion matrix
to have the proper orthogonality properties.
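Why the factorization is determined only up to $A$ can be seen from a
short calculation. It uses nothing beyond the definitions of $M'$ and
$S'$ given above, with $A$ standing for any nonsingular $3 \times 3$
matrix:

```latex
M' S' = (L \Sigma^{1/2})(\Sigma^{1/2} R^T) = L \Sigma R^T = I ,
\qquad
(M' A)(A^{-1} S') = M' (A A^{-1}) S' = M' S' = I .
```

Every choice of $A$ reproduces the track matrix equally well, which is
why extra constraints on the motion matrix are needed to single out
the true $M$ and $S$.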

3 Description of factorization-based segmentation of multiple motions

In this section we first establish a relationship
between the number of singular values and the
number of motions. We present a simple example
of the approach on real data. We end this section
with a more technical discussion of the determination
of rank and our cluster analysis process.

3.1  Representing  multiple  motions  and  the 
related  number  of  singular  values 

When there is a single motion we consider the
same representation as in [Tomasi and Kanade,
1990c] and [Tomasi and Kanade, 1991] (see Section
2.2), which results in a matrix formulation of
$I = MS$, where $I$ is $m \times p$, $M$ is $m \times 3$, $S$ is $3 \times p$
and $m = 2f$ or $f$ depending on the dimension.

We now consider the case where there are two
motions. Let $M'$ and $M''$ be motion matrices of
size $m \times 3$. Let the associated shape points be
$S'$, $S''$ consisting of $p'$ and $p''$ points respectively.
Assume these points are associated (in some order)
with tracks $t_i$, $i = 1, \ldots, (p' + p'')$. Then we can
represent the track matrix $I = MS$, where
$M = [M' \mid M'']$ with $\mid$ being matrix concatenation. Let
each column $S_j$ be of the form
$[S'_{j,1}, S'_{j,2}, S'_{j,3}, 0, 0, 0]^T$
if track $t_j$ is associated with a point $S'_j$ and be of
the form $[0, 0, 0, S''_{j,1}, S''_{j,2}, S''_{j,3}]^T$ if track $t_j$
is associated with point $S''_j$.

That such a decomposition accounts for the track
matrix is easily shown. Generalizing this to $N$ motions
is straightforward, resulting in a motion matrix
$M$ being $m \times 3N$ and the shape matrix being
$3N \times \sum_{k=1}^{N} p_k$.
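The two-motion decomposition can be written out in block form. The
display below assumes, purely for illustration, that the tracks are
ordered so that the points of the first motion come first; in general
the columns appear in some permuted order:

```latex
I = M S =
\begin{bmatrix} M' & M'' \end{bmatrix}
\begin{bmatrix} S' & 0 \\ 0 & S'' \end{bmatrix}
=
\begin{bmatrix} M' S' & M'' S'' \end{bmatrix} .
```

Here $M$ is $m \times 6$ and $S$ is $6 \times (p' + p'')$, so each
block of columns of $I$ is the track matrix of one motion alone.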

3.1.1 Multiple motions and factorization: Definitions and observations

In some ways the motion matrix $M$ is, in reality,
not a motion at all. It is not an optic flow; it is
not a parametric equation defining the path of the
object; it is not even, necessarily, the path taken
by any point in the scene. Instead, each row in
the motion matrix can be interpreted as the transformation
which takes a world point, in a fixed
frame of reference, to its projection in the associated
image frame. A motion is then a subspace
of $f$-dimensional space defined by the span of the
columns of $M$. The path (i.e. $(x(t), y(t), z(t))$) taken
by any single object point is represented as a single
point in this $f$-dimensional space, and hence we
call this path space. A rigid body motion requires
all points being considered to follow paths which
are within the span of the columns of the associated
motion matrix $M$. Since we cannot directly measure
$M$, we will generally consider the information
in the columns of $I$ to define the observable motion.
We will use motion to mean either real motion
($M$) or observable motion ($I$) where the context
should disambiguate the interpretation. We
now explore the ramifications of this representation.

In what follows, let $I_i$, $i = 1..N$ be track matrices
of different motions with $\mathcal{R}(I_i) > 0$. Let $M_i$ be
the associated motion matrices and $S_i$ the shapes.
Assume no noise so that $\mathcal{R}(I) = \mathrm{Rank}(I)$.

Definition 1 A track matrix $I$ corresponds to a
motion if and only if (hereafter iff) its columns
span a nonempty subset of path space.

Observation 1-1 A track matrix $I$ corresponds
to a motion iff $\mathcal{R}(I) > 0$.

Definition 2 Assume $I_1$ is associated with a single
motion ($M_1$ and $S_1$ assumed to be $f \times 3$ and
$3 \times p$ respectively). $I_1$ is called nondegenerate
iff $\mathcal{R}(I_1) = 3$. That is, a nondegenerate motion
spans a 3 dimensional subspace of path space.
(Similarly define nondegenerate $M_1$ and $S_1$.)

Observation 2-1 For $N$ motions we have
$\mathcal{R}(M) \le 3N$, $\mathcal{R}(S) \le 3N$ and hence it follows
that $\mathcal{R}(I) \le 3N$.

Observation 2-2 A nondegenerate track matrix
is the product of a nondegenerate motion and
nondegenerate shape. Nondegenerate motions
require at least 3 frames. Nondegenerate shapes
require at least 3 points.

Definition 3 The track matrix $I_2$ contains a different
motion from that of $I_1$ iff the span of the
columns of $I_2$ is not contained within the span of
the columns of $I_1$. That is, $I_2$ has a motion different
from $I_1$ iff there are points in path space
which could be associated with $I_2$ which could
never be generated by the motion underlying $I_1$.

Observation 3-1 It follows that $I_2$ contains a motion
different from that of $I_1$ iff
$\mathcal{R}(I_1 | I_2) > \mathcal{R}(I_1)$.

Definition 4 Two motions $I_1$ and $I_2$ are said to
be different motions iff they each contain a motion
different from the other. (Note that if the
span of $I_1$ is a proper subset of the span of $I_2$
then $I_2$ contains a motion different from $I_1$, but
not the other way around.)
Observation 4-1 Obviously $I_2$ and $I_1$ are different
iff $\mathcal{R}(I_1 | I_2) > \max(\mathcal{R}(I_1), \mathcal{R}(I_2))$.

Observation 4-2 Note that such definitions, in
terms of subspaces, may not totally capture the
intuitive notion of “different motions”. If there
are two well separated clusters within a single
subspace, humans might interpret them as different
motions. This definition is saying that
two motions are the same if there exists a rigid
body interpretation that makes them the same.
It does not preclude using additional information
to further label submotions with different
labels. (In fact, the algorithm to be presented
can often “segment” these well separated clusters;
however, it knows that they are, according
to the above definitions, 2 clusters from the
same motion.)

Observation 4-3 Adding a single track from a
“different” motion to $I$ must increase the rank
of $I$ by 1.

Observation 4-4 If track matrix $I$ is decomposed
into 2 parts $I_1$ and $I_2$, then
$\mathcal{R}(I) \le \mathcal{R}(I_1) + \mathcal{R}(I_2)$.

Definition 5 $I_1$ and $I_2$ are called linearly independent
motions iff $\mathcal{R}(I_1 | I_2) = \mathcal{R}(I_1) + \mathcal{R}(I_2)$.

Observation 5-1 A track matrix from $N$ linearly
independent motions will have $\mathcal{R}(I) = 3N$.

Observation 5-2 If a track matrix $I$, composed
of $N$ linearly independent nondegenerate motions
(containing at least 4 points each), is decomposed
into 2 parts $I_1$ and $I_2$, then
$\mathcal{R}(I) = \mathcal{R}(I_1) + \mathcal{R}(I_2)$ iff for every set of
shape points $\{S\}_j$ associated with a single motion $\{M\}_j$ in $I$,
the tracks from these points/motions are contained
entirely in either $I_1$ or $I_2$. In simpler
terms, under general motion assumptions the
segmentation of $I$ is good (but not necessarily
complete) if and only if the number of nonzero
singular values before segmentation is the same
as the sum of the nonzero singular values of the
segments.

To prove this last observation, let $\{M_i S_i\}_j$ be
the tracks from motion $i$ that appear in segment
$j$, with the understanding that if no points of motion
$i$ appear in segment $j$, then $\{M_i S_i\}_j$ is empty.
Also let segment $j = 0$ refer to the original
(unsegmented) data. By definition we have, modulo some
column permutations,
$I_j = [\{M_1 S_1\}_j \mid \ldots \mid \{M_N S_N\}_j]$,
which implies
$\mathcal{R}(I_j) = \sum_{i=1}^{N} \mathcal{R}(\{M_i S_i\}_j)$. Thus
we can restate our observation as

$$\sum_{i=1}^{N} \mathcal{R}(\{M_i S_i\}_0) \le \sum_{i=1}^{N} \mathcal{R}(\{M_i S_i\}_1) + \sum_{i=1}^{N} \mathcal{R}(\{M_i S_i\}_2) \qquad (1)$$

with equality holding if and only if the segmentation
is good. That a good segmentation implies
equality follows directly from the above equation
and the definition of linear independence. To show
that equality implies a good segmentation we argue
as follows. First, note that since the number
of nonzero singular values is always nonnegative,
the splitting of any motion cannot reduce the right
hand sum. Since each motion contains at least 4
points, either all of its points are in one segment,
in which case it contributes a value of $n$ to the
right hand side, or its points are split across the
partition and it contributes a minimum of $n + 1$
to the right hand sum. Noting that each motion's
contribution to the left hand side is exactly $n$, the
observation follows.

One  of  the  reasons  for  proving  the  observation  as 
we  did  is  that  it  provides  insight  to  what  may  hap¬ 
pen  if  the  assumptions  of  the  observation  are  vio¬ 
lated.  If  there  are  motions  with  only  3  points,  then 
a  segmentation  may  fail  in  an  undetectable  man¬ 
ner.  More  of  a  potential  problem  is  that  if  there  are 
motions  which  are  linearly  dependent,  then  there 
may  exist  incorrect  segmentations  into  two  groups 
which  will  not  increase  the  total  number  of  singular 
values.  Luckily,  the  observation  often  holds  even 
if  there  are  linearly  dependent  motions.  If  some  of 
the  motions  are  dependent,  the  equality  will  hold 
if  and  only  if  all  the  points  in  the  linearly  depen¬ 
dent  subset  of  motions  are  grouped  into  a  single 
partition. 

The importance of this final observation should
not be overlooked. It provides a theoretical basis
for checking the segmentation. If we start with $3N$
singular values and eventually get $N$ subsets with
3 singular values each, we know that we have a
correct segmentation!
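The rank test for a good segmentation can be sketched on a small
noise-free example. Everything concrete here is a hypothetical
stand-in: the two "motions" are simply the first and last three
columns of a 6 x 6 lower-triangular matrix, the shapes are small
integer point sets, and rank is computed by Gaussian elimination
rather than by counting nonzero singular values.

```python
def rank(rows, tol=1e-9):
    """Numerical rank via Gaussian elimination with partial pivoting."""
    a = [list(map(float, r)) for r in rows]
    n_rows, n_cols = len(a), len(a[0])
    rk = 0
    for col in range(n_cols):
        if rk == n_rows:
            break
        piv = max(range(rk, n_rows), key=lambda r: abs(a[r][col]))
        if abs(a[piv][col]) <= tol:
            continue
        a[rk], a[piv] = a[piv], a[rk]
        for r in range(rk + 1, n_rows):
            factor = a[r][col] / a[rk][col]
            for c in range(col, n_cols):
                a[r][c] -= factor * a[rk][c]
        rk += 1
    return rk

def matmul(a, b):
    return [[sum(a[i][k] * b[k][j] for k in range(len(b)))
             for j in range(len(b[0]))] for i in range(len(a))]

def hcat(*blocks):
    # Column-wise concatenation (the "|" operator in the text).
    return [sum((blk[i] for blk in blocks), []) for i in range(len(blocks[0]))]

def columns(m, idx):
    return [[row[j] for j in idx] for row in m]

# Two linearly independent, nondegenerate hypothetical motions (6 x 3 each).
M = [[1.0 if j <= i else 0.0 for j in range(6)] for i in range(6)]
M1, M2 = columns(M, [0, 1, 2]), columns(M, [3, 4, 5])
S1 = [[1, 0, 0, 1], [0, 1, 0, 1], [0, 0, 1, 1]]   # 4 points of motion 1
S2 = [[1, 0, 0, 1], [0, 1, 0, 2], [0, 0, 1, 3]]   # 4 points of motion 2
I = hcat(matmul(M1, S1), matmul(M2, S2))          # tracks 0-3 and 4-7

def is_good_split(track, seg1, seg2):
    # Observation 5-2: good iff the segment ranks sum to the full rank.
    return rank(track) == rank(columns(track, seg1)) + rank(columns(track, seg2))

good = is_good_split(I, [0, 1, 2, 3], [4, 5, 6, 7])   # motions kept whole
bad = is_good_split(I, [0, 1, 2, 3, 4], [5, 6, 7])    # motion 2 split across segments
```

The correct split gives ranks 3 + 3 = 6, equal to the rank of the full
matrix, while the split that cuts through motion 2 gives 4 + 3 = 7 and
is rejected.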
3.2 Overview of approach

The  method  can  be  summarized  as  the  following 
steps: 

1. Find tracks (we do not address this issue in this
paper) and compute the track matrix $I$.

2. Compute the SVD, yielding $L$, $\Sigma$, $R$.

3. Determine $\mathcal{R}(I)$ and prune columns and rows
of $L$ and $R^T$ respectively.

4. Generate clusters, using $R$, to get potential
segmentations.

5. For a potential segmentation, remove the 'segmented'
tracks from the original image, yielding $I_i$,
and compute the SVD of each separately. If
$\mathcal{R}(I) = \sum_i \mathcal{R}(I_i)$ then we know the partition
is good; otherwise we try the next clustering at
this level. If the last clustering fails, backtrack
to the call to this level.

6. Given a good partition, if a cluster has $\le 3$
singular values we compute shape and motion;
else treat this as a new input matrix for the
next level of segmentation and recursively go to
step 2.
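
The rank check in step 5 can be illustrated numerically. The following is a minimal sketch, not the implementation used for the experiments in this paper. It assumes noise-free synthetic 2D tracks in which each motion is the product of a (frames x 3) motion matrix and a (3 x points) shape matrix; the names `numerical_rank`, `make_motion` and `partition_ok`, and the relative tolerance, are hypothetical.

```python
import numpy as np

def numerical_rank(a, tol=1e-8):
    """Count singular values above tol relative to the largest one."""
    s = np.linalg.svd(a, compute_uv=False)
    return int(np.sum(s > tol * s[0]))

def make_motion(rng, frames, points):
    """Noise-free tracks of one motion: (frames x 3) @ (3 x points)."""
    motion = rng.standard_normal((frames, 3))
    shape = np.vstack([rng.standard_normal((2, points)),
                       np.ones((1, points))])
    return motion @ shape

def partition_ok(tracks, groups):
    """True if the ranks of the column groups add up to the total rank."""
    total = numerical_rank(tracks)
    return total == sum(numerical_rank(tracks[:, g]) for g in groups)

rng = np.random.default_rng(0)
# Columns 0-5 belong to the first motion, columns 6-11 to the second.
tracks = np.hstack([make_motion(rng, 20, 6), make_motion(rng, 20, 6)])

good = [list(range(0, 6)), list(range(6, 12))]   # one motion per group
bad = [list(range(0, 8)), list(range(8, 12))]    # second motion is split

print(partition_ok(tracks, good), partition_ok(tracks, bad))
```

In this sketch the split that cuts through a motion raises the summed rank above the rank of the full track matrix, which is what the check detects.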

The first step, which we assume is handled by
some other process, determines the tracks of points
in the scene. We also expect an error estimate for
the track information. It is not really important
that the tracks are dense in either time or space,
although the quality of the motion estimate and
shape estimates depend on those densities respectively.

3.3  A  simple  2D  Example  using  Real  Data 

We now present a simple example which we will
follow through as we describe the algorithm in more
detail. The motion was obtained by moving a camera
in a plane using a precision Datel rotation stage
while keeping a high contrast scene in view. While
the camera rotated, 2 objects in the scene were
independently moved. One object (on the right) was
rotated about its axis, and the other was translated
with a small amount of local rotation. We obtained
our epi-image by grabbing one 512 pixel scanline
per frame time for 100 frames. To make the segmentation
task more difficult we moved the epi's of
the two “objects” closer to produce the epi-image
and its associated edges shown in Figure 1. The
high contrast objects allowed for easy “tracking”
of features using a simple Sobel edge detector and
edge linking which locally fits a quadratic to the
Sobel response and tracks the maximum with subpixel
precision. Tracks which were not continued
from the first line to the last line were not included
in the input matrix. Figure 2 shows the segmented
tracks determined by the algorithm.

Figure 1: Top shows original epi-polar image
(graylevel camera scanlines). The bottom figure
shows the tracks found for 57 points over 101
frames.


Figure  2:  Segmented  tracks  from  detected  motions 

3.4 Some Intuition as to how/why it works

To  help  the  reader  get  a  better  feel  for  what 
the  components  of  the  SVD  are,  we  present  some 
partial  reconstructions  and  component  analysis. 

By examining Figure 3, one can get a feel for
which components of the motion are associated
with each singular value. The scene has 20 frames
of 20 points (ten per motion) embedded into a
20x256 image. The singular values associated with
the input are 31.321, 3.724, 1.934, 0.284, 0.131,
0.000444, 0.000000, 0.000000, .... The figure shows
the incremental effects of reconstructing the tracks
using the SVD with only the first singular value
(all others set to zero), the first and the second,
and finally all three. In the image which uses only
the first singular value (second from top) we can
see that the leftward “flow” underlying the
tracks is captured. Including the components associated
with the second singular value (third from
top) adds some of the right hooking but still does
not capture everything. Adding in the third would
result in an image indistinguishable from the original
and is not shown. The last image (bottom)
shows one of the two motion segments recovered
(it is exact). The grouping for segmentation can
be found in the first right vector, or in other words,
the first principal component.
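
A partial reconstruction of this kind is easy to reproduce. The sketch below is illustrative only and does not use the data of Figure 3: it builds a hypothetical random rank-3 matrix and rebuilds it from its k largest singular values with all others set to zero. The function name `partial_reconstruction` is an assumption.

```python
import numpy as np

def partial_reconstruction(a, k):
    """Rebuild `a` from its k largest singular values (others set to zero)."""
    left, sing, right_t = np.linalg.svd(a, full_matrices=False)
    # Scale the first k left vectors by their singular values, then
    # multiply by the first k right vectors.
    return (left[:, :k] * sing[:k]) @ right_t[:k, :]

rng = np.random.default_rng(1)
# Hypothetical rank-3 "track matrix": 20 frames by 20 points.
a = rng.standard_normal((20, 3)) @ rng.standard_normal((3, 20))

# Frobenius-norm error of the reconstruction using 1, 2 and 3 values.
errors = [np.linalg.norm(a - partial_reconstruction(a, k)) for k in (1, 2, 3)]
print(errors)
```

For a matrix of rank 3 the error shrinks with each added singular value and is zero, up to rounding, once three are used.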

3.5  Determining  rank 

A central part of our segmentation algorithm requires
determining the rank, or number of nonzero
singular values, for the idealized input which leads
to the measured input matrix $I$. If, as is usually
the case, our desired input matrix, say $D$, is perturbed
by noise (error), say $E$, then we cannot expect
the singular value decomposition of $I = D + E$
to yield the exact number of singular values of $D$;
i.e. we need to do more than determine the number
of nonzero singular values. In determining an
approximation to the rank of $D$ we will use some
knowledge of $E$.

It can be shown that the difference between the
singular values of the ideal input, $\sigma_k(D)$, and the
singular values of the measured input, $\sigma_k(I)$, satisfies
the following properties:

$$|\sigma_k(I) - \sigma_k(D)| \le \epsilon\,\sigma_1(I), \quad \forall k \le p, \qquad (2)$$

$$|\sigma_k(I) - \sigma_k(D)| \le \sigma_1(E) = \|E\|_2 \le \|E\|_F, \quad \forall k \le p, \qquad (3)$$

and

$$\sum_{k=1}^{p} \big(\sigma_k(I) - \sigma_k(D)\big)^2 \le \|E\|_F^2 = \sum_{i,j} E_{ij}^2, \qquad (4)$$

where $\|\cdot\|_2$ and $\|\cdot\|_F$ are the second and Frobenius
matrix norms respectively, and $\epsilon$ is the machine
precision for computation. See [Golub and
van Loan, 1983, Sect 6.5 and Cor. 8.3.2, 8.3.5] for
proofs and more detail.

We now develop bounds on $\mathcal{R}(I)$. Let $k^*$ be
the smallest integer such that $\forall k > k^*$, $\sigma_k(D) = 0$.


Figure 3: Partial SVD reconstruction of 2 synthetic
2D motions. The first image shows a scaled version
of the original track image. The second shows the partial
reconstruction of the original tracks using the first
singular value, i.e. $l_1 \otimes r_1$. The third shows the
partial reconstruction using the first two singular
values. Adding the third singular value gives back
the original image. The final image shows the segmented
and reconstructed tracks of one of the motions.
Note the segmentation is exact. (See text
for more details.)


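Notice that for $k > k^*$ we have $\sigma_k(I) - \sigma_k(D) = \sigma_k(I)$.
Combining this observation with equations 2-4 we
have a lower bound

$$\mathcal{R}(I) \ge \max k \ \text{s.t.}\ \begin{cases} \sigma_k(I) > \epsilon\,\sigma_1(I), \\ \sigma_k(I) > \|E\|_F, \ \text{or} \\ \sqrt{\sum_{i=k}^{p} \sigma_i^2(I)} > \|E\|_F \end{cases} \qquad (5)$$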

These lower bounds on $\mathcal{R}(I)$ require a conservative
estimate (overestimate) of $\|E\|_F$. Note this relies
on relatively weak knowledge of the noise $E$, and
makes no assumption about the shape of the error
distribution. In the examples in this paper we assume
$\|E\|_F \le \delta_1$, where $\delta_1$ is set to be greater than
or equal to the expected point RMS error. For real
datasets, the expected RMS is computed by computing
the SVD of a known non-degenerate single
motion and subtracting the reconstruction (using
the first 3 singular values) from the original track
matrix. For the example images the estimated RMS was
.04.
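
The noise-based part of the lower bound can be sketched as follows. This is one reading of the bound rather than the authors' code: it keeps the largest k for which either the k-th singular value, or the root-sum-square of the singular values from k onward (the test implied by the Frobenius-norm inequality, eq. 4), exceeds a conservative noise bound. The machine-precision test is omitted, and the name `rank_lower_bound` and the sample values are hypothetical.

```python
import numpy as np

def rank_lower_bound(sing, noise_bound):
    """Largest 1-based k whose singular value, or whose tail energy,
    exceeds the noise bound; 0 if no k qualifies."""
    sing = np.asarray(sing, dtype=float)
    # tail[k] = sqrt(sum of sing[i]**2 for i >= k), with 0-based k.
    tail = np.sqrt(np.cumsum(sing[::-1] ** 2)[::-1])
    passing = np.flatnonzero((sing > noise_bound) | (tail > noise_bound))
    return int(passing[-1]) + 1 if passing.size else 0

# Hypothetical singular values, sorted in decreasing order.
sing = [10.0, 5.0, 2.0, 0.3, 0.2, 0.1]
print(rank_lower_bound(sing, 0.5))   # only the first three values pass
print(rank_lower_bound(sing, 0.35))  # the tail test admits a fourth value
```

The second call shows why the tail test matters: the fourth singular value is below the noise bound on its own, but together with the values after it the energy is too large to be noise.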

In our upper bound we assume that there is a
computable predicate $N_p(E)$, which when applied
to a matrix $E$ returns true only if $E$ “must” be
considered pure noise. As we will see later, because
we are interested in tracks of motion, it is
quite reasonable to assume that any matrix must
be noise if $|E_{ij}| < \delta_2,\ \forall i,j$. For the real examples
we have set a very conservative estimate of
$\delta_2 = .0016$. (Our upper bounds would be smaller
if we were less conservative.) Another reasonable
noise predicate can be obtained by assuming that
for all observable $I$ we will have $\|I\|_F > \delta_2$.

Assuming that the predicate $N_p(E)$ is conservative
(i.e. it never returns TRUE if the matrix $E$
could be valid data), we can get an upper bound
on $\mathcal{R}(I)$:

$$\mathcal{R}(I) \le \min k \ \text{s.t.}\ N_p\Big(I - \sum_{i=1}^{k} \sigma_i\, l_i \otimes r_i\Big). \qquad (6)$$

In words, the upper bound is the smallest $k$ such
that the difference between $I$ and the $k$th reconstruction
of $I$ must be noise.
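
The upper bound can be sketched in the same spirit. The code below is an illustration under stated assumptions, not the authors' implementation: the noise predicate is taken to be an entrywise magnitude threshold, the threshold and the synthetic data are hypothetical, and `rank_upper_bound` is an assumed name.

```python
import numpy as np

def rank_upper_bound(a, is_noise):
    """Smallest k such that `a` minus its k-term SVD reconstruction
    satisfies the noise predicate."""
    left, sing, right_t = np.linalg.svd(a, full_matrices=False)
    for k in range(len(sing) + 1):
        residual = a - (left[:, :k] * sing[:k]) @ right_t[:k, :]
        if is_noise(residual):
            return k
    return len(sing)

rng = np.random.default_rng(2)
# Hypothetical rank-3 data plus small bounded noise.
ideal = rng.standard_normal((20, 3)) @ rng.standard_normal((3, 20))
a = ideal + rng.uniform(-1e-4, 1e-4, ideal.shape)

# Assumed predicate: every entry of the residual is below a threshold.
threshold = 0.01
bound = rank_upper_bound(a, lambda r: bool(np.all(np.abs(r) < threshold)))
print(bound)
```

A looser predicate would accept a residual sooner and so give a smaller bound, which matches the remark that less conservative estimates tighten the upper bound.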

It should be noted that the sanctity of these
“bounds” depends on the conservative measures of
the noise models, while the distance between them
(and hence the usefulness in approximating $\mathcal{R}(I)$)
depends on the error model being as sharp as possible,
i.e. not too conservative. In most cases the
upper and lower bounds differ by more than 1 and
hence only restrict $\mathcal{R}(I)$, rather than determining
it. Because, as we shall see later, it is convenient
for our algorithm to have a single number for $\mathcal{R}(I)$,
in these cases we make a final “heuristic” determination
of it using a knee finding technique on the
logarithm of singular values between the upper and
lower bound. That is, we choose as $\mathcal{R}(I)$ the $i$ that
has maximal curvature on the curve $(i, \log(\sigma_i))$ for
$i$ in the range determined by equations 5-6. This
last heuristic step has correctly determined $\mathcal{R}(I)$
in most of our test cases. It is important to note
that without the bounds to determine the search
window, the knee finding would be much more difficult
since in the region of valid singular values
there may be other “local” knees, especially in the
case where there are multiple motions of significantly
different magnitudes.
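
A knee finder restricted to a search window might look like the sketch below. The text does not say how curvature is measured, so the second difference of the log singular values is used here as a simple stand-in; that choice, the name `knee_index` and the sample values are all assumptions. Whether the returned index or its predecessor should be read as the rank is a convention this sketch leaves open.

```python
import math

def knee_index(sing, lo, hi):
    """1-based i in [lo, hi] that maximises the second difference of
    log(sing), used here as a discrete stand-in for curvature."""
    logs = [math.log(s) for s in sing]
    best_i, best_c = None, -math.inf
    # The second difference needs a neighbour on each side.
    for i in range(max(lo, 2), min(hi, len(sing) - 1) + 1):
        c = logs[i - 2] - 2.0 * logs[i - 1] + logs[i]
        if c > best_c:
            best_i, best_c = i, c
    return best_i

# Hypothetical singular values: three large ones, then a noise floor.
sing = [100.0, 40.0, 20.0, 0.5, 0.4, 0.35, 0.3]
print(knee_index(sing, 2, 6))  # knee where the curve flattens out
print(knee_index(sing, 5, 6))  # a narrower window changes the answer
```

The second call illustrates the dependence on the window: the result is only as good as the bounds that define the search range.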

For our 2D example, the 8 largest singular values
were 13820.38, 51.91, 16.24, 3.51, 1.96, 1.26, 1.14,
and .96. Just looking at the above numbers it
would be hard to determine whether there were one or
two motions without knowledge of the expected
noise and the bounds on $\mathcal{R}$. The algorithm for
determining $\mathcal{R}$ determines the bounds 4 and 34 and
the knee finder determines that there are 5 nonzero
singular values. After segmentation the algorithm
found one cluster had 3 nonzero singular values,
the other 2 nonzero singular values.

To give you an idea of the complexity of the task
of determining the number of significant singular
values, consider the example shown in Figure 4.
Here we have tracks over 40 frames of four motions
with 10 points each. The lowest solid line shows the
singular values without noise. In this case $\mathcal{R}(I)$
is obvious since the nonsignificant singular values
are numerically equal to zero (and hence cannot
be plotted here since the plot is logarithmic). The
other lines show the singular values when white
noise is added with standard deviation .1, .2, .3, .4
and .5. Notice there are several “knees” and when
the noise level is too high the last knee becomes
indistinguishable from noise. If we did not have
bounds to limit the search for $\mathcal{R}(I)$ it would be
very difficult to determine the correct number of
significant singular values.

3.6  Discussion  of  cluster  analysis 

First, we point out that the right vector associated
with the largest singular value, call it $r_1$,
need not be the vector which gives the best segmentation.
If the two motion parameters are sufficiently
mixed together in a particular dimension,
but separated in another, the row associated with a
smaller singular value may actually give a cleaner
segmentation. For a large number of motions it
is often the case that a row $r_i$ will allow one to
easily segment out a single motion, while lumping
the remaining $N - 1$ motions into a single cluster.
Therefore, it is useful to consider numerous $r_i$'s.

Figure 4: Here we see the singular values for a
synthetic 2D case as the standard deviation of additive
white noise is increased from 0 to .5. There
are 4 motions with 10 points each tracked over 40
frames.

This is really a general clustering problem, closely
related to clustering used in factor analysis. However,
because we have the ability to check a partition
of the data, we take a slightly more conservative
approach. Our goal is not to find all clusters
at once, but rather to proceed by breaking the
data up into 2 partitions and checking the break.
If it is good, then we recursively solve each of the
sub-problems. Thus there are three parts to our
clustering algorithm: initialization of clusters,
refinement of clusters, and splitting data and checking
clusters.

We perform the clustering as follows. First we
use the first right vector to break the
track matrix into two new input matrices. The
SVD of each of the proposed segments is computed
and we check the number of singular values as suggested
by Observation 5-2. If this check is satisfied
we continue segmentation on the new input matrices.
If it fails, we then try clustering on the second
right vector. If this fails we try the sum of the first
two vectors weighted by their singular values. Obviously
we could add other choices, but these have
been sufficient for now.

For the 1D clustering problem initialization is
straightforward. We know we are looking for 2
clusters, and we use the clusters with maximal separation
for initialization. In particular, we compute
the nearest neighbor distance for each point
and use the largest distance as the breakpoint between
the clusters. For the 2D clustering, we initially
cluster on the maximal separation of the first
right vector.
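
For the 1D case, the initialization can be sketched as a split at the largest gap between neighbouring values of a right vector. This is a simplified reading of the nearest-neighbor rule described above, not the authors' code; `split_at_largest_gap` and the sample entries are hypothetical.

```python
def split_at_largest_gap(values):
    """Split point indices into two groups at the widest gap between
    consecutive sorted values. Returns (low_group, high_group)."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    gaps = [values[order[j + 1]] - values[order[j]]
            for j in range(len(order) - 1)]
    cut = max(range(len(gaps)), key=lambda j: gaps[j]) + 1
    return order[:cut], order[cut:]

# Hypothetical entries of a right singular vector, one per track.
entries = [0.31, -0.20, 0.28, -0.25, 0.35, -0.22]
low, high = split_at_largest_gap(entries)
print(sorted(low), sorted(high))
```

The two index lists can then be used to pull the corresponding columns out of the track matrix for the rank check.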

Given the initial clusters, we refine them, attempting
to minimize the RMS distance of each
point from the cluster center. An algorithm for this
is detailed in [Duda and Hart, 1973, Ch. 6]. Applying
this iterative algorithm results in 2 clusters
and a measure of compactness (root mean square
distance to cluster center) for each cluster. In most
of our examples, the number of iterations is 1, i.e.
the initial clustering is already minimal. In the
very noisy cases however, the initial clustering is a
poor approximation and the iterative improvement
offered by this approach can significantly improve
the quality of segmentation. The squared error
nature can, at times, cause a few outliers to be
included in the wrong cluster. The issues of normalization
in clustering arise here (e.g. do we
simply use $r_i$?), see [Duda and Hart,
1973, pp. 216 and pp. 224]. Currently we do not
normalize.
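
The refinement step can be sketched as a basic two-cluster iterative reassignment on 1D values. The procedure in [Duda and Hart, 1973] may differ in detail; the code below is only a minimal illustration, and `refine_two_clusters` and its inputs are hypothetical.

```python
def refine_two_clusters(values, labels, max_iter=100):
    """Reassign each value to the nearer of two cluster means until the
    labels stop changing. Returns (labels, centers, iterations)."""
    labels = list(labels)
    centers = []
    for iteration in range(1, max_iter + 1):
        centers = []
        for c in (0, 1):
            members = [v for v, l in zip(values, labels) if l == c]
            if not members:
                raise ValueError("a cluster became empty")
            centers.append(sum(members) / len(members))
        new_labels = [0 if abs(v - centers[0]) <= abs(v - centers[1]) else 1
                      for v in values]
        if new_labels == labels:
            return labels, centers, iteration
        labels = new_labels
    return labels, centers, max_iter

values = [-0.25, -0.22, -0.20, 0.28, 0.31, 0.35]
# A good initial split converges immediately.
print(refine_two_clusters(values, [0, 0, 0, 1, 1, 1])[2])
# A poor initial split needs one reassignment pass.
print(refine_two_clusters(values, [0, 0, 0, 0, 1, 1])[2])
```

When the initial split is already the best one, the loop stops on its first pass, matching the observation that most examples need a single iteration.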

Figure 5 shows the first two components of R
used in clustering on the 2D example. The points
are shown with labels so you can see which points
correspond to which motions. Note that the first
right vector (associated with the first singular value)
correctly classifies the motions, resulting in the
aforementioned segmentation.

4  Initial  Experimentation 

Our initial experiments have been very promising.
Figure 6 illustrates our method applied to
a dataset with tracks from four different motions.
The epi-image was constructed in a similar fashion
to our previous 2D example with real data except
that additionally, we combined epi-images acquired
at different time periods in order to have an example
with more motions. The tracks from each
motion are contiguous and can be distinguished
using the bottom image in this figure. Here the
tracks from two different motions can be clearly
discerned; the tracks not in this image make up
the other two motions. In Figure 7, we show the
results of clustering using the 2nd right vector.
Our algorithm segmented the points marked
with “x” from those with the “o”. The latter correspond
to the motion on the right in the bottom
image of Figure 6. However, it is clear from this
plot that it is possible to segment all four motion
components just using the first two right vectors
of the original SVD, without further recursion.

Figure 5: Here we see the clustering for each track
shown in Figure 1 using its entry in the first right
singular vector as the Y coordinate, and its entry in
the second right vector as the X coordinate. Each
point is labeled with the motion which generated
it.


Figure  6:  Top  shows  tracks  found  for  59  points  (4 
motions)  over  100  frames.  The  bottom  shows  the 
tracks  of  two  clearly  distinguishable  motions. 


Figure 7: Here we see the clustering of the first 2
right singular vectors for a realistic case with four
motions. The first segmentation splits the tracks
labeled with an “x” from those with an “o”; the
latter correspond to the motion given by the tracks
shown on the right in the bottom of Figure 6.

We have also experimented with synthetic data
using additive noise, $N(0, \mathit{sd})$, displacing each track
point. We have been able to segment two motions
with complete accuracy for noise levels up
to 50% of the point value. The results of
segmenting a difficult 3D synthetic example with
three motions can be seen in Figure 8. (Points
are labeled with their corresponding motion.) The
figure shows a plot of the first two right singular
vectors for each of 3 motions (10 random (overlapping)
shape points per motion, 50 frames of motion).
The first pass of the algorithm split the motion
into two groups, one containing motion 1 and
the point labeled 3', and the second the remaining
points. Initially it found 9 singular values, and
after segmentation there were 4 and 6 singular values.
The group with 6 singular values (motions 2
and 3) was further decomposed into its correct
motion components.

Figure 8: Here we see the clustering of the first 2
right singular vectors for 3 synthetic motions. Point
3' is mistakenly clustered in with the data from
motion 1.

This example also shows why it might be useful
to disregard the cluster labels for points on the
fringes of the cluster and then add them into the
cluster after recomputing the SVD. One way to
do this would be to split the tracks conservatively
into two groups $I_1$ and $I_2$ (more than $\mathcal{R}(I)$ points
each), and compute the SVD of each. Then for any
track $t_i$ that was on the fringe, compute $w = L_k^T t_i$.
This should have $w_j \approx 0,\ \forall j > \mathcal{R}(I_k)$ if $t_i$ is in the
same subspace. While it sounds promising, this
technique has yet to be implemented and tested.
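
Since the text notes that this idea was not implemented, the following is only a sketch of how such a fringe test could look, under the assumption of noise-free synthetic tracks. A track is projected onto the left singular vectors of a group, and the fraction of its norm lying outside the first few vectors is reported. The function name `out_of_subspace_fraction` and all data are hypothetical.

```python
import numpy as np

def out_of_subspace_fraction(group, rank, track):
    """Fraction of the track's norm outside the span of the first `rank`
    left singular vectors of `group` (frames x points)."""
    left, _, _ = np.linalg.svd(group, full_matrices=True)
    w = left.T @ track
    return float(np.linalg.norm(w[rank:]) / np.linalg.norm(track))

rng = np.random.default_rng(3)
frames = 20
motion_a = rng.standard_normal((frames, 3))
motion_b = rng.standard_normal((frames, 3))

def shape_points(n):
    """Random 2D points in homogeneous form (3 x n)."""
    return np.vstack([rng.standard_normal((2, n)), np.ones((1, n))])

group = motion_a @ shape_points(8)             # tracks of motion A only
same = (motion_a @ shape_points(1)).ravel()    # fringe track from motion A
other = (motion_b @ shape_points(1)).ravel()   # fringe track from motion B

print(out_of_subspace_fraction(group, 3, same))
print(out_of_subspace_fraction(group, 3, other))
```

With noisy tracks the first value would not be exactly zero, so a threshold tied to the noise estimate would be needed in practice.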

5  Critical  analysis 

While the technique presented in this paper does,
in our opinion, a very good job at segmentation of
multiple motions, there are still a few difficulties
with the approach. Most of the difficulties are actually
problems with the underlying factorization
technique, and may, we hope, be overcome with
additional effort.

We present a check-list of the advantages (+)
and disadvantages (-) of this approach. Those aspects
which are both pros and cons will be marked
with ±.

+ The segmentation approach using the R component
of the SVD appears to be extremely powerful.

+ The factorization technique simultaneously provides
shape and motion.

+ Shape is represented relative to object centroid
and hence is stable with respect to short baselines
in motion.

+ A technique for bounding and approximating
the numerical rank $\mathcal{R}(I)$ has been developed.

+ The segmentation approach comes with a theoretically
derived check on the quality of the segmentation
(assuming linearly independent motions
and that $\mathcal{R}(I)$ is correct).

± The segmentation method is quite robust w.r.t.
positional noise.

± Cost is $O(mn^2 + n^3)$ (with a reasonable constant).
If the number of points and frames is
not too large this is quite reasonable. For example,
for a 200 x 100 3D data set (hence a 400 x 100
input matrix) the approach takes ≈ 64 CPU
seconds on a 12Meg SparcStation 1.

- The representation assumes orthographic projection.

- The segmentation method can have difficulties
when there are nearly dependent motions.

6  Conclusions  and  Future  Work 

This  paper  generalized  the  factorization  approach 
for  simultaneous  recovery  of  motion  and  shape  to 
handle  multiple  motions.  It  presented  a  technique, 
using  the  information  available  from  factorization, 
for  the  segmentation  of  multiple  motions.  The 
method  includes  a  way  to  bound  the  numerical 
rank  of  the  input  and  proves  how  to  use  this  to 
check,  in  general,  that  the  segmentation  is  valid. 
The  segmentation  technique  was  demonstrated  on 
numerous  real  and  synthetic  examples. 

There is considerable future work to be done
in the area of factorization-based motion analysis.
For example, consider the many technical reports
on which Tomasi and Kanade appear to be working
[Tomasi and Kanade, 1991]. Much of that future
work applies here as well. There are a few
things on which we expect to continue working as
they have particular impact on our segmentation
approach. These areas include a more thorough
error analysis of the segmentation including a better
analysis of different approaches to clustering
(including removing the fringe points), comparison
with previous motion segmentation techniques,
and extending the factorization approach to handle
partial tracks and tracks distorted by perspective.

Abstract

Recovering structure from motion, even using information
from multiple image frames, is difficult, in part
because motion error can introduce large, correlated
errors in the structure estimate. A method is proposed
for recursively recovering structure from motion
that can deal with this problem. Encouraging results
on real images and synthetic data are presented.

1  Introduction 

Two frame structure from motion is known to be inaccurate.
The reasons for this are familiar: for two frames,
each structure value is determined by a single measurement
which is unreliable for distant points, and which
is strongly affected by small errors in the motion. The
natural remedy to this problem is to refine the structure
estimation by combining measurements from many
frames [2] [3] [8] [11] [12] [13] [16]. In earlier work [14]
[18], we have presented an implementation of this procedure
which differs from previous approaches in that the
motion error is explicitly taken into account. In this paper,
a detailed theoretical derivation of the algorithm is
presented, together with new experimental results on a
real image sequence.

The difficulty of multiframe structure from motion
has several sources. First, even assuming the motion
is known, determining the structure of a point from two
noisy images is a non-linear problem, and biased in the
depth estimate. This problem was studied experimentally
in [17]; the results indicate that in fact the non-linearity
of the pure structure measurement is not a serious
problem: it can be compensated for by a multiplicity
of measurements from different camera positions.

Another, probably more serious, difficulty is the motion
error. It is known that small errors in the rotation
can introduce relatively large errors in depth measurements
[5]. Moreover, motion error introduces strong
correlations in the structure errors. For instance, if the
translation is identified incorrectly, then the measured
positions of all points will be displaced away from their
actual ones in approximately the same direction: their
position errors will be correlated. An analogous result
holds for rotation error. Conversely, the correlations in
the structure errors are the record of the motion error.
In order to include the potentially large effects of motion
error, and to compensate for them, a record of these correlations
must be maintained in the current error estimate.
This information can then be used by a recursive
algorithm for updating the old structure estimate based
on new image information. This explicit inclusion of motion
error is the main new contribution of this paper.

2  Overview  of  the  Algorithm 

The camera is assumed to be navigating in an unknown,
fixed environment, consisting of isolated 3D points. The
2D frame-to-frame correspondences of these points are
assumed given. The goal is to recursively determine the
locations of the 3D points.

An obvious strategy is, for each new image, to combine
the old structure estimate with the information contained
in the new image, weighted by their respective
uncertainties. However, this shape-from-pose approach
is potentially unstable, since initially the structure estimate
is quite inaccurate. Instead, at each iteration,
Horn's relative orientation algorithm [9] is used to determine
a new structure estimate from the most recent
pair of images, which is then fused with the previous
estimate. Since Horn's algorithm gives a relatively robust
structure measurement, this approach can recover
from errors in the initial cumulative estimate more easily
than the shape-from-pose method. Also, because of the
simplicity of Horn's objective function, an approximate
error analysis is relatively tractable. However, other 2
(or 3) frame motion algorithms could be used equally
well.

The algorithm is as follows. At each time step, Horn's
algorithm is used to recover the structure and relative
orientation for the most recent image pair. Also, the
structure error of this algorithm is estimated, consisting
of the complete covariance matrix including cross-correlations
between different 3D points. These cross-correlations
represent the effects of the uncertainty in
the relative orientation, i.e. the frame-to-frame motion
error, as stated above.

The output of Horn's algorithm is then fused with the
previous structure estimate. To do this, both the old
and new structure estimates, and their estimated errors,
must be transformed into a common coordinate system.
This transformation is not exactly known and induces
additional error. The coordinate system chosen is that
of the moving camera; the fused structure is maintained
in this system, since this is the relevant information for
the robot. The fusing is done by the standard Kalman
filter method.
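
The fusion step can be illustrated with the textbook Kalman update for two estimates of the same state. This is a generic sketch, not the authors' implementation: it assumes both estimates are already expressed in a common coordinate system, that the new estimate observes the state directly (identity measurement matrix), and that the full covariance matrices, including cross-correlations between points, are available. The name `fuse_estimates` and the numbers are hypothetical.

```python
import numpy as np

def fuse_estimates(x_old, p_old, x_new, p_new):
    """Fuse two estimates of the same state vector, weighted by their
    covariance matrices (standard Kalman update, identity measurement)."""
    gain = p_old @ np.linalg.inv(p_old + p_new)
    x = x_old + gain @ (x_new - x_old)
    p = (np.eye(len(x_old)) - gain) @ p_old
    return x, p

# Two equally uncertain estimates of a two-component state.
x_old = np.array([0.0, 0.0])
x_new = np.array([2.0, 4.0])
x_fused, p_fused = fuse_estimates(x_old, np.eye(2), x_new, np.eye(2))
print(x_fused, np.diag(p_fused))
```

With equal covariances the fused estimate is the midpoint and its variance is halved; off-diagonal covariance terms would let an observation of one component correct the other.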

In principle, the structure can be computed with arbitrary
accuracy given sufficiently many image frames.
More precisely, what can be computed accurately is the
absolute structure or shape [19], i.e. the structure in
an object-centered, camera-independent coordinate system.
On the other hand, the current camera pose, and
therefore the depths as measured in the current camera
coordinate system, are determinable only with a limited
accuracy. Previous recursive algorithms for multiframe
structure from motion had limited accuracy of shape or
structure recovery, because of the limit set by the pose
error. In contrast, because the algorithm described here
explicitly takes into account the uncertainty in the camera
pose, it is potentially capable of determining the
shape with an accuracy greater than that with which
the pose or motion is recovered.


3  Estimating  the  Motion  Error 

To compute the structure error of Horn's algorithm, the
error in relative orientation must be computed as an intermediate
step. Horn's algorithm computes the motion
between the previous (left) and most recent (right) camera
frames by minimising the objective function:

$$E = \sum_i \big(\mathbf{b} \cdot (\mathbf{r}'_{l,i} \times \mathbf{r}_{r,i})\big)^2. \qquad (1)$$

Here the notation of [9] has been adopted. The error
in this motion estimate is computed by linearising, as
is standard. Thus, the derivatives of the motion parameters
with respect to the image coordinates must be
calculated. W is used to represent the five motion parameters:
two for the translation direction, and three
for the rotation. The translation magnitude is omitted
since it cannot be recovered due to the well-known scale
ambiguity under arbitrary motion. Also, the image coordinates
of the two images are represented by a vector
V of length 4m, where m is the number of 3D points.

The recovered values of the motion parameters correspond
to a minimum of the objective function E, and
therefore $\partial E/\partial W = 0$. After a perturbation in the image
coordinates, the perturbed motion parameters recovered
by Horn's algorithm minimise the new E. Thus
the derivative above is again zero when evaluated at the
new values for V and W: it has not changed in value.
This implies that:


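$$\frac{\partial^2 E}{\partial W\,\partial V}\,dV + \frac{\partial^2 E}{\partial W\,\partial W}\,dW = 0 \qquad (2)$$

for the small perturbations dV and dW. Defining

$$N = \frac{\partial^2 E}{\partial W\,\partial V}, \qquad M = \frac{\partial^2 E}{\partial W\,\partial W}, \qquad (3)$$

and assuming M has an inverse, equation 2 can be
rewritten as

$$\frac{\partial W}{\partial V} = -M^{-1} N. \qquad (4)$$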

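In terms of this $5 \times 4m$ matrix, the motion error covariance
is:

$$\mathrm{Cov}(W) = \mathrm{E}\{dW\,dW^T\} \approx \frac{\partial W}{\partial V}\,\mathrm{E}\{dV\,dV^T\}\,\frac{\partial W}{\partial V}^T. \qquad (5)$$

Assuming that the image noise at each pixel is independent
and has the same standard deviation $\sigma$, the
covariance of dV is proportional to the identity matrix,
and the linearised estimate of the motion error is the
$5 \times 5$ matrix:

$$\sigma^2\,\frac{\partial W}{\partial V}\,\frac{\partial W}{\partial V}^T. \qquad (6)$$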
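
The linearised propagation can be written in a few lines. The sketch below assumes the second-derivative matrices M (5 x 5) and N (5 x 4m) are already available; the placeholder matrices used here are hypothetical and carry no physical meaning, and `motion_covariance` is an assumed name.

```python
import numpy as np

def motion_covariance(m_mat, n_mat, sigma):
    """Linearised motion covariance sigma^2 * J @ J.T, with
    J = dW/dV = -M^{-1} N."""
    jac = -np.linalg.solve(m_mat, n_mat)   # avoids forming M^{-1} explicitly
    return sigma ** 2 * jac @ jac.T

# Placeholder matrices for two 3D points (4m = 8 image coordinates).
m_mat = 2.0 * np.eye(5)
n_mat = np.arange(40, dtype=float).reshape(5, 8)
cov = motion_covariance(m_mat, n_mat, sigma=0.5)
print(cov.shape)
```

Solving the linear system rather than inverting M is a numerical design choice of this sketch; it gives the same result when M is well conditioned.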


8.1  Determining  the  M  matrix 
First, the motion parameters incorporated in W must be specified. Let $\mathbf{b}$ be the unit translation vector, and let $\mathbf{b}_e$ be the translation actually recovered by Horn’s algorithm. Then an arbitrary $\mathbf{b}$ near $\mathbf{b}_e$ can be represented as:

$$\mathbf{b} = \mathbf{b}_T + \sqrt{1 - |\mathbf{b}_T|^2}\; \mathbf{b}_e, \qquad (7)$$

where $\mathbf{b}_T$ is in the plane perpendicular to $\mathbf{b}_e$. This representation is adequate because for this linearised analysis the motion error is assumed to be small. $\mathbf{b}_T$ is represented explicitly by its projection on two perpendicular axes $\hat{\mathbf{e}}_1$ and $\hat{\mathbf{e}}_2$ in the plane normal to $\mathbf{b}_e$.

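The rotation R is represented using quaternions, but in a more convenient notation. Let the axis of rotation be $\hat{\mathbf{s}}$, and the rotation angle be $\theta$. Then the unit quaternion is:

$$R = \left(\cos\frac{\theta}{2},\; \hat{\mathbf{s}} \sin\frac{\theta}{2}\right) = (R_0, \mathbf{R}), \qquad (8)$$

with $R_0^2 + |\mathbf{R}|^2 = 1$. The rotation is represented by the three-dimensional vector $\mathbf{R}$. The rotation matrix is given by:

$$\mathbf{r}'_{l,i} = R(\mathbf{r}_{l,i}) = (1 - 2|\mathbf{R}|^2)\, \mathbf{r}_{l,i} + 2\sqrt{1 - |\mathbf{R}|^2}\, (\mathbf{R} \times \mathbf{r}_{l,i}) + 2(\mathbf{R} \cdot \mathbf{r}_{l,i})\, \mathbf{R}. \qquad (9)$$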

The vector $\mathbf{R}$ is represented in terms of its projections along three orthogonal 3D directions $\hat{\boldsymbol{\eta}}_1$, $\hat{\boldsymbol{\eta}}_2$ and $\hat{\boldsymbol{\eta}}_3$.
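As a check on this parameterisation, the following sketch rotates a vector using only the three-dimensional vector part of a unit quaternion, with the scalar part taken as the non-negative root sqrt(1 - |R|^2). This root restricts the rotation angle to at most 180 degrees. The function names are hypothetical.

```python
import math


def cross(a, b):
    """Cross product of two 3-vectors."""
    return (a[1] * b[2] - a[2] * b[1],
            a[2] * b[0] - a[0] * b[2],
            a[0] * b[1] - a[1] * b[0])


def dot(a, b):
    """Dot product of two 3-vectors."""
    return sum(x * y for x, y in zip(a, b))


def rotate(R, r):
    """Rotate r by the unit quaternion (sqrt(1 - |R|^2), R)."""
    n2 = dot(R, R)
    if n2 > 1.0:
        raise ValueError("|R| must not exceed 1 for a unit quaternion")
    r0 = math.sqrt(1.0 - n2)   # scalar part, cos(theta/2) >= 0
    c = cross(R, r)
    d = dot(R, r)
    return tuple((1.0 - 2.0 * n2) * r[i] + 2.0 * r0 * c[i] + 2.0 * d * R[i]
                 for i in range(3))


# A 90 degree rotation about the z axis: R = s * sin(theta / 2).
R = (0.0, 0.0, math.sin(math.pi / 4))
rotated = rotate(R, (1.0, 0.0, 0.0))   # approximately (0, 1, 0)
```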

The  first  partial  derivatives  are: 


aE 

a(bT  •  e) 
aE 

aiRff) 


-2j]f?ib .rr. xC^^ri.), (10)
where the derivative of the rotated ray with respect to the rotation parameters is obtained by differentiating eq. 9.

Evaluating the second partial derivatives at the unperturbed solution, where $\mathbf{b} = \mathbf{b}_e$ and $\mathbf{b}_T = 0$, gives:


a^E 


5(b7<  •  ?)5(bT  •  fj) 


=  2X1  («  •  ^  (r{.  X  r,,)
-?  ^(b«  •  (p;.  X  ,  (11)
i

-2  ^(be  •  (?;.  X  ?,,))(?•  (?,,  X  C^^(r,J)),  (12) 

t
VI  -  W"

-2-r===(^  X  r,,)  -  2  -^"  *)(*  ^ 

Vi-|R|2  (i-|Rr)5 

~2-fi===(«  X  PiJ  +  2(?  - P|,€  +  f  •  p,,^)})].  (13) 

Vi-l»P 

These  equations  determine  the  elements  of  M.  M  could 
also  be  calculated  in  the  spirit  of  Horn’s  paper  [9],  by 
summing  first  over  all  image  points  prior  to  calculating 
the  derivatives. 

3.2  Determining  the  N  matrix 

To determine the N matrix, the partial derivative of the matrix $(\partial E/\partial W)$ with respect to the vector V, the set of image coordinates, should be determined. Define the derivative of the unit ray $\mathbf{r}$ (either left or right), with respect to its image coordinates $V_i$, as the vector function $\mathbf{D}(\mathbf{r}, V_i \cdot \hat{\mathbf{p}})$, where $\hat{\mathbf{p}}$ denotes an arbitrary direction in the image plane, and $f$ is the focal length. Then the derivatives are:

a^E 

d(bT  •^)5(V,,<.^  ~ 

2(?.  (?,,  X  flJUb,  ■  (?,,  X  E(D(r,„  V,,,  ■  ^))] 

+2[£,  .  (?,,  X  p{J][?.  (p,,  X  R(D(P,„  V,,, .  p)))].  (15)
2[?.  (p{,  X  D(rr,i,  K,,  •  p))][be  •  (p|,  X  P,J]

+2l^.  (fi^  X  r,J][b.  •  (rl^  X  D(r,,*,  K,,  ■  ^)],  (16)
2[b,  •  (r,,  X  R(D(p,„  V.,,  ■  p)))][be  .  (p,,  X  C^.>(r,J)]

+2[b,  •  (r,,  X  r5J][b. . (f,,  X  ^D(p,„  Vi,<  -p)))],  (17) 
and 

a^E  _ 

d(R  .q)5(F,.i  p)  " 

2[b. .  (?;,  X  D(r„i,  K., .  p))][b.  .  (C^;^,J  X  p,J] 

+2[b,  ■  (?;,  X  r,J][b. . (C^  ^p,J  X  D(p,,i,  K,,-p))]  (18) 

The above equations determine N. From M and N, the motion error can be calculated as described earlier.

4  Determining  the  Structure  Error 

To  calculate  the  estimated  structure  error  due  to  Horn’s 
algorithm,  the  matrix  of  partial  derivatives  of  the  struc¬ 
ture  with  respect  to  the  image  coordinates  is  calculated, 
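just as for the motion. The equation for the location of the i-th 3D point, $\mathbf{p}_i$, is:

$$\mathbf{p}_i = |\mathbf{b}|\; \frac{(\mathbf{b} \times R(\mathbf{I}_{l,i})) \cdot (R(\mathbf{I}_{l,i}) \times \mathbf{I}_{r,i})}{|R(\mathbf{I}_{l,i}) \times \mathbf{I}_{r,i}|^2}\; \mathbf{I}_{r,i}, \qquad (19)$$

where the 3D vector $\mathbf{I} = (I_x, I_y, f)$ gives the coordinates of the point on the image plane. Recall that $|\mathbf{b}|$, the magnitude of the translation step, is not determined by Horn’s algorithm due to the overall scale ambiguity.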
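The triangulation step can be sketched as follows. The sketch assumes the convention of Horn’s relative orientation, in which the rotated left ray and the right ray satisfy alpha * r_l' = b + beta * r_r, so that the point in the right camera frame is beta * r_r. The function name `triangulate` and the argument layout are hypothetical, and `b` here is the full baseline vector (magnitude times direction).

```python
def cross(a, b):
    """Cross product of two 3-vectors."""
    return (a[1] * b[2] - a[2] * b[1],
            a[2] * b[0] - a[0] * b[2],
            a[0] * b[1] - a[1] * b[0])


def dot(a, b):
    """Dot product of two 3-vectors."""
    return sum(x * y for x, y in zip(a, b))


def triangulate(b, left_rotated, right):
    """Return the 3D point in the right camera frame.

    b            -- baseline vector (assumed known, including its magnitude)
    left_rotated -- left image vector (x, y, f) after applying the rotation
    right        -- right image vector (x, y, f)
    """
    n = cross(left_rotated, right)
    denom = dot(n, n)
    if denom == 0.0:
        raise ValueError("rays are parallel; depth is undetermined")
    beta = dot(cross(b, left_rotated), n) / denom
    return tuple(beta * c for c in right)


# Synthetic check with no rotation: a point at (0.5, 0.2, 10) in the left
# frame lies at (-0.5, 0.2, 10) in the right frame when b = (1, 0, 0).
point = triangulate((1.0, 0.0, 0.0), (0.05, 0.02, 1.0), (-0.05, 0.02, 1.0))
```

The result does not depend on how the two image vectors are scaled, so either unit rays or image-plane coordinates (x, y, f) can be passed in.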

$\mathbf{p}_i$ depends on the image coordinates partly through the motion parameters W, which are themselves functions of these coordinates. The partial derivatives of $\mathbf{p}_i$ can therefore be computed by the chain rule. Let the collection of 3D structure estimates be represented by the state vector $P_n = (\mathbf{p}_1, \ldots, \mathbf{p}_m)$, which has length 3m. Then:

$$dP_n = \frac{\partial P_n}{\partial W}\, \frac{\partial W}{\partial V}\, dV + \frac{\partial P_n}{\partial V}\, dV. \qquad (20)$$

Since the matrix $(\partial W/\partial V)$ has already been computed above, only $(\partial P_n/\partial W)$ and $(\partial P_n/\partial V)$ need be computed. These are, respectively, 3m x 5 and 3m x 4m matrices, and are given by:


api  _}bi(exX(Ir,))(E(I,,)xIr,) 


a(bTf) 


Ir,  (21)
lb|(bxR(?)).(R(Ii.)xIr.)^
|b|(bxR(Ii.)).(R(?)xIp.)^

|R(Il.)  X  Ip.p 

(i?(bj  X  X  1,J 

|bl(bxR(Ii.)).(R(Ii.)xIp)

Finally, the estimated covariance of the structure can be computed as for the motion.

$$\frac{dP_n}{dV} = \frac{\partial P_n}{\partial W}\, \frac{\partial W}{\partial V} + \frac{\partial P_n}{\partial V}. \qquad (25)$$

As before, $\mathrm{Cov}(P_n) = E\{dP_n\, dP_n^T\}$, implying that:

$$\mathrm{Cov}(P_n) = \frac{dP_n}{dV}\, \mathrm{Cov}(V)\, \frac{dP_n}{dV}^T = \sigma^2\, \frac{dP_n}{dV}\, \frac{dP_n}{dV}^T. \qquad (26)$$

5 Fusing the New and Old Shape Estimates

The aim is to combine the current structure estimate $P_n$ with the previous estimate of the shape, $P^+_{n-1}$, in order to produce a new shape estimate in the current camera frame. We use shape rather than structure in referring to the combined estimate to emphasise that it is not intrinsically limited in accuracy by the camera pose estimation error, as discussed in section 2. This combination requires that the past estimate be moved to the current coordinate frame using the calculated motion parameters. Also, the error in the combined estimate should be updated to this frame.

The moved shape estimate can be written as:

$$P_n^- = C_n (P^+_{n-1}) - B_n, \qquad (27)$$

where $C_n$, a 3m x 3m matrix, is given by:

$$C_n = R_n \begin{pmatrix} I_3 & 0 & \cdots \\ 0 & I_3 & \cdots \\ \cdots & 0 & I_3 \end{pmatrix}. \qquad (28)$$

The 3 x 3 matrix $R_n$ is the estimated rotation between the coordinate frames, and $I_3$ is the 3 x 3 identity matrix. Similarly, $B_n$, a 3m x 1 matrix, is given by

$$B_n = (\mathbf{b}_n^T, \mathbf{b}_n^T, \ldots, \mathbf{b}_n^T)^T, \qquad (29)$$


where $\mathbf{b}_n$ is the translation between the camera positions. Since the transition matrices $C_n$ and $B_n$ are themselves noisy, this situation differs from the standard recursive measurement process to which the Kalman filter is normally applied.

The expected error of the moved estimate in the current coordinate system is again computed by linearising:


dPB  = 


jpF  .  ^Pn 


Define  the  3m  x  5  matrix  An  by: 


f^Wnl 

hVn  J  • 


Then  the  covariance  is: 

$$\mathrm{Cov}(dP_n^-\, dP_n^{-T}) = C_n\, \mathrm{Cov}(dP^+_{n-1}\, dP^{+T}_{n-1})\, C_n^T + A_n\, \mathrm{Cov}(dV_n\, dV_n^T)\, A_n^T. \qquad (32)$$

Here it has been assumed that $V_n$ and $P^+_{n-1}$ are statistically independent; most likely, after a few frames, the correlation between the left image and the previous combined estimate is small and the independence assumption is valid.

Finally, to determine the new estimate $P_n^+$, the result of Horn’s algorithm, $P_n$, and the moved estimate $P_n^-$ must be combined. This is straightforward except for the overall scale ambiguity in recovering motion and structure. Ideally, the scale should be removed from the state vector; currently, the scale of the translation step is fixed to be its exact value as obtained from ground truth. Then, the standard Kalman filter result [7] for the fused estimate is:

$$P_n^+ = \mathrm{Cov}(P_n^+)\left[\mathrm{Cov}(P_n^-)^{-1} P_n^- + \mathrm{Cov}(P_n)^{-1} P_n\right], \qquad (33)$$

where $\mathrm{Cov}(P_n^+)$ is the covariance of the combined estimate at this camera position:

$$\mathrm{Cov}(P_n^+) = \left[\mathrm{Cov}(P_n^-)^{-1} + \mathrm{Cov}(P_n)^{-1}\right]^{-1}. \qquad (34)$$

$\mathrm{Cov}(P_n)$, the estimated error of Horn’s algorithm in determining structure, has already been calculated in Section 4. Thus, all the ingredients have been assembled for computing the fused shape.
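The fusion step is the usual inverse-covariance weighting of two independent estimates, which can be sketched as below. The names `inverse`, `matvec` and `fuse` are hypothetical, and the small Gauss-Jordan inverse is only an illustration; a real implementation would use a numerical library.

```python
def inverse(a):
    """Invert a square matrix (list of rows) by Gauss-Jordan elimination."""
    n = len(a)
    aug = [[float(x) for x in a[i]] + [1.0 if i == j else 0.0 for j in range(n)]
           for i in range(n)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(aug[r][col]))
        if abs(aug[pivot][col]) < 1e-12:
            raise ValueError("covariance matrix is singular")
        aug[col], aug[pivot] = aug[pivot], aug[col]
        scale = aug[col][col]
        aug[col] = [x / scale for x in aug[col]]
        for r in range(n):
            if r != col:
                factor = aug[r][col]
                aug[r] = [x - factor * y for x, y in zip(aug[r], aug[col])]
    return [row[n:] for row in aug]


def matvec(a, v):
    """Matrix-vector product."""
    return [sum(x * y for x, y in zip(row, v)) for row in a]


def fuse(p_moved, cov_moved, p_new, cov_new):
    """Combine the moved estimate and the new two-frame estimate.

    Returns (fused state, fused covariance), assuming the two inputs are
    statistically independent.
    """
    info_moved = inverse(cov_moved)
    info_new = inverse(cov_new)
    n = len(p_new)
    info_sum = [[info_moved[i][j] + info_new[i][j] for j in range(n)]
                for i in range(n)]
    cov_fused = inverse(info_sum)
    rhs = [x + y for x, y in zip(matvec(info_moved, p_moved),
                                 matvec(info_new, p_new))]
    return matvec(cov_fused, rhs), cov_fused


# Two-component example: the first component of the moved estimate is less
# certain (variance 4) than that of the new estimate (variance 1).
p_fused, cov_fused = fuse([0.0, 2.0], [[4.0, 0.0], [0.0, 1.0]],
                          [5.0, 4.0], [[1.0, 0.0], [0.0, 1.0]])
```

In the example the fused first component is 4.0, pulled towards the more certain new measurement, and the second is the plain mean 3.0, since both inputs have equal variance there.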
6 Experiments

A  series  of  experiments  has  been  carried  out  using  the 
algorithm  described  above  for  one  synthetic  and  two  real 
image  sequences.  To  measure  how  accurately  shape  has 
been  determined,  a  procedure  is  employed  that  elimi¬ 
nates  the  effects  of  error  in  the  camera  pose  estimate.  We 
use  Horn’s  absolute  orientation  algorithm  [10]  to  map  the 
shape  estimate  into  the  coordinate  system  of  the  ground 
truth.  The  measure  of  shape  error  is  the  average  dis¬ 
tance  between  the  true  and  estimated  positions  of  the 
3D  points  after  this  registration.  Absolute  orientation 
is  also  carried  out  for  Horn’s  two  frame  reconstruction 
of  the  structure,  for  purposes  of  comparison  with  the 
present  algorithm. 

The  parameters  of  our  experimental  sequences  are 
shown  in  Table  1.  In  the  first  experiment,  a  synthetic 
image  sequence  was  used.  20  points  were  tracked  over  13 
image  frames,  with  depths  ranging  from  43  to  65  units. 
In a fixed coordinate system, the camera translated towards the scene in a zig-zag path around the fixed z-direction. The translation step varied in magnitude from 0.44 to 0.6, and the rotations were about the vertical (y) direction, and had magnitude 1.5°. In the camera coordinate system, the translation was always in the current z-direction. The image noise was Gaussian, with $\sigma$ = 0.3 pixels, modelling a flow error $\sigma$ = 0.6.

The  average  3D  error  for  the  recursive  algorithm,  dis¬ 
played  in  Table  2.1,  decreases  swiftly  and  essentially 
monotonically,  contrasting  with  the  2  frame  error  shown 
in the Table. These results were achieved even though the translational motion was largely into z, with a very small baseline to triangulate with. If one outlier point
is  excluded,  the  average  structure  error  for  our  algo¬ 
rithm  improves  significantly;  after  the  last  iteration,  it 
is  only  1.7  units.  Table  2.2  displays  the  depths  recov¬ 
ered  by  the  recursive  algorithm  after  the  last  frame  for 
all  points  prior  to  using  absolute  orientation.  Since  for 
this  sequence  the  motion  is  relatively  well  determined, 
the  depth  estimates  for  the  estimated  pose  correspond 
fairly well to the ground truth. The sizes of the transformations computed by absolute orientation are shown
in  Table  2.3.  Compared  to  the  two-frame  results,  a  much 
smaller  transformation  is  required  to  register  the  recur¬ 
sive  structure  estimate  with  the  ground  truth.  Note  that 
the  recursive  results  do  not  seem  to  be  strongly  affected 
by  the  large  error  produced  by  the  two-frame  measure¬ 
ment  in  iteration  8. 

The  second  experiment  employs  an  image  sequence 
previously  discussed  in  [15].  The  first  image  of  this  se¬ 
quence  is  displayed  in  Figure  1.  The  motion  consisted  of 
a  box  rotating  around  its  approximately  vertical  body 
axis  in  steps  of  about  3.6°,  with  a  stationary  camera. 
The average 3D error for the recursive algorithm decreases dramatically compared to the two-frame result
after  just  3-4  iterations  (Table  3.1).  This  fast  decrease 
can  be  attributed  to  correctly  combining  successive  mea¬ 
surements  to  obtain  an  effectively  wider  baseline.  Note 
that  the  errors  in  the  final  iterations  are  comparable  to 
the  accuracy  of  the  ground  truth,  estimated  to  be  about 
1.5 mm. This accuracy is high considering the large depths (approximately 600 mm).

In Table 3.2, the depths after registration are displayed for the last frame. There are no outliers. Finally, the sizes of the transformations for absolute orientation are shown in Table 3.3.

For  the  third  experiment,  the  rocket-field  image  se¬ 
quence  from  the  IEEE  motion  workshop  database  was 
used.  This  is  an  outdoor  sequence  produced  using  a 
camera  mounted  on  an  autonomous  vehicle;  the  first  im¬ 
age  is  displayed  in  Figure  2.  Ground  truth  was  available 
for  12  of  the  15  points  tracked  over  11  frames  [4].  Points 
were  mostly  tracked  by  hand  but  some  were  obtained  by 
running  Anandan’s  optical  flow  algorithm  [1].  The  robot 
motion  was  a  constant  forward  translation  (in  steps  of 
about  0.9  m)  and  very  small  rotations. 

In  Table  4.1  the  average  3D  distance  error  for  10  of 
the  12  tracked  points  is  displayed  without  registration; 
absolute  orientation  could  not  be  performed  since  there 
was  no  valid  closed  form  solution.  One  of  the  2  points 
excluded  was  too  far  away  (almost  four  times  as  far  as 
the  next  most  distant  point)  and  the  other  was  highly 
erroneous  as  it  remained  near  the  FOE.  For  comparison, 
the  results  of  simply  averaging  the  two-frame  structures 
are  also  shown.  The  averaging  was  done  recursively  by 
using  the  2-frame  motion  estimate  to  move  the  previous 
average  into  the  current  estimated  camera  frame,  where 
it  was  then  averaged  with  the  new  structure  estimate. 
The  percentage  errors  in  depth  are  displayed  in  Table 
4.2. 
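The blind-averaging baseline described above can be sketched as follows. The text does not state the weighting, so the equal-weight running mean used here is an assumption, as are the function names `move` and `blind_average`. The moved estimate follows the form of equation 27 (rotate, then subtract the translation).

```python
def move(points, R, b):
    """Move 3D points into the current camera frame: R p - b."""
    return [tuple(sum(R[i][j] * p[j] for j in range(3)) - b[i] for i in range(3))
            for p in points]


def blind_average(prev_avg, count, new_points, R, b):
    """Average the moved previous estimate with the new two-frame structure.

    prev_avg   -- running average from earlier frame pairs (list of 3-tuples)
    count      -- number of two-frame estimates already averaged
    new_points -- structure recovered from the current frame pair
    R, b       -- two-frame rotation (3x3, list of rows) and translation
    """
    moved = move(prev_avg, R, b)
    averaged = [tuple((count * m[i] + q[i]) / (count + 1) for i in range(3))
                for m, q in zip(moved, new_points)]
    return averaged, count + 1


identity = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]
avg, n_used = blind_average([(0.0, 0.0, 10.0)], 1, [(0.0, 0.0, 11.0)],
                            identity, (0.0, 0.0, 1.0))
```

Unlike the recursive algorithm, this baseline ignores both the structure covariance and the error in the motion used to move the previous estimate.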

Though  the  two-frame  error  fluctuates,  the  recursive 
error  falls  rapidly  and  consistently,  in  contrast  with  the 
results  reported  in  [3]  for  a  recursive  algorithm  which 
does  not  include  the  effects  of  motion  error.  After  ten 
frames,  it  is  better  by  a  factor  of  two  than  the  result 
of  blind  averaging.  This  is  so  despite  the  fact  that 
this  is  not  a  favorable  case  for  the  recursive  algorithm, 
since  the  translational  motion  into  the  scene  gives  simi¬ 
lar  rather  than  complementary  structure  estimates  from 
each  frame  pair,  as  compared,  for  example,  with  the  box 
sequence. 

References 

[1] P. Anandan, Measuring Visual Motion from Image Sequences, PhD Thesis, COINS Tech. Report TR 87-21, Univ. of Mass. at Amherst, MA, 1987.

[2] T. J. Broida and R. Chellappa, “Estimating the Kinematics and Structure of a Rigid Object from a Sequence of Monocular Images”, IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 13, no. 6, pp. 497-513, 1991.

[3] N. Cui, J. Weng and P. Cohen, “Extended Structure and Motion Analysis from Monocular Image Sequences,” Proceedings 3rd IEEE International Conference on Computer Vision, Osaka, Japan, 1990, pp. 222-229.

[4] R. Dutta, R. Manmatha and L. R. Williams, “A Data Set for Quantitative Motion Analysis,” CVPR, San Diego, California, pp. 159-164, 1989.




[5] R. Dutta and M. Snyder, “Robustness of Correspondence-Based Structure from Motion,” Proceedings 3rd IEEE International Conference on Computer Vision, Osaka, Japan, Dec. 1990.

[6]  O.  D.  Faugeras,  N.  Ayache,  and  B.  Faverjon, 
“Building  Visual  Maps  by  Combining  Noisy  Stereo 
Measurements,”  IEEE  International  Conference  on 
Robotics  and  Automation,  San  Francisco,  CA,  pp. 
1433-1438,  1986. 

[7] Technical Staff, The Analytical Sciences Corp., and A. Gelb, ed., Applied Optimal Estimation, MIT Press, 1986.

[8] J. Heel, “Dynamic Motion Vision,” Image Understanding Workshop, Palo Alto, CA, pp. 702-713, 1989.

[9]  B.  K.  P.  Horn,  “Relative  Orientation,”  Interna¬ 
tional  Journal  of  Computer  Vision,  Vol.  4,  pp.  59- 
78,  1990. 

[10]  B.  K.  P.  Horn,  “Closed  Form  Solution  of  Absolute 
Orientation  Using  Unit  Quaternions,”  J.  Opt.  Soc. 
Am.  A,  vol.  4,  pp.  629-642,  1987. 

[11]  R.  V.  R.  Kumar,  A.  Tirmalai  and  R.C.  Jain,  “A 
Nonlinear  Optimisation  Algorithm  for  the  Estima¬ 
tion  of  Structure  and  Motion  Parameters,”  CVPR, 
San  Diego,  CA,  pp.  136-143,  1989. 

[12] L. Matthies, R. Szeliski, and T. Kanade, “Incremental Estimation of Dense Depth Maps from Image Sequences,” CVPR, Ann Arbor, Michigan, pp. 366-374, 1988.

[13] L. Matthies, T. Kanade, and R. Szeliski, “Kalman Filter-Based Algorithms for Estimating Depth from Image Sequences,” International Journal of Computer Vision, vol. 3, pp. 209-236, 1989.

[14]  J.  Oliensis  and  J.  I.  Thomas,  “Incorporating  Motion 
Error  in  Multi-frame  Structure  from  Motion,”  Pro¬ 
ceedings  IEEE  Workshop  on  Visual  Motion,  Prince¬ 
ton,  pp  8-13,  1991. 

[15] H. S. Sawhney, J. Oliensis, and A. R. Hanson, “Description and Reconstruction from Image Trajectories of Rotational Motion”, in ICCV, Osaka, Japan, December, 1990, pp. 494-498.

[16]  M.  Spetsakis  and  J.  Aloimonos,  “A  Multi-frame  Ap¬ 
proach  to  Visual  Motion  Perception,”  Proc.  IEEE 
Workshop  on  Motion,  Irvine,  CA,  March,  1989. 

[17]  J.  Inigo  Thomas  and  J.  Oliensis,  “Fusing  Structure 
by  Kalman  Filtering,”  TR  90-93,  COINS,  UMASS, 
May  1990. 

[18]  J.  Inigo  Thomas  and  J.  Oliensis,  “Incorporating 
Motion  Error  in  Multiframe  Structure  from  Mo¬ 
tion,”  7th  Scandinavian  Conference  on  Image  Anal¬ 
ysis,  Denmark,  pp.  950-957,  1991. 

[19] C. Tomasi and T. Kanade, “Shape and Motion without Depth”, IUW, Pittsburgh, PA, pp. 258-270, 1990.


Sequence      # Frames  # Pts.  Depth Range    FOV              Focal length  Image
Synthetic     13        20      43-65          45°              16 mm         512 x 512
Box           8         12      (not legible)  (23.4°, 22.4°)   16 mm         256 x 242
Rocket-Field  11        15      (not legible)  (71.9°, 56.8°)   6 mm          512 x 512

Table 1. Parameters for the three motion sequences used.


Iter.     1     2              3     4     5     6     7     8     9     10    11
2 Frame   22.2  (not legible)  15.?  25.9  10.2  17.9  14.5  24.0  10.3  14.8  11.9
Recur.    11.5  11.9           9.9   9.3   8.6   7.6   5.9   5.0   4.5   3.6   3.4

Table 2.1. Average 3D error after absolute orientation for synthetic sequence. For the recursive algorithm, most of the error in 9-11 is due to a single outlier point.


True       37.0  43.2  59.4  46.8  56.9  39.8  44.9  46.9  38.0
2 Frame    37.3  37.4  61.1  48.9  72.1  61.0  31.0  38.8  40.1
Recursive  36.7  44.4  56.5  45.4  54.1  39.4  43.1  46.6  37.7

True       52.9  41.9  39.2  45.7  44.7  41.8  46.9  55.0  56.8  55.8
2 Frame    43.3  34.5  44.1  38.2  14.9  22.7  89.6  61.6  77.3  60.2
Recursive  52.9  42.0  39.2  45.6  30.3  38.4  46.4  55.2  56.1  50.2

Table 2.2. Unregistered depths in last frame for 20 tracked points.

Table 2.3. Magnitudes of the rotation, translation, scale used for registration in the last 10 iterations.


2 

3 

4.88 

13.97 

10.9 

3.02 

6 

7 

6.37 

5.87 

1.51 

1.7 

Table  3.1.  Average  3D  error  after  absolute  orientation  for  box  sequence. 


True:   598.7  611.9  589.1  600.6  654.9  629.9  644.8  665.8  669.7  687.8  629.7  616.2
2 Fr.:  607.0  613.9  595.4  596.2  649.0  635.2  648.3  668.6  657.8  631.2  611.3
Rec.:   600.3  611.7  591.0  600.5  652.8  630.4  644.0  665.0  667.9  687.1  632.6  (not legible)

Table 3.2. Depths (mm) after registration in last frame for 12 tracked points.


2  Pr.  Rot.  (dg) 


2  Fr.  Trans,  (mm) 


Recur.  Trans. 


2  Fr.  Scale 


Recur.  Scale 


21.79  46.76  160.20  25.08 


21.98  152.97  26.64  3.64 


24.07  44.27  22.89 


3.55  14.09  24.48 


0.94  0.84  1.14 


1.06  0.98  1.10 


Table  3.3.  Magnitudes  of  the  registration  transformations. 


Iteration 


2  Frame  (m) 


Recursive 


Table  4.1.  Average  3D  error  for  10  points  in  rocket-field  sequence. 


Iteration  1     2              3     4     5     6     7     8     9     10
2 Frame    26.1  12.1           28.5  14.8  10.2  15.8  21.1  27.9  25.6  21.3
Blind Avg  26.1  14.8           17.3  14.8  13.5  14.1  12.6  13.7  13.7  (not legible)
Recursive  26.1  (not legible)  12.0  12.5  10.4  8.2   8.2   7.1   6.4   5.8

Table 4.2. Average percentage error in depths for 10 points.

Abstract

Let  a  binocular  camera  system  move  through  the  envi¬ 
ronment.  It  is  possible  to  determine  a  three  dimensional 
field  of  vectors,  where  each  vector  is  parallel  to  the  ac¬ 
tual  inducing  3D  velocity  of  an  imaged  point  (relative  to 
the  moving  camera)  scaled  by  the  depth  of  that  point. 
This  representation  enables  a  more  realistic  description 
of  the  relative  3D  velocity  between  the  environment  and 
the  camera  than  is  afforded  by  considering  the  informa¬ 
tion  in  either  the  optic  flow  field  or  the  disparity  field 
alone.  This  vector  field  (termed  the  p-field)  may  then 
be  employed  in  the  processing  of  a  sequence  of  binocu¬ 
lar  images  to  determine  the  camera  motion  parameters 
and  depth  of  the  viewed  scene.  In  this  paper,  we  present 
this  p-field,  and  discuss  its  computation  from  the  image 
measurable  quantities:  optic  flow  and  stereo  disparity. 

1  Introduction 

The  computation  of  depth  is  important  for  many  aspects 
of  vision.  One  popular  source  of  depth  information  is 
stereo  imagery.  Another  source  is  time-varying  imagery 
from which the motion of the environment relative to
the  camera  system  is  determined  and  from  this  depth  is 
computed.  Our  goal  is  the  integrated  processing  of  these 
two  sources  of  depth  information. 

One  of  the  earliest  uses  of  the  term  stereoscopic 
motion  was  in  [Regan  et  al.,  79].  They  provide  evidence 
to  support  models  of  neural  organisations  in  the  human 
visual  system  that  are  binocularly  triggered  by  chang¬ 
ing  disparity  for  the  purpose  of  detecting  motion  in 
depth.  The  computational  model  developed  here  has 
been  strongly  motivated  by  such  psychophysical  evi¬ 
dence. 

There  have  been  many  differing  attempts  to  in¬ 
tegrate  stereo  and  motion  processing.  Some,  such  as 
the  work  of  [Jenkin,  84]  and  [Waxman  et  al.,  86],  have 
concentrated  on  addressing  the  early  (and  similar)  cor¬ 
respondence  issues.  Jenkin  used  both  stereo  and  mo¬ 
tion  information  in  a  prediction-correction  formulation, 
while Waxman et al. established a correlation between the stereo disparity and the relative flow between the stereo pair. Other work has primarily addressed the problem of relative motion and depth recovery. Mitiche [Mitiche, 88] used depth from stereo with information from the optic flow to impose rigidity constraints on the actual 3D motion. Current approaches [Matthies, 88, Zhang et al., 88] typically employ the stereo module as the main source of depth information. This is used to guide the motion analysis over a sequence of images, as well as to plan the robot motions that would simplify the stereo correspondence problem. Often approaches are embedded in a Kalman Filter framework in order to bound the errors over a sequence of stereo frame-pairs.

The  model  of  binocular  motion  processing  that  we 
propose  in  this  paper  is  motivated  by  a  need  to  develop  a 
single  framework  for  binocular  motion  processing  within 
which we can then address each of these issues.

1.1  Our  Approach 

In this work, we consider the dynamics of a scene as viewed by a stereo pair of cameras. This subsumes both the analysis of static stereo imagery at one time instant to obtain the static disparity between the two images and thereby depth, and the analysis of a monocular motion pair to obtain the optic flow for a pair of frames and thereby relative motion and depth. We are interested in utilising binocular motion in order to obtain integrated additional constraints that can then be employed to determine the motion and depth without depending on a single (and possibly erroneous) source for the depth or on mere juxtaposition of the results from independent processing of the motion and stereo information.

The model presented here depends on a relation between the ratio of the relative optic flow in a binocular pair of images and the disparity, derived originally in [Waxman et al., 86]. Employing this result, we demonstrate that it is possible to derive the vector parallel to the real instantaneous 3D velocity vector scaled by the depth of the point, located at the image of the 3D point. This vector is derived using purely image measurable quantities, i.e., the optic flow and disparity. This field of scaled 3D vectors is referred to as the p-field, a term originally coined in [Scott, 86], although not in the context of binocular motion. The p-field has significant implications about the phenomenon of binocular motion since it implies that at the lowest levels of visual processing, where traditionally only the 2D entities located on the image plane were considered, it is now possible to examine and exploit the nature of 3D phenomena directly.

We are currently interested in the use of the p-field as a framework within which to represent the problems of flow/disparity computation, the computation of the 3D motion as well as independent object motion. For instance, observing that the p-vector is the scaled real 3D velocity vector, it is shown [Balasubramanyam, 92] to be more appropriate to impose smoothness constraints (spatial and temporal) on this vector field than on the optic flow field. This was examined by [Scott, 86] in the context of spatial smoothing, but not within the framework of binocular motion. It was also not recognised that observing the changing environment with a binocular system enables us to actually compute this 3D field.

We are also interested in the interpretation of available flow and disparity information for the estimation of
the  motion  parameters  and  depth  within  this  domain. 
For  instance,  in  the  case  of  ideal  pure  translation,  the 
p-field  directly  yields  the  direction  of  translation.  Since 
the  assumption  of  ideal  translation  is  not  very  realistic  in 
most  experimental  situations  [Dutta  et  al.,  88],  our  cur¬ 
rent  work  is  focused  on  examining  the  effects  that  small 
rotations  have  on  such  an  assumption  of  pure  transla¬ 
tion.  In  the  case  of  general  motion,  several  possible  al¬ 
gorithms  for  the  computation  of  the  motion  parameters 
are  being  examined. 

2  The  Mathematical  Model 

In  this  section,  we  briefly  describe  a  model  for  combining 
information  from  motion  and  stereo  when  the  binocular 
sensor is in motion in the environment. More details may be found in [Balasubramanyam, 91]. It is shown that the
temporal  changes  in  the  two  images  of  the  stereo  camera 
system  have  a  very  specific  relationship  to  each  other, 
and  to  the  changing  environment.  An  instantaneous 
description  of  the  relative  motion  between  the  camera 
system  and  the  environment  is  employed  in  order  to  es¬ 
tablish  the  relationship  between  the  various  image  mea¬ 
surable  entities  available  from  binocular  imagery.  The 
measured  image  entities  are  the  two  separate  optic  flow 
fields  for  the  left  and  right  image  pairs,  and  the  stereo 
disparity  at  the  first  time  instance. 

Consider a camera system with a fixed right hand cartesian coordinate reference, $(X^c, Y^c, Z^c)$. Without any loss of generality, we can assume that the optical axis coincides with the $Z^c$-axis, and that the focal length is normalised to unity, i.e., all distances are measured in units of “focal length”. The center of projection is at the camera origin, and the image is formed on the image plane I, $Z^c = 1$. Let the camera system have instantaneous motion relative to the environment described by a translational velocity, $\mathbf{t} = (T_X, T_Y, T_Z)$, and an angular velocity, $\boldsymbol{\omega} = (\omega_X, \omega_Y, \omega_Z)$, about an axis passing through the origin of the coordinate reference frame attached to the camera system.

Figure 1: Let A,d be two possible instantaneous 3D velocity vectors of point P such that they project to the same 2D image velocity vector f. The resulting p-vectors are £,b respectively, disambiguating between the two inducing 3D instantaneous velocity vectors.

The motion of a stationary environmental point $\mathbf{P} = (X, Y, Z)$, in the coordinate system attached to the camera, is given by:

$$\dot{\mathbf{P}} = -(\boldsymbol{\omega} \times \mathbf{P} + \mathbf{t}). \qquad (1)$$

The p-vector, $\mathbf{p}(x, y)$, is defined [Scott, 86] as

$$\mathbf{p}(x, y) = \frac{\dot{\mathbf{P}}}{Z}, \qquad (2)$$

where Z is the depth of the point P. This is the vector parallel to the instantaneous 3D velocity vector of the environmental point relative to the camera and scaled inversely by the depth of the point. We denote the set of vectors, $\Pi = \{\mathbf{p}(x, y);\ (x, y) \in \mathcal{D}\}$, where $\mathcal{D}$ denotes the image domain (assume a discrete grid of points (x, y)), as the p-vector field. Now, under 3D rotations of the coordinate system, the p-vector does not transform like a vector. This rotational non-covariant property implies that the p-vector is not really a vector; nevertheless the terminology has achieved some legitimacy, and is preserved here.

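Assuming perspective projection, the components of the p-vector are:

$$\mathbf{p} = \begin{pmatrix} p_x \\ p_y \\ p_z \end{pmatrix} = \begin{pmatrix} \omega_Z y - \omega_Y - T_X/Z \\ \omega_X - \omega_Z x - T_Y/Z \\ \omega_Y x - \omega_X y - T_Z/Z \end{pmatrix}. \qquad (3)$$

This p-vector is considered with reference to the image of the 3D point of interest on the image plane (x, y), and has components in the image plane $(p_x, p_y)$ as well as a component orthogonal to the image plane, $p_z$ (see Figure 1).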
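A small sketch of the p-vector for a point imaged at (x, y) with depth Z is given below, assuming unit focal length and a point P = Z (x, y, 1) moving with velocity -(omega x P + t). The function names are hypothetical. The second function uses the relation u = p_x - x p_z, v = p_y - y p_z, which follows from differentiating x = X/Z and y = Y/Z; it is included as an illustration and is not taken from the text.

```python
def p_vector(x, y, Z, omega, t):
    """Return (p_x, p_y, p_z), the 3D velocity of the point divided by its depth."""
    wx, wy, wz = omega
    tx, ty, tz = t
    return (wz * y - wy - tx / Z,
            wx - wz * x - ty / Z,
            wy * x - wx * y - tz / Z)


def image_velocity(x, y, p):
    """Project a p-vector to the 2D image velocity (u, v) at image point (x, y)."""
    px, py, pz = p
    return (px - x * pz, py - y * pz)


# Pure translation: the p-vector is -t / Z at every image point.
p = p_vector(0.1, 0.2, 2.0, omega=(0.0, 0.0, 0.0), t=(1.0, 2.0, 3.0))
u, v = image_velocity(0.1, 0.2, p)

# Pure rotation about the optical axis: the p-vector does not depend on depth.
p_rot = p_vector(1.0, 0.0, 5.0, omega=(0.0, 0.0, 1.0), t=(0.0, 0.0, 0.0))
```

In the pure-translation case every p-vector is parallel to the translation, which is why the p-field yields the direction of translation directly in that case.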

The image velocity field (approximated in computation by the optic flow field) is denoted as $\Phi = \{(u(x, y), v(x, y));\ (x, y) \in \mathcal{D}\}$, where $\mathcal{D}$ denotes the image domain as before.

Consider a parallel-axes binocular camera system
(i.e., no vergence). The translational displacement of
the two axes is described by a horizontal baseline of $2b_x$
and a vertical baseline of $2b_y$. We assume $f = 1$ for
both cameras. A (virtual) cyclopean coordinate reference
frame is placed in the middle, between the left and
right camera coordinate reference frames. Superscripts
$l$, $r$, $c$ denote the entities with respect to the left, right
and cyclopean references, respectively. With respect to
the cyclopean origin, the displacement of the left reference
origin, $O^l$, is given by $(b_x, b_y, 0)^T$, and the
displacement of the right reference origin, $O^r$, is given
by $(-b_x, -b_y, 0)^T$.

A point in the environment projects to $(x^l, y^l)$ in
the left image and to $(x^r, y^r)$ in the right image. The
disparity field for environmental points between their
corresponding left and right images is denoted by

$$\Delta = \{\delta(x, y); (x, y) \in \mathcal{D}\} = \{(\delta_x(x, y), \delta_y(x, y)); (x, y) \in \mathcal{D}\},$$

where the image domain $\mathcal{D}$ can be considered synonymous
with the left, right or cyclopean image planes. In
the following, the cyclopean image domain is chosen as
the reference domain. The disparity vector is then given
by the stereo triangulation equations:


2.1 The p-vector

Assume that the binocular system has the geometry
described above, and let the cyclopean coordinate reference
frame have motion described by an angular velocity
$\omega^c = (\omega_X, \omega_Y, \omega_Z)$ about an axis passing through
the cyclopean origin, and a translational velocity
$t^c = (T_X, T_Y, T_Z)$, with respect to the stationary
environment. The instantaneous velocity of a point
$P^c = (X^c, Y^c, Z^c)$ can be given, as before, by:

$$\dot{P}^c = -(\omega^c \times P^c + t^c). \qquad (5)$$


Denoting the image velocity fields for the left and
right images by $\Phi^l(x^l, y^l)$ and $\Phi^r(x^r, y^r)$ respectively,
the difference of the image velocities in the two images,
$\Phi^r(x^r, y^r) - \Phi^l(x^l, y^l)$, can then be shown to be (see
[Balasubramanyam, 91] for details):

$$\frac{u^r - u^l}{\delta_x} = \frac{v^r - v^l}{\delta_y} = -y^c \omega_X + x^c \omega_Y - \frac{T_Z}{Z^c}. \qquad (6)$$


Comparing this relative flow with Equation 3, we see
that

$$\frac{u^r - u^l}{\delta_x} = \frac{v^r - v^l}{\delta_y} = p_z. \qquad (7)$$

Here, $p_z$ is the component of the p-vector that describes
the scaled relative 3D velocity along the optical axis
(orthogonal to the image plane).


In [Waxman et al., 86], the binocular camera geometry
modelled consists of only relative horizontal displacement
between the camera origins (i.e., $b_y = 0$), and
Equation 6 is termed the relative flow. This result is
employed in [Balasubramanyam, 87] to recover the motion
in depth parameters, $(\omega_X, \omega_Y, T_Z)$. It can be seen
that accommodating a vertical displacement in the camera
geometry, and thus accounting for a vertical disparity,
merely provides an additional source of information
about the relative flow.


Thus, given the image measurable quantities,
$(u^l, v^l)$, $(u^r, v^r)$, and $(\delta_x, \delta_y)$ for an image point,
we obtain $p_z$, which in turn can be
used with the flow to compute the remaining components
of the p-vector with respect to the left and right image
coordinates as $[p^l = (p_x^l, p_y^l, p_z)^T, p^r = (p_x^r, p_y^r, p_z)^T]$,
where the components are given as:

$$\begin{pmatrix} p_x^j \\ p_y^j \end{pmatrix} = \begin{pmatrix} u^j + x^j p_z \\ v^j + y^j p_z \end{pmatrix} \quad \text{for } j = l, r. \qquad (8)$$
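The recovery just described can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the authors' implementation: $p_z$ is taken as relative flow divided by disparity, and the in-plane components follow from $u = p_x - x p_z$ and $v = p_y - y p_z$ (perspective projection with $f = 1$). The function names, the rule of using the larger disparity component, and the sample numbers are all hypothetical.

```python
def p_z_from_relative_flow(flow_left, flow_right, disparity, min_disparity=1e-6):
    """Estimate p_z as relative flow divided by disparity.

    Assumption of this sketch: the component (horizontal or vertical) with
    the larger disparity magnitude is used, since it is the better
    conditioned of the two ratios.
    """
    (ul, vl), (ur, vr) = flow_left, flow_right
    dx, dy = disparity
    if max(abs(dx), abs(dy)) < min_disparity:
        raise ValueError("disparity too small: p_z is indeterminate")
    if abs(dx) >= abs(dy):
        return (ur - ul) / dx
    return (vr - vl) / dy


def p_vector_in_view(x, y, flow, p_z):
    """In-plane components from one view's flow: p_x = u + x p_z, p_y = v + y p_z."""
    u, v = flow
    return (u + x * p_z, v + y * p_z, p_z)


# Synthetic measurements for one scene point (hypothetical values).
flow_l, flow_r, disp = (0.1, 0.05), (0.15, 0.05), (-0.1, 0.0)
pz = p_z_from_relative_flow(flow_l, flow_r, disp)
p_left = p_vector_in_view(0.2, 0.1, flow_l, pz)
```

The explicit error for near-zero disparity mirrors the observation in the text that the relative flow becomes indeterminate as the baseline shrinks.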


By  symmetry,  we  determine  the  cyclopean  p-vector 


The p-vector yields an elegant natural description of
the inducing instantaneous velocity when viewed binocularly.
If the camera geometry is assumed to have no vertical
(Y) displacement between the two camera origins,
then the relative vertical flow between the two images
becomes indeterminate, implying that the computation of
$p_z$ depends entirely on the relative horizontal flow
component. In general, in the limit as the baseline between
the two cameras tends to zero, and the geometry collapses
to a monocular motion situation, the relative flow
components (both horizontal and vertical) become
indeterminate. Thus, the orthogonal $p_z$ component (and
thereby the other two components, $p_x$ and $p_y$) cannot
be determined. In a similar vein, as the baseline between
the cameras decreases, the error in determining the $p_z$
component will increase dramatically, exhibiting unstable
behavior due to small values for both the relative flow
components and the stereo disparity. A large baseline,
relative to the depth of interest, is important for robust
p-field computation.

[Balasubramanyam, 91] contains a discussion of the
properties exhibited by the p-field for pure rotational
as well as pure translational motion. This paper also
briefly discusses an algorithm for determining the motion
parameters in the case of general motion, including
preliminary results on real data.

In ongoing work, the p-field model is being developed
for a binocular viewing geometry accommodating
vergence between the two camera axes. This model is
consistent  (as  is  to  be  expected)  with  the  development 
here  in  that  it  demonstrates  that  the  relative  motion  be¬ 
tween  the  binocular  image  planes  over  time  yields  the 
(scaled)  component  of  the  3D  motion  along  the  optical 
axis,  and  that  this  can  be  further  employed  to  describe 
the  complete  (scaled)  3D  motion. 
3 Uncertainty Analysis

It is seen from equation 9 that optic flow estimates from
the left and right image pairs as well as stereo disparity
estimates are required to compute the p-field. Given that
these match estimates are susceptible to noise in real image
scenarios, it is important to understand the dependence
of the uncertainty in the p-vector estimate on the
uncertainty in the flow and disparity estimates. The
second-order statistics of the p-vector estimate are crucial to
all algorithms that use the p-field for the recovery of motion,
depth and motion boundaries. We briefly present
the covariance analysis for the computed p-vector. More
details may be found in [Balasubramanyam, 92].

The  required  initial  optic  flow  estimates  are  ob¬ 
tained  from  the  hierarchical  correlation  algorithm  de¬ 
scribed  in  [Anandan,  89].  A  version  of  the  same  algo¬ 
rithm  is  modified  to  search  along  a  narrow  window  cen¬ 
tered  on  the  epipolar  constraint  line  in  order  to  obtain 
the  initial  disparity  estimates. 

The (3 × 3) covariance matrix associated with the
computed p-vector is obtained by propagating the (2 × 2)
covariance matrices associated with the flow and disparity
estimates. We assume that a specific matching algorithm
[Anandan, 89] is employed to obtain the flow
and disparity estimates and the associated covariances.
However, the analysis applies equally to any algorithm as
long as it returns the displacement estimates with their
2 × 2 covariance matrices or provides a way of determining
these matrices.

In the matching algorithm described
in [Anandan, 89], the Hessian of the sum-of-squared-differences
(SSD) error surface is used to obtain the confidence
weights and confidence directions associated with
the displacement match estimate. If $H$ is the Hessian of
the SSD error surface, and $\sigma^2$ is the variance of uncorrelated
Gaussian noise, then it is shown in [Szeliski 88]
that for a given match displacement estimate, $d$, the
associated 2 × 2 covariance matrix is given by:

$$\Sigma_{dd} = 2\sigma^2 H^{-1}. \qquad (10)$$

Thus, with each flow and disparity match estimate $d$, we
associate the 2 × 2 covariance matrix $\Sigma_{dd}$ (from equation 10).
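A small Python sketch of equation 10 may help make the relationship concrete. It assumes the Hessian is supplied as a 2 × 2 nested list and the noise variance as a scalar. The function name `match_covariance` and the numerical values are hypothetical.

```python
def match_covariance(hessian, noise_var):
    """Covariance 2 * sigma^2 * H^{-1} of a displacement estimate.

    `hessian` is the 2x2 Hessian of the SSD error surface at the match,
    `noise_var` the variance sigma^2 of the image noise.
    """
    (a, b), (c, d) = hessian
    det = a * d - b * c
    if abs(det) < 1e-12:
        raise ValueError("singular Hessian: displacement is unconstrained")
    s = 2.0 * noise_var / det
    # Closed-form inverse of a 2x2 matrix, scaled by 2 * sigma^2.
    return [[s * d, -s * b], [-s * c, s * a]]


# SSD surface sharply curved along x but nearly flat along y:
# the match is well constrained horizontally and poorly vertically.
H = [[8.0, 0.0], [0.0, 0.5]]
cov = match_covariance(H, noise_var=0.01)
```

A flat direction of the SSD surface (small curvature) therefore maps to a large variance in that direction, which is how the confidence directions mentioned above arise.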

Let $J_{px}$ denote the Jacobian that defines the functional
dependence of the random variable $p = (p_x, p_y, p_z)^T$ on
the random variable $x = (u^l, v^l, u^r, v^r, \delta_x)^T$. This may
be derived employing equation 9. $\Sigma_{xx}$ denotes the
covariance matrix for the variable $x$, and may be obtained
by applying equation 10 to each of the flow and disparity
match estimates, assuming that these estimates are
obtained independently. Thus, the covariance for the
match estimates can be propagated to obtain the covariance
matrix for the p-vector estimate, $\Sigma_{pp}$, as:

$$\Sigma_{pp} = J_{px} \Sigma_{xx} J_{px}^T. \qquad (11)$$
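The propagation step can be illustrated with a short Python sketch. Because the cyclopean expression is not reproduced here, the sketch makes a simplifying assumption: it uses the left-view p-vector with horizontal disparity only, i.e. $p_z = (u^r - u^l)/\delta_x$, $p_x = u^l + x p_z$, $p_y = v^l + y p_z$, and differentiates these with respect to $(u^l, v^l, u^r, v^r, \delta_x)$. Function names and numbers are hypothetical.

```python
def jacobian_left_p(x, y, ul, ur, dx):
    """3x5 Jacobian of (p_x, p_y, p_z) w.r.t. (u_l, v_l, u_r, v_r, delta_x)."""
    pz = (ur - ul) / dx
    # Partial derivatives of p_z = (u_r - u_l) / delta_x.
    row_z = [-1.0 / dx, 0.0, 1.0 / dx, 0.0, -pz / dx]
    # p_x = u_l + x * p_z and p_y = v_l + y * p_z add one direct term each.
    row_x = [x * g for g in row_z]
    row_x[0] += 1.0
    row_y = [y * g for g in row_z]
    row_y[1] += 1.0
    return [row_x, row_y, row_z]


def propagate(J, S):
    """Return J S J^T for nested-list matrices (J is n x m, S is m x m)."""
    n, m = len(J), len(S)
    JS = [[sum(J[i][k] * S[k][j] for k in range(m)) for j in range(m)]
          for i in range(n)]
    return [[sum(JS[i][k] * J[j][k] for k in range(m)) for j in range(n)]
            for i in range(n)]


# Independent measurements with equal variance 1e-4 (hypothetical).
S_xx = [[1e-4 if i == j else 0.0 for j in range(5)] for i in range(5)]
cov_p = propagate(jacobian_left_p(0.0, 0.0, 0.1, 0.15, -0.1), S_xx)
# Same p_z, but half the disparity (as for a shorter baseline).
cov_p_half = propagate(jacobian_left_p(0.0, 0.0, 0.1, 0.125, -0.05), S_xx)
```

In this example the variance of $p_z$ is far larger than that of the input measurements, and halving the disparity quadruples it, in line with the earlier remark about small baselines.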

In general, the covariance matrix for the p-vector,
$\Sigma_{pp}$, is not a diagonal matrix since the elements are
correlated. It can be shown [Balasubramanyam, 92] that
this covariance matrix can be used to establish a confidence
ellipsoid around the estimated p-vector value,
where the semi-axes of the ellipsoid are determined by
diagonalizing $\Sigma_{pp}$.

Figure 2: The two stereo image pairs viewed by a binocular
camera system which is undergoing nearly translational
motion at time instances $t_1$ and $t_2$.

4  Experimental  Results 

The grey-scale stereo images at two time instances are
part of a data set from an indoor run with the “Harvey”
robot vehicle at UMass. The binocular parallel axes of
the camera pair had a baseline of 20 inches. The distance
from the vehicle center to the corner point of intersection
of the two walls with floor (Figure 2) was approximately
18 feet. The motion of the vehicle was toward the corner.
Absolute ground truth for the camera motion is
not known. The motion was largely translational,
directed along the optical axis of the camera pair. The
hierarchical correlation algorithm [Anandan, 89] is employed
to obtain the flow and disparity estimates for image
resolution of 128 × 128. No smoothing is performed
on the optic flow or disparity estimates that are used to
compute the p-field (shown in Figure 3). It can be observed
that since the motion of the camera is toward the
viewed scene, the p-vectors for most of the pixels are
directed correctly parallel to the optical axis (since the $p_x$ and
$p_y$ components are ≈ 0). Significant errors occur at points in
the left portion of the left image where considerable occlusion
occurs between the stereo pair as well as between
the right motion pair. Other sources of significant error
are horizontal lines in the four intensity images since
they lead to erroneous horizontal flow and stereo disparity
estimates, and thereby erroneous $p_z$ components.
Covariance analysis on this p-field yields estimates for
the uncertainty in the computed p-vectors. This is employed
as a filter on the computed p-field and the resulting
filtered field is shown in Figure 4. This filtering step
is adopted here only to demonstrate that the covariance
analysis enables the elimination of many vectors with
errors. In current work we are looking at spatial and
temporal smoothing of the p-fields computed over multiple
stereo sequences using the uncertainty estimates to
weight the individual vectors.

Figure 3: The p-vectors, shown at every fourth image row
and column location, computed from unsmoothed optic flow
and disparity estimates.

Figure 4: The p-vectors, shown at every second image row
and column location, with uncertainty estimates used to filter
out significant errors.
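The filtering and weighting ideas of this section can be sketched as follows. The paper does not state which scalar measure of uncertainty or which threshold was used, so this sketch assumes the trace of the 3 × 3 covariance matrix as the measure and a user-chosen cutoff. The function names, the cutoff value and the sample data are all hypothetical.

```python
def trace(cov):
    """Sum of the diagonal entries (total variance) of a square matrix."""
    return sum(cov[i][i] for i in range(len(cov)))


def filter_p_field(vectors, covariances, max_trace):
    """Keep p-vectors whose covariance trace is at most max_trace.

    Rejected entries are replaced by None so that positions are preserved.
    """
    return [p if trace(c) <= max_trace else None
            for p, c in zip(vectors, covariances)]


def weighted_mean(vectors, covariances):
    """Average p-vectors with weights 1 / trace(covariance)."""
    weights = [1.0 / trace(c) for c in covariances]
    total = sum(weights)
    return tuple(sum(w * p[i] for w, p in zip(weights, vectors)) / total
                 for i in range(3))


def diag3(v):
    """Helper: 3x3 diagonal covariance with variance v on each axis."""
    return [[v if i == j else 0.0 for j in range(3)] for i in range(3)]


vectors = [(0.0, 0.0, -0.5), (0.3, -0.2, -0.1), (0.0, 0.0, -0.48)]
covs = [diag3(0.01), diag3(1.0), diag3(0.02)]
kept = filter_p_field(vectors, covs, max_trace=0.1)
smoothed = weighted_mean(vectors, covs)
```

Here the second vector, which has a large covariance, is dropped by the filter and contributes very little to the weighted mean.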

5  Conclusion 

The  p-field  is  proposed  as  a  model  for  binocular  mo¬ 
tion  processing.  It  provides  a  natural  description  of  the 
3D  motion  of  a  camera  system  relative  to  the  environ¬ 
ment  since  it  retains  the  directional  motion  information. 
Hence,  the  p-field  can  be  used  to  model  smooth  3D  mo¬ 
tion.  Rigidity  constraints  can  be  imposed  more  naturally 
on  such  a  3D  field  and  we  can  examine  the  problem  of 
independent  object  motion  detection  and  occlusion  de¬ 
tection  from  this  context  as  well.  We  present  the  p-field 
as  a  framework  to  examine  issues  in  binocular  motion 
processing. 
Abstract

We  propose  here  a  new  approach  to  addressing 
problems  related  to  visual  motion,  namely  the 
purposive approach [4]. Instead of considering
the  various  visual  motion  tasks  as  applications 
of  the  general  structure  from  motion  module, 
we  consider  them  as  independent  problems  and 
we  directly  seek  solutions  for  them.  As  a  re¬ 
sult  we  can  achieve  unique  and  robust  solutions 
without  having  to  compute  optic  flow  and  with¬ 
out  requiring  a  full  reconstruction  of  the  visual 
space,  because  it  is  not  needed  for  the  tasks.  In 
the  course  of  the  exposition,  we  present  novel 
solutions  to  various  important  visual  tasks  re¬ 
lated  to  motion,  such  as  the  problems  of  motion 
detection  by  a  moving  observer,  passive  naviga¬ 
tion,  relative-depth  computation,  3-D  motion 
estimation,  and  visual  interception,  using  as  in¬ 
put  only  the  spatial  and  temporal  derivatives 
of  the  image  intensity  function.  It  turns  out 
that  the  spatiotemporal  derivatives  of  the  im¬ 
age  (i.e.  the  so-called  normal  flow)  do  not  seem 
to  be  capable  of  solving  the  general  “structure 
from  motion”  problem.  They  are,  however,  suf¬ 
ficient  to  provide  robust  algorithms  for  the  so¬ 
lution  of  many  interesting  visual  tasks  that  do 
not  require  the  full  solution,  but  only  part  of  it. 

The  ability  to  create  robust  nontrivial  behav¬ 
iors  suggests  the  possibility  that  visual  percep¬ 
tion  could  be  studied  as  intelligent  behavior. 

We  point  out  some  of  the  benefits  and  draw¬ 
backs  of  this  paradigm  that  studies  vision  as  a 
set  of  behaviors  that  recover  the  visible  world 
partially,  but  well  enough  to  carry  out  a  task 
(purposive,  animate  or  behavioral  vision),  and 
we  contrast  it  to  the  traditional  paradigm  of 
treating  vision  as  a  general  recovery  problem. 

1  Introduction  and  Motivation 

The problem of structure from motion has attracted a lot
of attention in the past several years [23, 30, 33] because
of the general usefulness that a potential solution to this
problem would have. Important navigational problems,
such as detection of independently moving objects by
a moving observer, passive navigation, obstacle detection,
target pursuit, and many other problems related to
robotics, teleconferencing, etc., would be simple applications
of a structure from motion module. The problem
has been formulated as follows: given a sequence of images
taken by a monocular observer (the observer and/or
parts of the scene could be moving), recover the shapes
(and relative depths) of the objects in the scene, as well
as the (relative) 3-D motions of independently moving
bodies.

The  problem  has  been  formulated  and  usually  treated 
as  an  aspect  of  the  general  task  of  recovering  3-D  in¬ 
formation  from  motion  [25,  19].  The  majority  of  the 
proposed  solutions  to  date  are  based  on  the  following 
modular  and  hierarchical  approach: 

1.  First,  one  computes  the  optic  flow  on  the  image 
plane,  i.e.  the  velocity  with  which  every  image  point 
appears  to  be  moving.* 

2.  Then  segmentation  of  the  flow  field  is  performed  and 
different  moving  objects  are  identified  on  the  image 
plane.  From  the  segmented  optic  flow  one  then  com¬ 
putes  the  3-D  motion  with  which  each  visible  sur¬ 
face  is  moving  relative  to  the  observer.  (Assuming 
that  an  object  moves  rigidly,  a  monocular  observer 
can  only  compute  its  direction  of  translation  and  its 
rotation,  but  not  its  speed.) 

3.  Finally,  using  the  values  of  the  optic  flow  along  with 
the  results  of  the  previous  step,  one  computes  the 
surface  normal  at  each  point,  or  equivalently,  the 
ratio $Z_i/Z_j$ of the depths of any two points $i$ and $j$.

The  reason  that  most  approaches  have  followed  the 
above  three-step  approach  is  two-fold.  The  first  is  due 
to  the  formulation  of  the  problem,  which  insists  on  re¬ 
covering  a  complete  relative  depth  map  and  accurate 
three-dimensional  motion.  The  second  is  due  to  the 
fact  that  the  constraints  relating  retinal  motion  to  three- 
dimensional  structure  involve  3-D  motion  in  a  nonlinear 
manner  that  does  not  allow  separability.  For  examples 
of such approaches, see [1, 34, 23, 30]. However, the
past work in this paradigm, despite its mathematical
elegance, is far from being useful in real-time navigational
systems, and such techniques have found few or no practical
applications.^ Consequently, this approach cannot
be used yet to explain the ability of biological organisms
to handle visual motion.

*For clarity, we consider only the differential case. In the
case of long range motion one computes discrete displacements,
but the analysis remains essentially the same.

Figure 1: The aperture problem. Point A could have
moved to B, C, D, E. However, whatever the value of
the image motion vector is, its projection on the normal
to α is always AD (known). Legend: α, feature before
motion; β, feature after motion.

There  exist  many  reasons  for  the  limitations  of  the  op¬ 
tic  flow  approach,  related  to  all  three  steps  listed  above. 
To  begin,  the  computation  of  optic  flow  is  an  ill-posed 
problem,  i.e.  unless  we  impose  additional  constraints,  we 
cannot  estimate  it  [19].  Such  constraints,  however,  im¬ 
pose  a  relationship  on  the  values  of  the  flow  field  which 
is  translated  into  an  assumption  about  the  scene  in  view 
(for  example,  smooth).  Thus,  even  if  we  are  capable  of 
obtaining  an  algorithm  that  computes  optic  flow  in  a 
robust manner, the algorithm will work only for a restricted
set of scenes. The only available constraint at
every point $(x, y)$ of the changing image $I(x, y, t)$ for the
flow $(u, v)$ is the constraint $I_x u + I_y v + I_t = 0$ [21], where
the subscripts denote partial differentiation. This means
that we can only compute the projection of the flow on
the gradient direction ($(I_x, I_y) \cdot (u, v) = -I_t$), i.e. the
so-called  normal  flow.  More  graphically,  it  means  that 
if  a  feature  (for  example,  an  edge  segment)  in  the  im¬ 
age  moves  to  a  new  position,  we  don’t  know  where  every 
point  of  the  segment  moved  to  (see  Figure  1);  we  only 
know  the  normal  flow,  i.e.  the  projection  of  the  flow  on 
the  image  gradient  at  that  point. 
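To make the constraint concrete, the following Python sketch computes the normal flow at a single pixel from the three derivatives $I_x$, $I_y$, $I_t$. It uses only the brightness constraint $I_x u + I_y v + I_t = 0$. The function name `normal_flow`, the gradient threshold and the sample values are hypothetical.

```python
import math


def normal_flow(ix, iy, it, min_grad=1e-6):
    """Normal flow vector at one pixel from the derivatives I_x, I_y, I_t.

    From I_x*u + I_y*v + I_t = 0, the flow component along the unit
    gradient direction has signed magnitude -I_t / |grad I|.
    Returns None where the gradient is too weak to define a direction.
    """
    g = math.hypot(ix, iy)
    if g < min_grad:
        return None
    mag = -it / g
    return (mag * ix / g, mag * iy / g)


# A vertical edge (gradient along x) moving with true flow (1.0, 0.5).
ix, iy = 2.0, 0.0
true_u, true_v = 1.0, 0.5
it = -(ix * true_u + iy * true_v)   # brightness constraint gives I_t
nf = normal_flow(ix, iy, it)        # only the x-component is recovered
flat = normal_flow(0.0, 0.0, 0.3)   # no gradient: normal flow undefined
```

The vertical component of the true flow is invisible at this edge, which is the aperture problem of Figure 1 in numerical form.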

A  second  reason  has  to  do  with  the  very  essence  of 
optic  flow.  An  optic  flow  field  is  the  vector  field  of  ap¬ 
parent  velocities  that  are  associated  with  the  variation 
of  brightness  on  the  image  plane.  Clearly,  the  scene  is 
not  involved  in  this  definition.  One  would  hope  that 
optic  flow  would  be  equivalent  to  the  so-called  motion 
field  [19],  which  is  the  (perspective)  projection  on  the 
image plane of the three-dimensional velocity field associated
with each point of the visible surfaces in the
scene.  However,  the  optic  flow  field  and  the  motion  field 
are  not  equal  in  general.  Verri  and  Poggio  [36]  reported 
some  general  results  in  an  attempt  to  quantify  the  differ¬ 
ence  between  the  optic  flow  and  motion  fields.  Although 
we  don’t  yet  have  necessary  and  sufficient  conditions  for 
the  equality  of  the  two  fields,  it  is  clear  that  they  are 
equal  only  under  specific  sets  of  restrictive  conditions. 

A third reason is related to the second step of the existing
algorithms for structure from motion. These algorithms
attempt to first recover three-dimensional motion
before they proceed to recover relative depth, and this
problem of 3-D motion appears to be very sensitive in
the presence of small amounts of noise in the input (flow
or displacements) [31, 38, 1, 2].

^Possible exceptions are photogrammetry and semiautonomous
applications requiring a human operator.

In  [31]  several  experiments  as  well  as  comparisons  with 
various  algorithms  were  made  and  the  finding  was  that 
an  average  error  of  1%  to  2%  in  the  input  (retinal  corre¬ 
spondence)  can  create  an  error  of  about  100%  in  the  es¬ 
timated  parameters.  An  important  question  to  ask  then 
is  what  makes  this  problem  unstable,  and  to  seek  ways 
to  address  any  inherent  instabilities  that  might  arise. 
There  is  recent  work  towards  this  direction  but  difficult 
questions  still  remain. 

But  while  theoretical  research  on  the  principles  of 
structure  from  motion  continues  in  its  quest  for  opti¬ 
mal  recovery,  we  can  also  follow  an  alternative  approach. 
We  can  ask  the  following  simple  question:  if  we  had  a 
robust  structure  from  motion  module,  what  would  we 
use  it  for?  The  answer  of  course  lies  in  a  taxonomy  of 
visual  tasks  involving  motion,  i.e.  navigational  tasks.  A 
few  such  generic  navigational  tasks  are,  for  example,  the 
following: 

•  Detection  of  independently  moving  objects  in  the  en¬ 
vironment,  by  a  moving  observer.  This  is  a  nontriv¬ 
ial  task,  as  everything  moves  on  an  image  obtained 
by  a  moving  observer,  thus  making  it  hard  to  dis¬ 
tinguish  independent  motion.  Although  many  gen¬ 
eral  schemes  have  been  proposed  for  segmentation 
of  a  flow  field  into  areas  corresponding  to  differently 
moving  objects,  there  are  still  problems  in  practi¬ 
cal  applications  involving  more  than  one  indepen¬ 
dently  moving  object.  Other  approaches  of  interest 
are  those  that  combine  measurements  of  flow  with 
some 3-D interpretation which can then be used for
incremental  improvements  to  segmentation  in  an  it¬ 
erative  manner.  However,  no  practical  robust  sys¬ 
tem  for  detecting  independently  moving  objects  in 
general  environments  and  based  on  optic  flow  has 
been  demonstrated  to  date. 

•  Passive  navigation.  Passive  navigation  is  a  term 
used  to  describe  the  processes  by  which  a  system 
can  determine  its  motion  with  respect  to  the  envi¬ 
ronment.  This  is  important  for  kinetic  stabilization 
which,  in  its  simplest  form,  requires  a  system  to 
maintain  a  fixed  position  and  attitude  in  space  in 
the  presence  of  perturbing  influences.  More  gener¬ 
ally,  stabilization  can  refer  to  any  conditions  placed 
on  the  motion  parameters;  for  instance,  the  sys¬ 
tem  might  be  required  to  translate  without  rotation. 
The  two  abilities  are  interrelated  because  stabiliza¬ 
tion  is  generally  achieved  by  bringing  the  motion 
parameters to certain specified values. The capacity
for  passive  navigation  is  prerequisite  for  any  other 
navigational  ability.  In  order  to  guide  the  system, 
some  idea  of  the  present  motion  and  some  method 
of  setting  it  to  known  values  must  be  available.  In 
present  robot  systems  the  necessary  information  is 
often  explicitly  available  as  a  result  of  a  built-in 
coordinate  system.  For  an  autonomously  moving 
system,  however,  there  must  be  an  active  sensing 
capacity. It is possible to obtain the required information
mechanically as is done by the inertial
guidance  systems  in  guided  missiles.  However,  the 
task  can  also  be  performed  by  visual  means  and  it 
is  this  problem  that  we  address  here. 

•  Obstacle  avoidance.  Obstacle  avoidance  refers,  sim¬ 
ply,  to  the  ability  to  utilize  sensory  information  to 
maneuver  in  an  environment  containing  physical  ob¬ 
jects  without  striking  them.  This  can  be  considered 
a  second-level  ability.  It  requires  some  capacity  for 
passive  navigation,  but  little  else,  and  could  thus 
be  considered  the  lowest  level  of  active  navigation. 
This  task  can  be  performed  non-visually  by  range 
sensing  methods,  and  it  has  been  generally  proposed 
that  the  problem  be  solved  visually  with  a  similar 
algorithm  utilizing  depth  data  from  a  scene  recon¬ 
structed  by  the  structure  from  motion  module. 

•  Avoidance  of  collision  with  a  moving  object.  A  ro¬ 
bust  structure  from  motion  module  can  detect  the 
3-D  motion  of  a  moving  object,  calculate  its  3-D  po¬ 
sition  with  the  aid  of  a  binocular  system  and  predict 
its  three-dimensional  trajectory.  Thus,  it  can  detect 
any  possibilities  for  collision,  by  reconstructing  the 
3-D  trajectory  of  the  moving  object. 

•  Understanding  of  relative  depth.  Visual  motion  pro¬ 
vides  a  very  rich  amount  of  information  about  the 
relative  depths  of  objects  in  the  environment  (which 
object is closer). Clearly, this is one of the outputs
of  the  structure  from  motion  module. 

•  Visual  pursuit.  A  three  dimensional  visual  pursuit 
system  consists  of  an  eye  (camera(s)),  a  subject, 
an  object  and  a  mind.  The  mind  uses  information 
from  the  eye  in  order  to  control  the  movement  of 
the  subject  so  that  it  will  collide  with  (intercept, 
catch)  the  object.  Under  the  traditional  paradigm 
of  considering  vision  as  a  recovery  problem,  visual 
pursuit  is  just  another  application  of  the  structure 
from  motion  module.  In  such  a  case,  the  camera 
would  reconstruct  the  three  dimensional  positions 
and  motions  of  the  camera,  the  subject  and  the  ob¬ 
ject  and  then  this  information  would  be  utilized  by 
a  planning  module  to  generate  correct  control  of  the 
subject. 

Given  the  lack  of  success  in  developing  a  robust  structure 
from  motion  module,  it  would  seem  reasonable  to  con¬ 
sider  simpler  problems.  There  are  visual  problems,  such 
as  the  above,  which  do  not  require  the  full  realization  of 
the  structure  from  motion  capability,  yet  which  are  both 
nontrivial  and  possess  the  sort  of  environmental  invari¬ 
ance  that  would  give  them  general  utility.  To  consider 
a  few  examples  from  biological  navigation,  the  housefly 
can  maneuver  visually  in  three  dimensions  in  a  complex 
environment  without  striking  obstacles;  a  number  of  bees 
and  wasps  can  recognize  and  return  to  a  particular  lo¬ 
cation  in  their  environment;  and  the  frog  can  extend  its 
tongue  and  catch  flying  insects.  Human  beings  can  also 
perform  such  tasks,  but  obviously  they  can  be  performed 
with far less computational equipment than humans possess.
We propose here to consider, in the context of navigational
tasks, some of the above problems, more specific
and more restricted than the general structure from motion
problem, with a view towards producing examples
of  visual  systems  that  have  the  potential  for  robustness. 
This  approach  is  termed  purposive  [4]. 

We  show  later  that  specific  questions  such  as  the  ones 
above  can  be  answered  without  having  to  go  through  the 
estimation  of  optic  flow.  The  derivatives  of  the  image  in¬ 
tensity  function  are  enough  for  the  task.  The  approach 
taken in this paper calls for the solution of specific visual
tasks, such as the ones above, in such a way that the
solution  does  not  have  more  power  than  it  is  supposed 
to  have.  For  example,  the  procedure  that  provides  rela¬ 
tive  depth  is  designed  only  for  that  purpose  and  cannot 
be  used  to  solve  the  passive  navigation  problem,  or  the 
problem  of  3-D  motion  estimation.  Of  course,  if  infor¬ 
mation  about  3-D  motion  is  known,  it  can  be  effectively 
utilized  in  the  estimation  of  relative  depth,  but  this  is 
of  no  concern  to  us  here.  When  building  a  system  that 
can  deal  with  visual  motion  problems,  we  can  visualize 
it  as  consisting  of  many  processes  working  in  a  coopera¬ 
tive  manner  to  solve  various  problems.  For  example,  the 
theories  described  in  this  paper  could  be  used  to  design 
a  process  that  computes  relative  depth  from  image  mea¬ 
surements,  independently  of  the  process  that  computes 
3-D  motion.  However,  after  a  number  of  computational 
steps,  when  results  about  relative  depth  and  3-D  motion 
become  available  from  the  two  independent  processes, 
they  can  be  exchanged  and  the  constraints  relating  them 
can  be  effectively  utilized  so  that  the  results  are  as  con¬ 
sistent  as  possible. 


2 Qualitative Methods

Most visual navigation tasks, including the ones described
above, have been considered to be subproblems
for the reconstructive school. The connection is a natural
one since most of these tasks involve shape and distance
relationships between the system and the environment
which can be expressed in terms of the quantitative idiom
of the reconstructive school. This perception has
tended to discourage explicit research on such specific
problems by classifying them as special cases of an important
general problem. It has also tended to obscure
the fact that many of the operations necessary to implement
specific visual tasks can be expressed in qualitative
terms which are more aptly described in terms of the
recognition idiom. Consider, for example, the problem
of passive navigation. It is not necessary to know exactly
how the system is moving with respect to the environment
but only whether it is rotating or translating at
all, and if so, in what direction forces must be applied
to reduce the motion. In the case of obstacle avoidance,
the most relevant information is not the exact distance
in centimeters from the observer to each point in the
environment, but whether the observer is on a collision
course with a nearby obstacle and if so, in which direction
it should move to avoid the danger of a crash. The
common factor in these examples is that they do not require
precise quantitative information and that in each
case, the information necessary to carry out the task can
be represented in a space having only a few degrees of
freedom.

3 Organization of the Paper

We  wish  to  develop  the  mathematics  that  will  give  rise  to 
general  solutions  to  the  specific  problems  of  detection  of 
independent  motion,  passive  navigation,  relative  depth 
estimation,  obstacle  avoidance,  estimating  whether  an 
object  is  on  a  collision  course  with  the  observer,  and  vi¬ 
sual  pursuit  using  the  derivatives  of  the  image  as  input, 
as  opposed  to  considering  them  as  applications  of  the 
structure  from  motion  module.  Section  4  is  devoted  to 
the  description  of  the  input  and  Sections  5-9  describe 
general  solutions  to  the  specific  tasks  mentioned  above. 
Finally,  Section  10  is  devoted  to  the  presentation  of  some 
experimental  results.  It  should  be  pointed  out  that  here 
we  are  mostly  interested  in  the  theoretical  principles  be¬ 
hind  these  perceptual  processes  and  the  geometry  of  the 
normal  flow.  We  seek  solutions  that  have  uniqueness 
properties  using  normal  flow  as  input,  since  normal  flow 
is  well  defined,  while  optic  flow  is  not.  Thus,  we  only 
present  the  computational  theory  behind  each  process. 
For  various  properties  of  the  solutions  of  the  individual 
problems,  a  theoretical  error  analysis  and  an  extensive 
implementation,  see  [16,  18,  22,  29].  It  will  become  clear 
that  solving  the  abovementioned  problems  using  normal 
flow  (which  contains  less  information  than  optic  flow)  be¬ 
comes  possible  only  through  the  employment  of  an  active 
visual  agent  [5].  The  reason  is,  of  course,  that  some  of 
the  computational  burden  is  transferred  to  the  activity 
of  the  agent. 

4  The  Input 

Our  motivation  is  by  now  clear.  We  wish  to  avoid  using 
optic  flow  as  the  input  to  visual  motion  tasks.  On  the 
other hand, we must utilize some description of the image
motion. As such a description we choose the spatial and
temporal derivatives $\partial I/\partial x$, $\partial I/\partial y$,
$\partial I/\partial t$ of the image intensity function $I(x, y, t)$.
These quantities define the normal flow at every point, i.e. the
projection of the optic flow on the direction of the gradient
$(I_x, I_y)$. Clearly, estimating the normal flow is much easier
than estimating the actual optic flow. But how is normal flow
related to the three-dimensional motion field? Is the normal optic
flow field equal to the normal motion field, and under what
conditions? This question was first investigated by
Verri and Poggio [36].

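Let $I(x, y, t)$ denote the image intensity, and consider
the optic flow field $\vec v = (u, v)$ and the motion field
$\tilde{\vec v} = (\tilde u, \tilde v)$ at a point $(x, y)$ where the
local (normalized) intensity gradient is
$\vec n = (I_x, I_y)/\sqrt{I_x^2 + I_y^2}$. The normal
motion field at point $(x, y)$ is by definition
$\tilde v_n = \tilde{\vec v} \cdot \vec n$, or

$$\tilde v_n = \frac{1}{\|\nabla I\|}\,\nabla I \cdot \tilde{\vec v} = \frac{1}{\|\nabla I\|}\left(\frac{dI}{dt} - \frac{\partial I}{\partial t}\right).$$

Similarly, the normal optic flow [21] is

$$v_n = -\frac{1}{\|\nabla I\|}\,\frac{\partial I}{\partial t}.$$

Thus

$$\tilde v_n - v_n = \frac{1}{\|\nabla I\|}\,\frac{dI}{dt}.$$

Figure 2: The active observer.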

From this equation it follows that if the change of intensity
of an image patch during its motion ($dI/dt$) is small
enough (which is a reasonable assumption) and the local
intensity gradient has a high magnitude, then the
normal optic flow and motion fields are approximately
equal.  Thus,  provided  that  we  measure  normal  flow  in 
regions  of  high  local  intensity  gradients,  the  normal  flow 
measurements  can  safely  be  used  for  inferring  3-D  struc¬ 
ture  and  motion. 
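
As a minimal sketch of this input stage (not the authors' implementation), the normal flow at one pixel can be computed directly from the three derivatives. The function name `normal_flow` and the gradient threshold `min_grad` are hypothetical, and the derivatives are assumed to have been estimated elsewhere.

```python
import math


def normal_flow(ix, iy, it, min_grad=1.0):
    """Return the normal flow vector at one pixel, or None if unreliable.

    ix, iy, it are the spatial and temporal intensity derivatives.
    min_grad is a hypothetical threshold on the gradient magnitude.
    """
    grad = math.hypot(ix, iy)
    if grad < min_grad:
        return None  # weak gradient: the measurement is not trusted
    speed = -it / grad  # signed normal flow, -I_t / |grad I|
    # The flow vector points along the unit gradient direction.
    return (speed * ix / grad, speed * iy / grad)
```

For example, derivatives (3, 4, -10) give a gradient magnitude of 5 and a normal flow of magnitude 2 along the gradient, while a pixel with a gradient below the threshold is skipped.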

We  are  now  ready  to  describe  our  solution  to  the  vari¬ 
ous  motion  related  tasks.  Since  the  input  to  the  percep¬ 
tual  process  is  the  normal  flow,  and  the  normal  flow  field 
contains  less  information  than  the  motion  field,  in  order 
to  solve  various  problems  we  need  to  transfer  much  of  the 
computation  to  the  activity  of  the  observer  [5].  A  geo¬ 
metric  model  of  the  observer  is  given  in  Figure  2.  Notice 
that  the  camera  is  resting  on  a  platform  (“neck”)  with 
six  degrees  of  freedom  (actually  only  one  of  the  degrees 
is  used)  and  the  camera  can  rotate  around  its  x  and  y 
axes  (saccades). 

5  Passive  Navigation 

5.1  A  qualitative  solution 

The problem of passive navigation (kinetic stabilization)
has attracted a lot of attention in the past ten years
[13, 23, 24, 34, 31, 33] because of the generality of a
potential solution. The problem has been formulated as
follows: Given a sequence of images taken by a monocular
observer undergoing unrestricted rigid motion in a
stationary environment, recover the 3-D motion of
the observer. In particular, if $(U, V, W)$ and $(A, B, C)$
are the translation and rotation, respectively, comprising
the general rigid motion of the observer, the problem
is to recover the following five numbers: the direction of
translation $(U/W, V/W)$ and the rotation $(A, B, C)$ (see
Figure 3). The problem has thus been formulated as the
general  3-D  motion  estimation  problem  (kinetic  depth 
or  structure  from  motion)  and  its  solution  would  solve 
several  other  problems. 

Consider a model for a monocular observer as in Figure 3.
We assume that the observer moves forward. It
should be noted that the observer is equipped with inertial
sensors which provide the rotation (A, B, C) of
the  observer  at  any  time.  As  the  observer  moves  in 
its  environment,  normal  flow  fields  are  computed  in  real 
time.  Since  optic  flow  due  to  rotation  does  not  depend 
on  depth  but  on  image  position  (x,  y),  we  know  (and 
can compute in real time) its value $(u_R, v_R)$ at every
image point along with the normal flow.^ That means
that  we  know  the  geometrical  locus  of  the  optic  flow  due 
to  translation  (see  Figure  4).  Since  the  observer  moves 
forward  in  a  static  scene,  it  is  approaching  anything  in 
the  scene  and  the  flow  is  expanding.  From  Figure  4,  it  is 
clear that the focus of expansion (FOE) $(U/W, V/W)$ (when
the gradient space of directions $(U/W, V/W)$ is superimposed
with the image space) lies in the half plane defined by
line ε. Clearly, at every point we obtain a constraint line
which  constrains  the  FOE  to  lie  in  a  half  plane.  If  the 
FOE  lies  on  the  image  plane  (i.e.  the  direction  of  trans¬ 
lation  is  anywhere  inside  the  sector  OABCD  (Figure  5)) 
then  the  FOE  is  constrained  to  lie  in  an  area  on  the  im¬ 
age  plane  and  thus  it  can  be  localized  (see  Figure  6). 
When  the  FOE  does  not  lie  inside  the  image,  a  closed 
area  cannot  be  found,  but  the  votes  collected  by  the  half 
planes  indicate  its  general  direction.  In  this  case  the  ob¬ 
server,  with  a  “saccade”  (a  rotation  of  the  camera),  can 
bring  the  FOE  inside  the  image  and  localize  it  (Figure  7 
explains  the  process). 


^If  computation  of  normal  flow  at  some  points  is  unreli¬ 
able,  we  just  don’t  compute  normal  flow  there. 


Figure  5:  Consider  the  camera  coordinate  system.  If  the 
translation  vector  (U,  V,  W)  is  anywhere  inside  the  solid 
OABCD  defined  by  the  nodal  point  of  the  eye  and  the 
boundaries  of  the  image,  then  the  FOE  is  somewhere  on 
the  image. 


Figure 6: (a) From a measurement u of the normal flow
due to translation at a point (x, y) of the image, every
point of the image belonging to the half plane defined by
ε that does not contain u is a candidate for the position
of the focus of expansion, and collects one vote. The
voting is done in parallel for every image measurement.
(b) If the FOE lies within the image boundaries, then
the area containing the highest number of votes is the
area containing the FOE. Using only a few measurements
can result in a large area. Using many measurements (all
possible) results in a small area (in our experiments an
area of at most three or four pixels).

Figure 7: (a) If the area containing the highest number of
votes has a piece of the image boundary as part of its
boundary, then the FOE is outside the image plane (b). (b) The
position of the area containing the highest number of votes
indicates the general direction in which the translation vector
lies. (c) The camera (“eye”) rotates so that the area
containing the highest number of votes becomes centered. With a
rotation around the x and y axes only, the optical axis can be
positioned anywhere in space. The process stops when the
highest vote area is entirely inside the image.
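
The half-plane voting described above can be sketched as follows. This is an illustrative brute-force version on a pixel grid, assuming the normal flow has already been derotated; the function names and the grid representation are hypothetical and not taken from the paper.

```python
def vote_for_foe(measurements, width, height):
    """Accumulate half-plane votes for the FOE on a width x height grid.

    measurements: list of ((x, y), (ux, uy)) pairs, where (ux, uy) is the
    derotated (translational) normal flow measured at pixel (x, y).
    Returns the vote grid and the set S of pixels with the highest count.
    """
    votes = [[0] * width for _ in range(height)]
    for (x, y), (ux, uy) in measurements:
        for py in range(height):
            for px in range(width):
                # The candidate lies in the half plane that does not
                # contain the normal flow vector.
                if ux * (px - x) + uy * (py - y) < 0:
                    votes[py][px] += 1
    best = max(max(row) for row in votes)
    area = {(px, py) for py in range(height) for px in range(width)
            if votes[py][px] == best}
    return votes, area


def touches_border(area, width, height):
    """True when S shares a pixel with the image boundary."""
    return any(px in (0, width - 1) or py in (0, height - 1)
               for px, py in area)
```

With an expanding flow and non-zero measurements, the true FOE pixel receives a vote from every measurement, so it always belongs to the returned area. `touches_border` corresponds to the situation of Figure 7, where a saccade is needed.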

5.2  The  algorithm 

We  assume  that  the  computation  of  the  normal  flow, 
the  voting  and  the  localization  of  the  area  contain¬ 
ing  the  highest  number  of  votes  can  happen  in  real 
time.  In  this  paper  we  don’t  get  involved  with  real 
time  implementation  issues  as  we  wish  to  analyze  the 
theoretical  aspects  of  the  technique.  However,  it  is 
quite  clear  that  computation  of  normal  flow  can  hap¬ 
pen  in  real  time  (there  already  exist  chips  performing 
edge  detection).  According  to  the  literature  on  Hough 
transforms  and  connectionist  networks  [9],  voting  could 
also happen in real time. Let $S$ denote the area with
the highest number of votes. Let $L(S)$ be a Boolean
function that is true when the intersection of $S$ with
the image boundary is the null set, and false otherwise.
Then the following algorithm finds area $S$. We
assume  that  the  inertial  sensors  provide  the  rotation 
and  thus  we  know  the  normal  flow  due  to  translation. 

begin {
    find area S
    repeat until L(S) {
        rotate camera around x, y axes so that the optical
        axis passes through the center of S (saccade)
        find area S
    }
    output S
}

If  the  camera  has  a  wide  angle  lens,  then  image  points 
can  represent  many  orientations,  and  only  one  saccade 
may be necessary. But if we have a small angle lens,
then we may have to make more than one saccade.*
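
The control loop of the algorithm can be sketched as below. The three callbacks (`find_area`, `touches_border`, `saccade`) are hypothetical stand-ins for the voting stage, the test that is the negation of L(S), and the camera rotation. The cap `max_saccades` is an added safeguard that does not appear in the original algorithm.

```python
def localize_foe(find_area, touches_border, saccade, max_saccades=10):
    """Repeat saccades until the highest-vote area is inside the image.

    find_area() returns the current highest-vote area as a set of pixels.
    touches_border(area) is True while the area meets the image boundary.
    saccade(center) rotates the camera so the optical axis passes
    through the given image point.
    """
    area = find_area()
    count = 0
    while touches_border(area) and count < max_saccades:
        # Center of the area, used as the saccade target.
        cx = sum(p[0] for p in area) / len(area)
        cy = sum(p[1] for p in area) / len(area)
        saccade((cx, cy))
        area = find_area()
        count += 1
    return area
```

A narrow field of view simply makes the loop run more often, which matches the remark about small angle lenses.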

5.3  Improvement  of  the  solution 

It  is  clear  that  the  technique  just  described  provides  as 
an  answer  an  area  on  the  image  containing  the  FOE. 
How large or small this area can be depends on the
distribution of surface markings and thus on the measured
normal flow. If the FOE lies in a featureless area, the
resulting area will not be small. For some applications the
knowledge of area $S$ might be enough to accomplish the
task. We can, however, narrow it down to a more accurate
solution, with $S$ providing one constraint.

Assuming that inertial sensors provide us with the
rotation, we can derotate the normal flow field. Thus,
assuming a translational normal flow field $v_n(x, y)$, we
have $v_n = u\,n_x + v\,n_y$, where $(u, v)$ is the optic flow
and $(n_x, n_y)$ the direction of the gradient at that point.
Since we have derotated, the optic flow is

$$(u, v) = \left(\frac{U - xW}{Z},\; \frac{V - yW}{Z}\right),$$

so that

$$v_n\,\frac{Z}{W} = n_x\left(\frac{U}{W} - x\right) + n_y\left(\frac{V}{W} - y\right).$$

This is a linear equation in the FOE $(U/W, V/W)$ and the
time to collision $Z/W$ with every scene point.

If we consider a small image patch $P$ with $Z_{av}$ the
average depth of the scene points giving rise to the patch
under consideration, then the above equation, for every
point $(x_i, y_i) \in P$ with depth $Z_i$, can be written as

$$\frac{n_x}{v_n}\frac{U}{W} + \frac{n_y}{v_n}\frac{V}{W} - \frac{Z_{av}}{W} = \frac{x_i n_x + y_i n_y}{v_n} + \left(\frac{Z_i}{W} - \frac{Z_{av}}{W}\right).$$

The expected value of the last term in the above equation
is zero, and assuming that we can correctly compute
$(n_x, n_y)$ and $v_n$, the equations

$$\frac{n_x}{v_n}\frac{U}{W} + \frac{n_y}{v_n}\frac{V}{W} - \frac{Z_{av}}{W} = \frac{x\,n_x + y\,n_y}{v_n}$$

at every point $(x, y) \in P$ constitute a linear system in
the unknowns $U/W$, $V/W$ and $Z_{av}/W$. Solving such systems for
several patches robustly provides the FOE and a hazard
map (showing different time-to-collision values). The
patches need at least three normal flow measurements,
and so they can be quite small.
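
A per-patch solution of this system can be sketched with ordinary least squares. The sketch assumes a derotated flow of the form (u, v) = ((U - xW)/Z, (V - yW)/Z) and multiplies each equation by v_n to avoid dividing by small normal flow values, which is a choice made here and not in the paper. The function names are hypothetical.

```python
def solve3(m, rhs):
    """Solve a 3x3 linear system by Gaussian elimination with pivoting."""
    a = [row[:] + [r] for row, r in zip(m, rhs)]
    for col in range(3):
        piv = max(range(col, 3), key=lambda r: abs(a[r][col]))
        a[col], a[piv] = a[piv], a[col]
        for r in range(col + 1, 3):
            f = a[r][col] / a[col][col]
            for k in range(col, 4):
                a[r][k] -= f * a[col][k]
    x = [0.0] * 3
    for r in (2, 1, 0):
        s = a[r][3] - sum(a[r][k] * x[k] for k in range(r + 1, 3))
        x[r] = s / a[r][r]
    return x


def fit_patch(samples):
    """Least-squares estimate of (U/W, V/W, Z_av/W) for one patch.

    samples: list of (x, y, nx, ny, vn) with at least three entries.
    Each sample contributes nx*a + ny*b - vn*c = nx*x + ny*y.
    """
    ata = [[0.0] * 3 for _ in range(3)]
    atb = [0.0] * 3
    for x, y, nx, ny, vn in samples:
        row = (nx, ny, -vn)
        rhs = nx * x + ny * y
        for i in range(3):
            atb[i] += row[i] * rhs
            for j in range(3):
                ata[i][j] += row[i] * row[j]
    return solve3(ata, atb)
```

The third unknown is constrained only by the variation of image position across the patch, so very small patches give a poorly conditioned system. A robust estimator over several patches, as the text suggests, would replace the plain least squares used here.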

5.4  Analysis  of  the  method 

We  have  assumed  that  the  inertial  sensors  will  pro¬ 
vide  the  observer  with  accurate  information  about  ro¬ 
tation.  Although  expensive  accelerometers  can  achieve 
very  high  accuracy,  the  same  is  not  true  for  inexpen¬ 
sive  inertial  sensors  and  so  we  are  bound  to  have  some 
error. Thus we must assume that some unknown rotational
part still exists and contributes to the value of the
normal flow. As a result, the method for finding the FOE
(previous section), which is based on translational normal
flow information (since we have “derotated”), might be
affected by the presence of some rotational flow. In this
section, we study the effect of rotation (the error of the
inertial sensor) on the technique for finding the FOE. At
the same time we provide a technique for bounding the
FOE given a normal flow field containing both rotation
and translation.

*Up to this point the algorithm is similar to [20]. However,
as will become clear later, it works even when rotation is
present, while in [20] the solution works only for translational
motion.

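In order to avoid artificial problems introduced by
perspective distortions in the case of a planar retina and to
simplify the formulas without loss of generality, we
employ a spherical retina. Let a sphere with radius $f$ and
center $O$ (Figure 8) represent the spherical retina (with
$O$ the nodal point of the eye) and a coordinate system
$OXYZ$ attached to it. Let

$\vec r_w = (X, Y, Z)$ be a world point,
$\vec r = (x, y, z)$ be its image on the retina.

Then

$$\vec r = f\,\frac{\vec r_w}{R}, \qquad R = \|\vec r_w\| = \sqrt{\vec r_w^{\,T}\,\vec r_w}.$$

In the sequel we derive expressions for optic (normal)
flow in the new configuration.

If the velocity of the world point $\vec r_w$ is given by

$$\dot{\vec r}_w = -\vec t - \vec\omega \times \vec r_w$$

where $\vec t$ is translation ($\vec t = (U, V, W)$) and
$\vec\omega$ is rotation ($\vec\omega = (\omega_x, \omega_y, \omega_z)$),
then

$$\dot R = \frac{1}{R}\,(\vec r_w \cdot \dot{\vec r}_w) = -\frac{1}{R}\,(\vec t \cdot \vec r_w).$$

We have

$$\dot{\vec r} = f\,\frac{\dot{\vec r}_w}{R} - f\,\frac{\vec r_w\,\dot R}{R^2}$$

or

$$\dot{\vec r} = \frac{1}{R}\left[-f\,\vec t + \frac{(\vec t \cdot \vec r)\,\vec r}{f}\right] - \vec\omega \times \vec r.$$

Thus, the translational flow is

$$\vec u_t = \frac{1}{R}\left[-f\,\vec t + \frac{(\vec t \cdot \vec r)\,\vec r}{f}\right]$$

while the rotational flow is given by

$$\vec u_R = -\vec\omega \times \vec r.$$

Without loss of generality we can set $f = 1$.

At this point we define two quantities that will be of
use later. They are $\tau = R/\|\vec t\|$, which we term time to
collision, and $k = \|\vec\omega\|\,R/\|\vec t\| = \|\vec\omega\|\,\tau$,
which represents the effective ratio of rotation and translation.

Figure 9: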
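
As a numerical illustration of the spherical model, the sketch below evaluates the translational and rotational flow on a unit sphere (f = 1) for one world point, together with the two derived quantities. It assumes the flow decomposition u_t = (1/R)[-t + (t.r) r] and u_R = -w x r that follows from a world-point velocity of -t - w x r_w. The function names are hypothetical.

```python
import math


def dot(a, b):
    return sum(p * q for p, q in zip(a, b))


def cross(a, b):
    return (a[1] * b[2] - a[2] * b[1],
            a[2] * b[0] - a[0] * b[2],
            a[0] * b[1] - a[1] * b[0])


def spherical_flow(r_w, t, w):
    """Flow on the unit sphere induced at world point r_w.

    t is the observer translation and w its rotation. Returns the image
    point r, the translational flow, the rotational flow, the time to
    collision tau = R/|t| and the ratio k = |w| * tau.
    """
    R = math.sqrt(dot(r_w, r_w))
    r = tuple(c / R for c in r_w)
    t_dot_r = dot(t, r)
    u_t = tuple((-t[i] + t_dot_r * r[i]) / R for i in range(3))
    u_rot = tuple(-c for c in cross(w, r))
    tau = R / math.sqrt(dot(t, t))
    k = math.sqrt(dot(w, w)) * tau
    return r, u_t, u_rot, tau, k
```

Both flow components are tangent to the sphere at r, and only the translational part depends on the distance R.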

The  geometry  of  the  spherical  projection  is  then  given 
in Figure 9. It has been shown [28] that a full (360°)
visual field simplifies motion analysis. However, what we
usually have is just a piece of the surface of the sphere
(due to a limited field of view). Assume then that the
image (the part that we see) is projected on the surface
patch $S$. Obviously, voting for the estimation of the FOE
can be performed for all points on $S$.

5.4.1  Principles  of  voting 
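Consider

$\vec r_i = (x, y, z)$, a point in $S$,
$\vec n_i = (n_x, n_y, n_z)$, the image gradient direction at point $\vec r_i$,
$\dot{\vec r}_i = \vec u_i = (u_x, u_y, u_z)$, the flow at point $\vec r_i$, and
$\vec u_i^{\,n} = (\vec n_i \cdot \vec u_i)\,\vec n_i$, the normal flow at $\vec r_i$.

Then (see Figure 10) if $\vec r = (x, y, z)$ is a point in $S$, a
feature point $\vec r_i$ will vote for $\vec r$ being the FOE (direction
of translation) iff $\vec u_i^{\,n} \cdot (\vec r - \vec r_i) < 0$.

If $V[\vec r]$ represents the number of votes collected at
point $\vec r$, then it is easy to see that

$$V[\vec r] = \sum_{\vec r_i \in S} U\!\left(-\vec u_i^{\,n} \cdot (\vec r - \vec r_i)\right),$$

where

$$U(x) = \begin{cases} 1, & x > 0 \\ 0, & x \le 0 \end{cases} \qquad \text{(Heaviside function)}$$

and

$$\vec u_i = \frac{1}{R}\left[-\vec t + (\vec t \cdot \vec r_i)\,\vec r_i\right] - \vec\omega \times \vec r_i.$$

Let $S' = \{\vec r \mid \forall \vec r' \in S,\; V[\vec r] \ge V[\vec r']\}$ be the set of points
that have acquired the maximum number of votes. There
are two cases:

Figure 10: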


Case  1:  S'  does  not  intersect  the  border  of  S,  in  which 
case  the  FOE  is  in  S'. 

Case  2:  S'  touches  the  border  of  S,  in  which  case  the 
FOE  could  be  outside  S. 
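
The vote count and the maximum-vote set can be sketched for a finite list of candidate directions on the sphere. The representation of candidates and features as plain 3-tuples, and the function names, are assumptions made for this illustration.

```python
def count_votes(candidates, features):
    """Votes collected by each candidate direction r on the sphere.

    features: list of (r_i, un_i), the feature point and its normal flow
    (both 3-vectors). A feature votes for r when un_i . (r - r_i) < 0.
    """
    totals = []
    for r in candidates:
        v = 0
        for r_i, un in features:
            if sum(un[k] * (r[k] - r_i[k]) for k in range(3)) < 0:
                v += 1
        totals.append(v)
    return totals


def max_vote_set(candidates, features):
    """Return the candidates that collected the maximum number of votes."""
    totals = count_votes(candidates, features)
    best = max(totals)
    return [r for r, v in zip(candidates, totals) if v == best]
```

Under pure translation with non-zero normal flow, the direction of translation receives a vote from every feature, so it always belongs to the returned set. Deciding between the two cases then only requires checking whether that set reaches the border of the visible patch.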

It  should  be  clear  that  if  there  is  no  rotation,  then  S' 
will  always  contain  the  FOE  or  give  the  direction  of  the 
FOE, i.e. the direction towards which we need to
rotate. The size of $S'$ depends on the distribution of
features.

We  can  investigate  the  performance  of  the  voting 
scheme  in  the  presence  of  rotation.  In  particular  we  can 
ask how large area $S'$ is when rotation is present. It has
been shown that this depends on the angle $\theta_\omega$ between
the direction of translation and the axis of rotation as
well as on the rotation-to-translation ratio $k$. In
particular, $\theta_\omega$ distorts area $S'$ and $k$ enlarges it as it grows.
The interested reader can consult [16].

5.4.2  Correctness  of  voting 

The  normal  flow  (as  well  as  the  actual  flow)  is  very 
small  in  the  region  close  to  the  FOE,  and  in  the  directions 
close  to  orthogonal  to  the  directions  of  the  flow.  Conse¬ 
quently,  even  when  only  translation  is  present,  in  order 
to  avoid  inaccuracies  that  might  arise  in  the  estimated 
direction  of  the  normal  flow — numerical  manipulation  of 
very  small  quantities  is  unstable — we  are  going  to  dis¬ 
card  any  normal  flow  whose  magnitude  is  less  than  some 
threshold  Tt.  Later,  it  will  turn  out  that  choosing  this 
threshold  greatly  facilitates  the  geometrical  analysis  of 
the technique. Considering an actual flow $\vec u$ at a point
$A$ (see Figure 11) we can compute the locus of gradient
directions $\vec n$ along which the normal flow (i.e. the
projection of $\vec u$ on $\vec n$) is bigger than the threshold $T_t$. In
Figure 11 they are all directions inside angle $BAC$
defined by $\theta_0 = \arccos(T_t/\|\vec u\|)$ for $T_t/\|\vec u\| \le 1$, or there are no
such directions for $T_t/\|\vec u\| > 1$.

We  now  develop  a  condition  that  needs  to  be  satisfied 
in  order  for  voting  at  a  point  to  be  correct  in  the  presence 
of  rotation. 

Voting  will  clearly  be  correct  only  if  the  direction  of 
the  translational  normal  flow  is  the  same  as  the  direction 
of  the  actual  normal  flow,  that  is  when 

$$(\vec n \cdot \vec u_t)(\vec n \cdot \vec u) > 0 \qquad (2)$$

In addition, since we consider only normal flows
greater than threshold, we need

$$|\vec n \cdot \vec u| > T_t \qquad (3)$$

Inequality (2) becomes

$$(\vec n \cdot \vec u_t)(\vec n \cdot \vec u) = (\vec n \cdot \vec u_t)(\vec n \cdot \vec u_t + \vec n \cdot \vec u_R) = (\vec n \cdot \vec u_t)^2 + (\vec n \cdot \vec u_t)(\vec n \cdot \vec u_R) > 0$$

So, if we set $T_t$ so that $|\vec n \cdot \vec u_R| \le T_t$, then there are two
possibilities: either $|\vec n \cdot \vec u|$ is below the threshold, in which case
it is of no interest to voting, or the sign of $\vec n \cdot \vec u$ is the
same as the sign of $\vec n \cdot \vec u_t$. In other words, if we can set
the threshold equal to the maximum value of the normal
rotational flow, then our voting will always be correct.
But at point $\vec r$ of the sphere the normal rotational flow satisfies

$$|\vec n \cdot \vec u_R| \le \|\vec n\| \cdot \|\vec u_R\| = \|\vec u_R\| = \|\vec\omega \times \vec r\| \le \|\vec\omega\|$$

Thus if we choose $T_t = \|\vec\omega\|$, then the sign of $\vec n \cdot \vec u$
(actual normal flow) is equal to the sign of $\vec u_t \cdot \vec n$
(translational normal flow) for any normal flow of magnitude
greater than $T_t$.
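
The thresholding rule can be illustrated with a small sketch on the unit sphere. The function `thresholded_sign` is a hypothetical helper that discards any normal flow whose magnitude does not exceed the norm of the rotation, so that every surviving sign agrees with the translational normal flow.

```python
def dot(a, b):
    return sum(p * q for p, q in zip(a, b))


def cross(a, b):
    return (a[1] * b[2] - a[2] * b[1],
            a[2] * b[0] - a[0] * b[2],
            a[0] * b[1] - a[1] * b[0])


def thresholded_sign(n, u, omega_norm):
    """Sign of the normal flow n.u, or 0 when the measurement is discarded.

    The threshold is the norm of the rotation, which bounds the
    rotational normal flow on the unit sphere.
    """
    s = dot(n, u)
    if abs(s) <= omega_norm:
        return 0  # could have been flipped by rotation: no vote
    return 1 if s > 0 else -1
```

A strong translational component keeps its sign after the rotational flow is added, whereas a weak one may be flipped by rotation and is therefore excluded from voting.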

5.4.3  The  case  of  dominant  rotation 

Although  the  technique  described  in  this  paper  was 
derived  to  solve  the  problem  of  kinetic  stabilization  it 
turns  out  that  it  has  general  applicability.  It  can  be 
modified  to  handle  the  case  of  dominant  rotation  with 
translation. 

Consider  a  pattern  of  optic  flow  in  the  case  of  pure 
rotation.  On  a  spherical  retina  the  optic  flow  will  corre¬ 
spond  to  vectors  tangent  to  the  circles  around  the  axis 
of  rotation  w.  The  point  at  which  the  axis  of  rotation 
passes  through  the  image  will  be  called  the  AOR.  If  there 
is  circular  optic  flow  in  the  image  (due  to  pure  rotation) 
the  center  of  all  the  circles  is  the  AOR.  If  we  take  an 
arbitrary optic flow vector $\vec u_R$ at the point $\vec r_i$ then we
can say that a point $\vec r$ is a candidate for the AOR if

$$(\vec r_i \times \vec u_R) \cdot \vec r < 0.$$

This  inequality  expresses  the  fact  that  the  feature  point 
and  the  flow  vector  at  the  point  span  the  plane  p  which 
cuts  the  sphere  in  two  hemispheres  where  one  contains  all 
possible  candidate  points  for  the  AOR  (and  all  of  them 
satisfy  the  previous  inequality).  Furthermore,  all  possi¬ 
ble  positions  of  the  AOR  lie  on  the  great  circle  which 
is  normal  (on  the  sphere)  to  the  great  circle  which  is 
the intersection of the plane $p$ and the image sphere. In
other words if we replace $\vec u_R$ with the normal flow $\vec u_R^{\,n}$
the inequality will still hold.

Very  similar  reasoning  applies  in  the  case  of  a  flat 
retina (perspective projection). Given an optic flow
$(u, v)$ at the feature point $(x_i, y_i)$, possible candidate
points for the AOR are on the right of the line passing
through $(x_i, y_i)$ and parallel to $(u, v)$. Furthermore,
they all lie on the line normal to $(u, v)$ and originating
at $(x_i, y_i)$. In other words candidate points $(x, y)$ for the
AOR satisfy the inequality

$$\left((u, v, 0) \times (x - x_i,\, y - y_i,\, 0)\right) \cdot (0, 0, 1) < 0.$$

This inequality indicates that the z component of the
vector product of the optic flow vector and the difference
of the candidate AOR point and the feature point must
be negative. As in the case of a spherical retina this holds
even when the optic flow $(u, v)$ is replaced by the normal
flow $(u^n, v^n)$. As was done in the case of translation,
voting  can  be  performed.  Points  with  maximum  votes 
are  candidates  for  the  AOR.  If  a  minimum  is  sought 
then  the  opposite  direction  will  be  found.  If  the  area 
is  closed  then  the  AOR  is  localized  as  before;  otherwise 
its  general  direction  will  be  indicated  by  the  area  with 
maximum  votes. 
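
The planar candidate test and the corresponding vote count can be sketched as follows. The function names are hypothetical, and the candidates are given as an explicit list of image points.

```python
def is_aor_candidate(flow, feature, point):
    """True when point lies on the right of the flow line at feature.

    Implements ((u, v, 0) x (x - xi, y - yi, 0)) . (0, 0, 1) < 0.
    """
    u, v = flow
    xi, yi = feature
    x, y = point
    return u * (y - yi) - v * (x - xi) < 0


def vote_for_aor(measurements, candidates):
    """Count, for each candidate point, the measurements that accept it.

    measurements: list of (feature, flow) pairs; the flow may be the
    optic flow or the normal flow.
    """
    return [sum(is_aor_candidate(flow, feat, c) for feat, flow in measurements)
            for c in candidates]
```

For a purely circulating flow whose center lies on the right of every flow vector, the center receives a vote from every measurement, and the test gives the same answer when a flow vector is replaced by its projection on a gradient direction.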

An  analysis  (on  a  spherical  retina)  similar  to  the 
one  performed  for  the  case  of  dominant  translation  can 
be performed again. This time, however, the threshold
should be set to $T_t = 1/\tau$ ($\tau$ is the time to collision). If the
magnitude of the normal flow is greater than $T_t$ then it
must have the same sign (and direction) as the rotational
normal flow.

When $\vec\omega$ and $\vec t$ are parallel the angular radius of the
uncertainty region is equal to $\theta_r$, where $\cot\theta_r =$
[right-hand side illegible in the source].

The difference in the angular radii of the uncertainty
areas around the FOE and the AOR is that the tangent is
replaced by the cotangent. When $\theta_\omega > 0$ the uncertainty
area around the AOR changes shape in a similar manner
as the uncertainty area around the FOE. It extends in
the direction $\vec\omega \times \vec t$ with the growth of $\theta_\omega$ and gets closer
to the AOR in the opposite direction.


6  Active  Detection  of  Independent 
Motion 

Among  the  more  significant  papers  devoted  exclusively 
to  detecting  moving  objects  is  [32]  by  Thompson  and 
Pong.  It  recognizes  the  difficulty  of  motion  detection 
using  only  visual  information  in  the  form  of  optic  flow, 
and  considers  additional  constraints  that  may  have  to 
be  applied  for  motion  detection,  e.g.  knowledge  about 
camera  motion,  moving  object  tracking,  and  information 
about  scene  depth.  Though  it  presents  a  good  discussion 
of  the  various  trade-offs  involved,  all  techniques  proposed 
still  depend  on  the  computation  of  the  optic  flow. 

The  two  approaches  that  are  closest  to  the  technique 
described  here  (emphasizing  qualitative  techniques  for 
particular  situations)  are  [17]  and  [27].  In  [17]  Bhanu 
et  al.  identify  a  fuzzy  FOE  (see  also  [14])  and  propose 
a  rule-based  qualitative  analysis  of  the  motion  of  scene 
points. However, this requires point correspondences
that are difficult to obtain in general and involves
considerable “high-level” (and hence expensive) reasoning,
which would seem to be inappropriate for the relatively
“low-level” task of motion detection. In [27] Nelson gives
motion detection techniques based on normal flow and
pattern recognition that can be used in situations when
the observer motion is specific, and when the object
motion changes rapidly in comparison with the changes in
camera motion (termed “animate vision”; see also [10]).

Figure 12: (a) If the observer translates along its optical
axis, then the normal flow field has the property that it
points away from the origin (FOE) at every point. This
normal flow field is as expected and it does not signify
independent motion (although it might exist). (b) There
exist values of the normal flow that do not point away
from the FOE. They are not as expected and thus signify
independent motion.

The  basis  of  the  technique  described  here  lies  in  de¬ 
viations  from  expectations.  If  the  observer  moves  in  a 
stationary  environment  then  he/she  expects  to  receive  a 
normal  flow  field  that  obeys  some  properties  (see  Fig¬ 
ure  12).  If  there  exist  independently  moving  visible 
objects  in  the  scene  then  some  of  these  properties  will 
not  hold  in  parts  of  the  normal  flow  image;  these  unex¬ 
pected  “anomalies”  signify  the  existence  of  independent 
motion.* However, it is possible that the normal flow
field  appears  as  expected  while  there  still  exists  indepen¬ 
dent  motion.  In  the  sequel  we  will  examine  the  problem 
in  more  detail. 

The motion field and hence the optic flow is due to
the motion of the observer (inducing a flow $\vec u^{eg}$) and
the motion of independent objects in view, inducing a
flow $\vec u^{ind}$. Then the normal flow at every point is
$u_n = u_n^{eg} + u_n^{ind}$, where $u_n^{eg}$, $u_n^{ind}$ are the normal components
of $\vec u^{eg}$ and $\vec u^{ind}$ respectively.

We  consider  the  case  where  the  motion  of  the  observer 
is  translation  (if  there  is  rotation,  the  observer’s  inertial 
sensors  can  provide  it;  then  we  can  derotate  the  normal 
flow  field  and  thus  consider  only  translation).  Also,  the 
previous  algorithm  (Section  5)  provides  the  FOE  (or  an 
area  containing  it).  To  simplify  the  exposition  we  first 
assume  that  the  FOE  is  a  known  point  but  we  can  easily 
generalize to the case where the FOE lies in an area $S$.

To make the image acquisition active, we assume that
the camera can be given a very small translation, whose
net result is a momentary shift of the FOE in the image.
We also assume that this can be done in a controlled
manner, so that the FOE can be moved to a desired
position (with a given accuracy). These small shifts are
called jitters.

*This principle of deviations from expectations and
anomalies is very powerful and can be used in many other
situations.

Figure 13:

The  engineering  basis  for  using  the  “jitter”  is  that  the 
shift  in  the  FOE  helps  in  motion  detection  (on  the  ba¬ 
sis  of  purely  geometric  considerations,  to  be  discussed), 
while  the  fact  that  it  causes  only  momentary  and  con¬ 
trolled  displacement  about  an  equilibrium  position  elim¬ 
inates  the  need  for  point  correspondence.  We  assume 
that  the  motors  responsible  for  moving  the  camera  have 
the  dynamic  control  capability  needed  for  producing  the 
jitter. We believe that with a suitably designed camera
system this should be possible.* We do not concern
ourselves here with the details of the motor control
involved but consider instead only the effect of the
resulting shifts of the FOE. Note that the “jitter” does not
affect the dominant motion of the observer (i.e. that of
the mobile platform). Thus its effect on the image flow
is only “additive”. That is, the image flow pattern due
to  the  egomotion  is  modified  by  the  addition  of  the  flow 
due  to  the  jitter  at  a  point.  This  constrains  the  nature 
of  the  changes  to  the  flow  that  can  be  brought  about,  as 
will  be  explained  later. 

6.1  The  computational  theory 

Consider  Figure  13,  which  represents  the  normal  flow  at 
two  points  A,  B  with  O  being  the  FOE.  Clearly,  if  the 
normal  flow  points  towards  the  FOE  (i.e.  the  FOE  lies 
in  the  half  plane  defined  by  the  line  normal  to  the  flow), 
then  this  particular  point  (B)  is  moving  independently  of 
the  sensor.  If,  however,  the  normal  flow  points  away  from 
the  FOE  (as  in  A),  this  could  be  due  to  egomotion  or 
to  a  combination  of  egomotion  and  independent  motion. 
Thus  further  constraints  need  to  be  applied  to  always 
be able to detect independent motion. At this juncture
additional information from the image sequence could
be used, for example, the value of $u_n$, but in accordance
with our goal of devising a strategy that uses only the
“sign” of $u_n$, we have to define some additional activity
that may make motion detection possible. It is easy
to see that the following conditions are necessary and
sufficient for detecting independent motion at a point
for a particular position of the FOE.

(a) $u_n^{ind}$ points toward the FOE.

(b) The length of $u_n^{eg}$ is less than the length of $u_n^{ind}$.

*After all, human eyes are perpetually active and can be
moved very efficiently with the help of extensive groups of
muscles. Any artificial system that purports to emulate
human performance, at least in achieving navigational goals
(for example), should have similar “active” capabilities [10].

In  general,  the  conditions  above  will  not  be  satisfied 
at  every  point  in  the  image.  The  only  “tool”  that  we 
allow  ourselves  at  this  point  is  shifting  of  the  FOE  by 
small  “jitters”  as  explained  earlier.  The  question  we  ask 
then  is,  what  exploratory  action  (a  sequence  of  shifts  of 
the  FOE)  will  guarantee  motion  detection  at  all  points  of 
the  image.  One  condition  that  any  exploratory  activity 
guaranteeing “completeness” will have to satisfy is
established by the following observation: If $O_1, O_2, \ldots, O_k$
is a sequence of new FOE locations (formed by an
exploratory action), and the convex hull of the set of points
$\{O_1, O_2, \ldots, O_k\}$ encloses the entire region of interest, then
complete detection of independent motion is guaranteed.
This  constitutes  a  necessary  condition  for  the  complete¬ 
ness  of  detection. 

We can also observe that if the region of interest is
the entire rectangular image, and the FOE is shifted at
least to the four corners of the image, then the necessary
condition for guaranteeing detection is satisfied. Up to
now we were mostly concerned with condition (a). Before
we establish conditions under which both conditions
(a) and (b) are satisfied, let us consider condition (b). If
it is violated, then the length of the normal flow due to
egomotion, $\mathbf{u}_n^{ego}$, is larger than the length of the
normal flow due to independent motion, $\mathbf{u}_n^{ind}$, or
$\|\mathbf{u}_n^{ego}\| > \|\mathbf{u}_n^{ind}\|$. Any exploratory action
(since we have no control over $\mathbf{u}_n^{ind}$) would attempt
to decrease $\|\mathbf{u}_n^{ego}\|$. The following two exploratory activities
attempt to satisfy condition (b).

If the point of interest is point $A(x_A, y_A)$ then we can
either move the FOE close to $A$, or decrease the angle
between the line connecting the FOE to point $A$ and
the gradient of the image at point $A$. The first action
decreases the flow due to egomotion (egomotion flow at
the FOE is zero) and the second action decreases the
normal flow due to egomotion.
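
The sign-based test of Section 6.1 can be sketched as follows. This is a minimal illustration, not the authors' implementation: the function names, the tuple representation of image vectors, and the idea of passing one (FOE, normal flow) pair per jitter position are assumptions made here. A point is flagged as soon as its measured normal flow has a positive component toward the FOE for at least one FOE position.

```python
from typing import Iterable, Tuple

Vec = Tuple[float, float]


def points_toward_foe(point: Vec, normal_flow: Vec, foe: Vec) -> bool:
    """True if the normal flow at `point` has a positive component toward `foe`.

    Equivalently, the FOE lies in the open half plane bounded by the line
    through `point` that is perpendicular to the normal flow.
    """
    to_foe = (foe[0] - point[0], foe[1] - point[1])
    return normal_flow[0] * to_foe[0] + normal_flow[1] * to_foe[1] > 0.0


def independently_moving(point: Vec,
                         observations: Iterable[Tuple[Vec, Vec]]) -> bool:
    """`observations` holds (foe, normal_flow) pairs, one per FOE position tried.

    Flow caused by egomotion alone never points toward the FOE, so a single
    positive test is enough to flag the point.
    """
    return any(points_toward_foe(point, flow, foe) for foe, flow in observations)
```

A negative result for every FOE position does not prove that the point is static; it only means that conditions (a) and (b) were not met for the positions tried.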

6.2  The  algorithm 

For  a  typical  robot  task,  detecting  the  motions  of  objects 
that  are  small,  distant,  or  slow  is  not  very  important.  On 
the  other  hand,  detecting  the  motions  of  objects  that  are 
large,  close,  or  fast  may  be  critical  for  the  robot,  and 
any  useful  motion  detection  strategy  should  guarantee 
the  detection  of  such  motions.^ 

^For example, for safe navigation a mobile robot needs
to detect any sharp changes in nearby objects that are large
enough to be important (e.g. another robot or a human that
may move across its path), while other moving objects may
not be of immediate interest if they are distant (e.g. they
will not affect the robot's planned path) or if they are too
small (e.g. a fly). Of course this may represent only a typical
scenario; under other circumstances or for other missions all
motion may be critically important, and it would be justified
to pay the cost, which consists of increased exploratory activity
(careful scanning of the scene) or a decrease in the overall
speed of the robot. An analogy from the biological world can
easily be made. When a deer or other animal senses danger it
slows down or even stops completely and looks around carefully,
alert to the slightest movement (and may "jump" even
for a falling leaf), whereas normally it is less sensitive to the
motions around it. It would be desirable to equip a robot
with a similar mechanism for motion detection that would
have a variable level of sensitivity.

There is obviously a trade-off involved between the activity
required and the parameters that describe the sensitivity
of motion detection under different conditions.
Computation of normal flow proceeds in real time. Points
whose normal flow points toward the FOE are classified as
moving independently. Additional activity by the observer
(moving the FOE at least to the four image boundaries)
may uncover additional independently moving points.

If we consider a point $A(x, y)$ on the image plane, then
the length of the egomotion flow $\mathbf{u}^{ego}$ is
$\|\mathbf{u}^{ego}\| = \frac{W}{Z} \cdot r$, where $r$ is the
distance of $A$ from the FOE. Assume further that the
detection paradigm is such that for the point of interest
$A$ at least one position of the FOE lies within distance
$d$ from $A$. That is, $r < d$ for point $A$. The length of the
corresponding egomotion flow is thus $\|\mathbf{u}^{ego}\| < \frac{W}{Z} \cdot d$ and
consequently the normal egomotion component obeys

$$\|\mathbf{u}_n^{ego}\| < \frac{W}{Z} \cdot d.$$

For independent motion to be detected, we need both
conditions (a) and (b) to be satisfied. One way to guarantee
that (a) will be satisfied is to move the FOE to
a new position so that $A$ will be inside the segment defined
by the two FOE positions. For condition (b) to be
satisfied we need (worst case) that

$$\frac{W}{Z} \cdot d < \|\mathbf{u}_n^{ind}\|$$

so that $\|\mathbf{u}_n^{ind}\| > \|\mathbf{u}_n^{ego}\|$,

or

$$d < \frac{Z}{W}\,\|\mathbf{u}_n^{ind}\|.$$

If the above inequality is satisfied, then we are guaranteed
to detect independent motion at point $A$.

In the above inequality, $d$ basically represents the cost
involved in detecting motion using an exploratory strategy
that guarantees detection. Obviously the cost of the
exploration decreases (i.e. $d$ increases) when the time
to collision with the environment is large (large depth,
small $W$). On the other hand, if $\|u\|$ is the smallest retinal
motion (due to independent 3-D motion) that can be
detected, then

$$d < \frac{Z}{W}\,\|u\|.$$

If $\|\mathbf{u}_n^{ind}\| < \|u\|$, then there is no guarantee of detection.

This formalizes the earlier intuitive discussion and also
indicates a way to control the performance of the motion
detection strategy. At any stage a higher precision (lower
$\|u\|$) can be achieved without changing the exploratory
action (parameterized by $d$), but by decreasing the dominant
speed of the robot.

When the purpose of motion detection is to serve as
an early warning system to detect independently moving
objects in the scene it is not necessary to guarantee the
detection of all moving (feature) points. The detection
of a few moving points (that satisfy some criteria, to
eliminate "false alarms") should suffice since it can trigger
a more detailed analysis (perhaps over a narrower
region), again depending on the task at hand. We show
how the cost of motion detection may be dramatically
reduced when the requirement is to guarantee detection
of a compact moving object of at least a minimum projected
size. In particular, the cost of the exploratory
activity can be linked to the minimum (image) diameter
of the objects of interest.

As  discussed  earlier,  it  may  be  reasonable  to  assume 
that  the  boundary  of  the  image  of  a  compact  object  in 
the  scene  forms  a  closed  contour.  In  particular,  this  im¬ 
plies  that  all  the  points  on  the  boundary  of  the  object 
are  features,  and  would  be  successful  candidates  for  our 
motion detection paradigm provided the projected motion
$\mathbf{u}_n^{ind}$ is sufficiently large ($\|\mathbf{u}_n^{ind}\| > \|u\|$). We define
the  diameter  e  of  an  arbitrary  object  as  the  diameter 
of  the  largest  circle  that  can  be  inscribed  in  the  closed 
contour  that  forms  the  boundary  of  the  projected  im¬ 
age  of  the  object.  Now  it  becomes  clear  that  any  ex¬ 
ploratory  paradigm  that  “covers”  the  image  so  that  it 
guarantees  the  detection  of  all  points  distance  e  apart 
will  guarantee  the  detection  of  at  least  a  few  points  on 
an  object  having  features.  The  points  are  guaranteed  to 
be  detected  by  an  appropriate  sequence  of  FOE  shifts  as 
discussed  earlier.  Moreover,  because  the  boundaries  of 
objects  are  locally  smooth,  the  points  thus  detected  will 
be  clustered  together,  so  that  it  may  be  possible  to  elim¬ 
inate  false  alarms  arising  due  to  various  noise  sources 
that  result  in  isolated  points  appearing  to  have  indepen¬ 
dent  motions.  In  practice,  the  presence  of  larger  features 
on  the  objects,  e.g.  lines  separating  regions,  would  make 
the  detection  even  easier.  Thus  the  effort  in  the  ex¬ 
ploratory  activity  can  be  reduced  when  the  objects  of 
interest  have  image  diameters  greater  than  some  thresh¬ 
old  and  when  there  need  be  no  guarantee  of  detecting 
objects  having  projected  diameters  below  that  thresh¬ 
old.  This  is  most  appropriate  when  the  purpose  of  the 
robot  is  such  that  larger  and  nearer  objects  are  more 
interesting  than  smaller  and  farther  objects,  as  may  be 
true  for  many  typical  robot  tasks  such  as  safe  navigation 
in a dynamic environment. However, at any stage the
precision of the detection can be increased by decreasing
the diameter and threshold using closer FOE shifts in
the exploratory action.
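
The trade-off described in Section 6.2 can be turned into a small calculation. The sketch below assumes the bound $d < (Z/W)\,\|u\|$, where $Z/W$ is the time to collision and $\|u\|$ the smallest independent normal flow that must be detected. The function names, the square-grid layout of FOE positions, and the sample numbers are illustrative assumptions and not part of the original method.

```python
import math


def max_foe_distance(time_to_collision: float, min_flow: float) -> float:
    """Upper bound on the FOE-to-point distance d for guaranteed detection.

    time_to_collision is Z/W; min_flow is the smallest normal flow magnitude
    that has to be detected. Both use the same image units.
    """
    return time_to_collision * min_flow


def grid_spacing(d: float) -> float:
    """Spacing of a square grid of FOE positions (an assumed layout) such that
    no image point is farther than d from its nearest grid node."""
    return d * math.sqrt(2.0)


def foe_positions_needed(width: float, height: float, d: float) -> int:
    """Number of grid nodes needed to cover a width x height image."""
    s = grid_spacing(d)
    return (math.ceil(width / s) + 1) * (math.ceil(height / s) + 1)
```

Halving the robot's speed doubles $Z/W$ and therefore doubles the admissible $d$, which reduces the number of FOE positions that have to be visited.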

7  Estimating  3-D  Motion 

Assume  an  observer  imaging  an  object  moving  in  an  un¬ 
restricted  rigid  manner.  The  motion  of  the  object  can 
be  described  as  the  sum  of  a  rotation  plus  a  translation. 
We  can  choose  a  point  through  which  the  rotation  axis 
passes;  this  gives  a  unique  rotation  and  translation  de¬ 
scribing  the  rigid  motion  (in  general  there  are  infinitely 
many  combinations  of  rotations  and  translations  describ¬ 
ing  the  same  rigid  motion).  In  many  visual  tasks  we  are 
only  interested  in  the  translation  of  the  moving  object 
and  we  need  no  information  about  how  it  rotates  around 
itself.  This  section  describes  how  we  can  estimate  the  di¬ 
rection  of  the  object’s  translation  without  being  able  to 
recover  the  rotation  using  a  technique  based  on  normal 
flow. Assume that the object is translating with velocity
$\mathbf{V} = (U, V, W)$ and rotating with angular velocity
$\Omega = (A, B, C)$ around a point $P = (X_0, Y_0, Z_0)$ on the
object. Point $P$ is on the object and its exact choice will
be made clear later.

Point $P$ is visible in the image ($p = (x_0, y_0)$) and we
attach  a  coordinate  system  onto  the  object  at  point  P 
with  axes  parallel  to  the  observer’s  coordinate  system. 
We express the motion of the object in this "object-based"
coordinate system. The velocity of a point $Q$
on the object is $\mathbf{V} + \Omega \times \overrightarrow{PQ}$.
Then the normal flow along direction $(n_x, n_y)$ at point
$(x, y)$ is $v_n = u n_x + v n_y$,

where $(u, v)$ is the motion field. Expressing $(u, v)$ in
terms of 3-D motion we get

Consider a small patch of the image around point $p =
(x_0, y_0)$ and let us assume that the average depth there
is $Z_{av}$. If we add the quantity [...] to both sides
of the above equation, we get

$$n_x \frac{U}{W} + n_y \frac{V}{W} - v_n \frac{Z_{av}}{W} = n_x x + n_y y + (\cdots)$$

One can verify that the mean of the last term in the
equation above is zero (assuming that the mean of $x$ is
$x_0$ and of $y$, $y_0$).

We  can  thus  consider  several  linear  equations: 

$$n_x \frac{U}{W} + n_y \frac{V}{W} - v_n \frac{Z_{av}}{W} = n_x x + n_y y$$

in  the  neighborhood  around  P.  Solution  of  the  system 
provides  the  FOE. 
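
For reference, the translational part of the relation used here can be written out as follows. This is a sketch under assumptions that the surviving text does not state explicitly: unit focal length, perspective projection $(x, y) = (X/Z, Y/Z)$, and a sign convention in which the object, not the camera, carries the velocity $(U, V, W)$. Rotational terms are omitted; they vanish at $p$ because the rotation axis passes through $P$.

```latex
\begin{align*}
u &= \frac{U - xW}{Z}, \qquad v = \frac{V - yW}{Z} \\
v_n &= u\,n_x + v\,n_y
     = \frac{W}{Z}\left[\, n_x\!\left(\frac{U}{W} - x\right)
                         + n_y\!\left(\frac{V}{W} - y\right) \right] \\
\Longrightarrow\quad
n_x \frac{U}{W} + n_y \frac{V}{W} - v_n \frac{Z}{W} &= n_x x + n_y y
\end{align*}
```

Replacing $Z$ by the patch average $Z_{av}$ gives one linear equation per normal flow measurement in the three unknowns $U/W$, $V/W$ and $Z_{av}/W$.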

The reader must have realized that it was the choice
of the coordinate system in which we expressed the motion
that allowed us to isolate the translational part of
the problem. Since $P$ is the center of rotation, the rotational
flow at point $p = (x_0, y_0)$ is zero. In other words
the above equation is exact at point $(x_0, y_0)$ and approximate
in its neighborhood. The error terms, however,
have zero mean. This provides the potential for robust
estimation. The time to collision is also estimated.
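
A least-squares solution of the linear system can be sketched as below. The code assumes that each measurement contributes one equation $n_x (U/W) + n_y (V/W) - v_n (Z_{av}/W) = n_x x + n_y y$. The helper names and the use of normal equations with a small Gaussian elimination are choices made for this illustration, not the authors' procedure.

```python
from typing import List, Sequence, Tuple


def solve3(a: List[List[float]], b: List[float]) -> List[float]:
    """Solve a 3x3 linear system by Gaussian elimination with partial pivoting."""
    n = 3
    m = [row[:] + [rhs] for row, rhs in zip(a, b)]  # augmented matrix
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            f = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= f * m[col][c]
    x = [0.0] * n
    for i in reversed(range(n)):
        s = sum(m[i][j] * x[j] for j in range(i + 1, n))
        x[i] = (m[i][n] - s) / m[i][i]
    return x


def estimate_foe(samples: Sequence[Tuple[float, float, float, float, float]]
                 ) -> Tuple[float, float, float]:
    """samples: (x, y, n_x, n_y, v_n) for image points near p.

    Returns (U/W, V/W, Z_av/W): the FOE coordinates and the time to collision,
    as the least-squares solution of the stacked linear equations.
    """
    ata = [[0.0] * 3 for _ in range(3)]
    atb = [0.0] * 3
    for x, y, nx, ny, vn in samples:
        row = (nx, ny, -vn)
        rhs = nx * x + ny * y
        for i in range(3):
            atb[i] += row[i] * rhs
            for j in range(3):
                ata[i][j] += row[i] * row[j]
    foe_x, foe_y, ttc = solve3(ata, atb)
    return foe_x, foe_y, ttc
```

At least three measurements with sufficiently different normal directions are needed; with fewer, or with nearly parallel normals, the system is singular or ill conditioned and the elimination step will fail or amplify noise.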

It is, however, clear that the technique for addressing
the passive navigation problem (Section 5) cannot be
used for the problem of estimating the 3-D motion of an
object (although both are the same problem if considered
as general recovery problems). For example, voting for
the values of normal flow produced by the motion of an
object can provide a very large solution area.

8  Obstacle  Avoidance — Relative  Depth 

One  of  the  most  elementary  forms  of  navigation  is  ob¬ 
stacle  avoidance  by  a  moving,  compact  sensor.  It  is  a 
prerequisite,  however,  for  many  more  complex  abilities 
since  any  system  performing  a  more  complicated  task 
must  avoid  obstacles  in  the  process.  Obstacle  avoidance 
is  thus  one  specific  problem  for  which  a  general  solution 
is  highly  desirable.  In  this  context,  a  general  solution 
refers  to  a  system  that  works  effectively  in  a  wide  range 
of  real  environments.  This  implies,  among  other  things, 
that  the  system  performance  does  not  depend  upon  ar¬ 
tificial  constraints  on  the  nature  of  objects  in  the  en¬ 
vironment  such  as  assuming  planar  or  smoothly  curved 
surfaces,  rigid  or  unmoving  objects,  mathematically  uni¬ 
form  textures,  and  so  forth. 

The  concept  of  “obstacleness”  is  a  relative  one.  When 
we  move  about  in  our  environment,  every  object  might 
represent  a  potential  obstacle,  depending  on  its  position 
and  our  direction  of  motion,  and  depending  on  its  size. 
In  addition,  time  plays  an  important  role.  When  we 
move  towards  a  building,  the  building  itself  represents 
a  potential  obstacle  if  our  intent  was  to  go  beyond  it. 
In  other  words,  an  object  represents  an  obstacle  if  the 
observer  is  on  a  collision  course  with  it,  its  size  is  com¬ 
parable  to  the  observer’s  size  and  the  time  to  collision  is 
smaller  than  some  value  which  depends  on  the  particular 
aspects  of  the  problem  under  consideration. 

Thus,  we  consider  the  problem  of  obstacle  avoidance 
as  synonymous  to  the  problem  of  computing  the  times 
to  collision  to  different  parts  of  the  scene,  or  finding  rel¬ 
ative  depth  at  places  of  interest.  This  section  is  devoted 
to  computing  time  to  collision  and  relative  depth  from 
normal  flow.  The  technique  of  the  previous  section  will 
be used. We consider the most general case, where an
object in view is moving rigidly (rotation plus translation).

8.1  Computing  time  to  collision  for  a  moving 
object 

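Recalling the last equation of the previous section, which
is exact at the position $p = (x_0, y_0)$, we have

$$v_n = \frac{W}{Z}\left(n_x \frac{U}{W} + n_y \frac{V}{W}\right) - \frac{W}{Z}\left(n_x x_0 + n_y y_0\right)$$

with all terms defined as previously. If the FOE $\left(\frac{U}{W}, \frac{V}{W}\right)$ is already
computed, the quantity $\frac{W}{Z}$, and hence the time to collision, is directly available.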

8.2 Computing relative depth


Assume two objects $A$ and $B$ moving in a rigid manner,
while an active observer has the task of finding which
one is closer. The camera is active and can undergo
a short abrupt motion along its optical axis. Let us
assume that at time $t_1$ the camera is stationary and then
it moves with velocity $W_c$ at time $t_2 = t_1 + dt$. Assuming that the
velocities of the two objects remain unchanged during
the time interval $dt$ we obtain (as in Section 7) for
times $t_1$ and $t_2$

$$\frac{Z_A}{W_A} = A_1, \qquad \frac{Z_A}{W_A - W_c} = A_2,$$

$$\frac{Z_B}{W_B} = B_1, \qquad \frac{Z_B}{W_B - W_c} = B_2,$$

where $A_1, B_1, A_2, B_2$ are known.

From these equations, assuming that $dt$ is very small,
we obtain

$$\frac{Z_A}{Z_B} = \frac{A_1 A_2 (B_1 - B_2)}{B_1 B_2 (A_1 - A_2)}$$

and hence relative depth.
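
The relative depth formula is easy to check numerically. The sketch below assumes that $A_1, A_2$ (and $B_1, B_2$) are the times to collision $Z/W$ measured for each object before and after the camera starts moving with velocity $W_c$ along its optical axis, and that depth does not change appreciably between the two measurements. The function name and the sample values are illustrative only.

```python
def relative_depth(a1: float, a2: float, b1: float, b2: float) -> float:
    """Return Z_A / Z_B from two time-to-collision measurements per object.

    a1, b1: measured with the camera stationary.
    a2, b2: measured with the camera translating along its optical axis.
    The camera velocity itself cancels and does not need to be known.
    """
    return (a1 * a2 * (b1 - b2)) / (b1 * b2 * (a1 - a2))
```

For example, with $Z_A = 10$, $W_A = 2$, $Z_B = 20$, $W_B = 4$ and $W_c = 1$, the measurements are $A_1 = 5$, $A_2 = 10$, $B_1 = 5$, $B_2 = 20/3$, and the function returns a ratio of one half, so object $A$ is the closer one. The measurement is undefined when the camera motion does not change the time to collision ($A_1 = A_2$).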


9 Visual Pursuit*


In  a  general  three  dimensional  visual  pursuit  system, 
we  find  an  agent,  whose  motion  is  under  the  control  of 
our  system;  a  camera,  which  is  used  to  generate  useful 
visual  information  to  control  the  agent;  and  an  object, 
which may be moving. If we could find the three dimensional
positions and motion parameters [19] [25] of the
camera,  the  object,  and  the  agent,  it  would  be  a  simple 
arithmetic  problem  to  predict  and  guide  the  collision  of 
the  agent  with  the  object.  However,  we  shall  show  here 
that  it  is  not  necessary  to  recover  these  parameters. 

When  we  try  to  solve  the  visual  pursuit  problem 
through  3D  recovery,  we  estimate  much  more  than  we 
actually  need  in  order  to  perform  this  generic  visual  task. 
Taking  a  purposive  viewpoint  we  develop  a  robust,  qual¬ 
itative  solution  to  the  problem  that  does  not  require  cor¬ 
respondence  or  full  3D  recovery. 

There  are  two  general  cases  of  the  pursuit  problem. 
The  camera  can  be  mounted  separately  from  the  agent 
and  the  object,  or  the  camera  can  be  mounted  on  the 
agent  or  the  object.  In  a  situation  where  a  human  agent 
pursues  a  flying  ball,  both  of  these  problems  are  involved. 
The  “camera”  is  mounted  on  the  agent  (the  human’s 
body)  which  is  intended  to  collide  with  the  object.  When 
the  human  is  sufficiently  close  to  the  ball,  the  “camera”, 
which  is  mounted  on  the  head,  is  independent  of  the 
agent  (the  hand,  possibly  carrying  a  tool  such  as  a  bat), 
and  the  hand  is  to  collide  with  the  ball.  Thus  the  solu¬ 
tion  of  both  problems  would  provide  a  theoretical  basis 
for  an  integrated  mobile  “hunting”  system,  or  for  a  base¬ 
ball  player! 

From a mathematical viewpoint it is equivalent
whether the camera is mounted on the agent or the camera
is mounted on the object. In such systems the collision
is solely determined by the relative motion of the
object and the agent, and it is equivalent whether we
are controlling the motion of the entity that the camera
is mounted on, or the entity that is moving separately
from the camera. However, they may have different applications.
An example of the case when the camera
is mounted on the agent is an airplane that is attacking
a target. An example of the case when the camera
is mounted on the object is a camera that is guiding a
plane to land near the camera.

*This section demonstrates that depth recovery is not necessary
for motion coordination problems.

Let us assume a Cartesian coordinate system with its
origin  at  the  focus  of  the  camera,  with  the  z-axis  pointing 
towards  the  general  direction  of  the  agent  and  the  object, 
such  that  both  the  object  and  the  agent  are  in  the  full 
view  of  the  camera. 

Assume that the agent is located at $(X_a, Y_a, Z_a)^T$ with
a velocity of $(V_{xa}, V_{ya}, V_{za})^T$, and the object is located at
$(X_o, Y_o, Z_o)^T$ with a velocity of $(V_{xo}, V_{yo}, V_{zo})^T$. If the
agent or the object is also rotating at the time, we can
choose the rotation axis to go through visible points on
the surface of the agent or the object, chosen such that
the rotation parameters are irrelevant in the prediction
and guidance of a collision. However, for simplicity we
assume that the motion is instantaneously translational.
In the general case the analysis remains essentially the
same, but the formulae become more complicated.

The agent and the object will collide after time $t$ provided
that

$$\frac{X_a - X_o}{V_{xo} - V_{xa}} = \frac{Y_a - Y_o}{V_{yo} - V_{ya}} = \frac{Z_a - Z_o}{V_{zo} - V_{za}} = t > 0 \qquad (5)$$


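If the projection of the agent (i.e. a point of it) on the
image plane is $(x_a, y_a)$, and the projection of the object
is $(x_o, y_o)$, assuming unit focal length and perspective
projection, we have

$$x_a = \frac{X_a}{Z_a} \qquad (6)$$

$$y_a = \frac{Y_a}{Z_a} \qquad (7)$$

$$x_o = \frac{X_o}{Z_o} \qquad (8)$$

$$y_o = \frac{Y_o}{Z_o} \qquad (9)$$

$$v_{xa} = \frac{V_{xa}}{Z_a} - x_a \frac{V_{za}}{Z_a} \qquad (10)$$

$$v_{ya} = \frac{V_{ya}}{Z_a} - y_a \frac{V_{za}}{Z_a} \qquad (11)$$

$$v_{xo} = \frac{V_{xo}}{Z_o} - x_o \frac{V_{zo}}{Z_o} \qquad (12)$$

$$v_{yo} = \frac{V_{yo}}{Z_o} - y_o \frac{V_{zo}}{Z_o} \qquad (13)$$

where $(v_{xa}, v_{ya})$, $(v_{xo}, v_{yo})$ is the flow produced by the
agent and the object at points $(x_a, y_a)$ and $(x_o, y_o)$, respectively.
Combining (6-9) with (5), we obtain the following
relation for the prediction of collision: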

$$\begin{aligned}
t &= \frac{x_a Z_a - x_o Z_o}{V_{xo} - V_{xa}} && (14) \\
  &= \frac{y_a Z_a - y_o Z_o}{V_{yo} - V_{ya}} && (15) \\
  &= \frac{Z_a - Z_o}{V_{zo} - V_{za}} && (16) \\
  &> 0 && (17)
\end{aligned}$$

We call (14-17) the Visual Constraints of Collision.
The visual pursuit problem is solved if we can guide the
system to satisfy these constraints. Using the processes
of Sections 7 and 8 we can estimate the locomotive intrinsics
(i.e. the direction of translation and the time to
collision). In what follows, we solve the visual pursuit
problem in the case when the camera is mounted on the
object, using only the signs of the three locomotive intrinsics,
and then we present a solution in the case when
the camera is mounted separately to supervise the agent,
using the locomotive intrinsics, relative depth, and the
direction of motion.
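
For testing a pursuit controller in simulation, where 3-D positions and velocities are known, constraint (5) can be evaluated directly. The sketch below is only such a reference check; the point of this section is that the real system does not need these 3-D quantities. The function name, the tolerance handling, and the treatment of axes with no relative motion are assumptions of this illustration.

```python
import math
from typing import Optional, Tuple

Vec3 = Tuple[float, float, float]


def collision_time(agent_pos: Vec3, agent_vel: Vec3,
                   object_pos: Vec3, object_vel: Vec3,
                   tol: float = 1e-9) -> Optional[float]:
    """Return the time t > 0 at which constraint (5) holds, or None.

    Per axis, t = (agent - object position) / (object - agent velocity).
    An axis with no relative motion is acceptable only if the two
    coordinates already coincide on that axis.
    """
    times = []
    for pa, va, po, vo in zip(agent_pos, agent_vel, object_pos, object_vel):
        gap, closing = pa - po, vo - va
        if abs(closing) < tol:
            if abs(gap) > tol:
                return None  # the gap on this axis never closes
            continue         # already aligned on this axis
        times.append(gap / closing)
    if not times:
        return None          # no relative motion at all
    t = times[0]
    if t <= 0.0:
        return None
    if any(not math.isclose(t, other, rel_tol=1e-6) for other in times):
        return None          # the axes disagree, so the paths do not meet
    return t
```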

9.1  Camera  mounted  on  the  object 

The problem is equivalent whether the camera is
mounted on the agent or on the object. For simplicity
here we assume that the camera is mounted on the
object and that we can control the velocity of the agent.
We  choose  a  Cartesian  coordinate  system  with  its  origin 
at  the  focus  of  the  camera,  and  with  its  z-axis  pointing 
towards  the  general  direction  of  the  agent,  such  that  the 
agent  is  in  the  full  view  of  the  camera. 

As the camera is mounted on the object, the coordinates
of the object on the image plane are zero,
as is its velocity. We have $(X_o, Y_o, Z_o)^T = 0$ and
$(V_{xo}, V_{yo}, V_{zo})^T = 0$, as well as $(x_o, y_o) = 0$ and
$(v_{xo}, v_{yo}) = 0$. In the following, when we write $Z_o$ and
$Z_a$, we always mean $E(Z_o)$ and $E(Z_a)$ (i.e. the average
depth around the neighborhood), unless otherwise
specified. Thus (14-17) can be simplified to


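$$\begin{aligned}
t &= -\frac{x_a Z_a}{V_{xa}} && (18) \\
  &= -\frac{y_a Z_a}{V_{ya}} && (19) \\
  &= -\frac{Z_a}{V_{za}} && (20) \\
  &> 0 && (21)
\end{aligned}$$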

From (18) and (20) we have $x_a = V_{xa}/V_{za}$. From
(19) and (20) we have $y_a = V_{ya}/V_{za}$. Thus if we draw
a line from the origin through the focus of expansion
$(V_{xa}/V_{za}, V_{ya}/V_{za})$, or the first two locomotive intrinsics,
on the image plane, we have a set of all the points that
will collide with the origin. In order to collide the agent
with the object, we should control the motion of the
agent so that the focus of expansion lies inside the image
of the agent. The third locomotive intrinsic $Z_a/V_{za}$
is the negative of the time to collision (see (21)). Note
that since $t > 0$, the third locomotive intrinsic should be
negative for the collision to occur. In this case, $V_{za} < 0$,
that is the agent should be coming closer to the camera.

Since  we  have  an  active  camera,  for  simplicity  we  can 
rotate  the  camera  such  that  the  image  of  the  agent  will 
be  in  the  center.  The  agent  will  collide  with  the  object 
if  we  can  keep  the  focus  of  expansion  at  the  origin  and 
keep  an  expanding  pattern  of  normal  flows. 

If the focus of expansion is not at the origin, we can
devise a control strategy to guide the focus of expansion
towards the origin of the image plane according to
the signs of the three locomotive intrinsics, indicating
whether the velocity of the agent needs to be increased
or decreased at any time instant.

• If $Z_a/V_{za} < 0$ and $V_{xa}/V_{za} = V_{ya}/V_{za} = 0$, a collision
will occur.

• If $Z_a/V_{za} = 0$, a collision has occurred;

• If $Z_a/V_{za} > 0$, the agent is going away. Decrease $V_{za}$
and

- If $V_{xa}/V_{za} = 0$, do not change $V_{xa}$;

- If $V_{xa}/V_{za} > 0$, decrease $V_{xa}$;

- If $V_{xa}/V_{za} < 0$, increase $V_{xa}$;

- If $V_{ya}/V_{za} = 0$, do not change $V_{ya}$;

- If $V_{ya}/V_{za} > 0$, decrease $V_{ya}$;

- If $V_{ya}/V_{za} < 0$, increase $V_{ya}$;

• If $Z_a/V_{za} < 0$, the agent is coming closer. Do not
change $V_{za}$ and

- If $V_{xa}/V_{za} = 0$, do not change $V_{xa}$;

- If $V_{xa}/V_{za} > 0$, increase $V_{xa}$;

- If $V_{xa}/V_{za} < 0$, decrease $V_{xa}$;

- If $V_{ya}/V_{za} = 0$, do not change $V_{ya}$;

- If $V_{ya}/V_{za} > 0$, increase $V_{ya}$;

- If $V_{ya}/V_{za} < 0$, decrease $V_{ya}$.

This  constitutes  a  qualitative  paradigm  for  colliding 
the  agent  with  the  object  when  the  camera  is  mounted  on 
the  object.  We  only  use  the  sign  of  the  three  locomotive 
intrinsics.  We  can  predict  the  collision,  and  if  a  collision 
will  not  occur  we  qualitatively  control  the  velocity  of 
the  agent  towards  a  state  such  that  the  agent  will  collide 
with  the  object. 
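
The qualitative rules of this section translate into a small decision function. The sketch below is one possible encoding, not the authors' code: the function names, the return format, and the tolerance `eps` used to decide when an intrinsic counts as zero are all assumptions. Each command is -1 (decrease), 0 (leave unchanged) or +1 (increase).

```python
from typing import Tuple


def sign(value: float, eps: float = 1e-9) -> int:
    """Return -1, 0 or +1, treating |value| <= eps as zero."""
    if value > eps:
        return 1
    if value < -eps:
        return -1
    return 0


def qualitative_control(foe_x: float, foe_y: float,
                        third: float) -> Tuple[str, int, int, int]:
    """Map the signs of the three locomotive intrinsics to velocity commands.

    foe_x = V_xa / V_za, foe_y = V_ya / V_za, third = Z_a / V_za.
    Returns (status, command for V_xa, command for V_ya, command for V_za).
    """
    sx, sy, s3 = sign(foe_x), sign(foe_y), sign(third)
    if s3 == 0:
        return ("collided", 0, 0, 0)
    if s3 > 0:
        # Agent is going away: reduce V_za and push the FOE toward the origin.
        return ("receding", -sx, -sy, -1)
    if sx == 0 and sy == 0:
        return ("collision predicted", 0, 0, 0)
    # Agent is approaching: keep V_za and push the FOE toward the origin.
    return ("approaching", sx, sy, 0)
```

Only signs are used, so the estimates of the intrinsics need not be accurate as long as their signs are reliable.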

9.2  Camera  mounted  separately 

When  the  camera  is  mounted  separately,  the  camera  may 
be  stationary  or  in  motion  relative  to  the  world  coordi¬ 
nate  system.  But  for  simplicity,  we  choose  a  coordinate 
system  with  its  origin  at  the  focus  of  the  camera  and  its 
z-axis  pointing  towards  the  general  direction  of  the  agent 
and  the  object  such  that  both  the  agent  and  the  object 
are  in  full  view  of  the  camera.  In  this  coordinate  sys¬ 
tem,  the  camera  is  stationary,  and  velocity  is  measured 
relative  to  the  camera. 

9.2.1  Object  coming  towards  the  camera 

The  special  case  when  the  object  is  coming  towards 
the  camera  may  need  to  be  handled  differently.  If  we 
can  correctly  identify  cases  when  the  object  is  coming 
towards  the  camera,  we  may  want  to  move  the  camera 
away  from  the  pathway  of  the  object  and  then  proceed  as 
usual,  or  when  the  object  is  small  and  is  not  destructive, 
we  may  just  put  the  agent  in  front  of  the  camera  to 
receive  the  object. 

We  can  detect  whether  the  object  is  coming  towards 
the  camera  using  the  analysis  in  the  previous  section. 
When  the  object  is  coming  towards  the  camera,  the  focus 
of  expansion  of  the  object  lies  inside  the  image  of  the 
object  and  the  third  locomotive  intrinsic  is  negative. 

If we send the agent to the front of the camera, we have
the case studied in the last section. It can be determined
from the time to collision of the agent and the object
whether the agent is moving fast enough to intercept
the object.

In  the  following  general  analysis,  we  assume  that  the 
object  is  not  coming  towards  the  camera,  but  moving  in 
any  other  direction. 


9.2.2 General case of camera mounted separately


From (14-15), we obtain

$$\big((x_a V_{yo} - y_a V_{xo}) - (x_a V_{ya} - y_a V_{xa})\big) Z_a - \big((x_o V_{yo} - y_o V_{xo}) - (x_o V_{ya} - y_o V_{xa})\big) Z_o = 0 \qquad (22)$$

Note that if $(x_a, y_a)$, $(V_{xa}/V_{za}, V_{ya}/V_{za})$, $(x_o, y_o)$, and
$(V_{xo}/V_{zo}, V_{yo}/V_{zo})$ are a group of parallel vectors, (22)
will be satisfied. Thus in the general case of a separately
mounted camera, we first obtain the direction of motion
of the object; then we rotate the z-axis of the camera
such that it will be in the direction of the object. Then
we can move the agent in the direction parallel to the
direction of motion of the object. This is a group of
sufficient conditions for the collision of the agent and
the object when we have good control of the original
position of the agent.

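To satisfy (16-17), we need to find the time to collision
$t$ and make it equal to (16). Combining (10-13) with
(14-17), we obtain

$$\begin{aligned}
t &= \frac{x_a - x_o \frac{Z_o}{Z_a}}{\left(v_{xo} + x_o \frac{V_{zo}}{Z_o}\right)\frac{Z_o}{Z_a} - \left(v_{xa} + x_a \frac{V_{za}}{Z_a}\right)} && (23) \\
  &= \frac{y_a - y_o \frac{Z_o}{Z_a}}{\left(v_{yo} + y_o \frac{V_{zo}}{Z_o}\right)\frac{Z_o}{Z_a} - \left(v_{ya} + y_a \frac{V_{za}}{Z_a}\right)} && (24) \\
  &= \frac{1 - \frac{Z_o}{Z_a}}{\frac{V_{zo}}{Z_o}\frac{Z_o}{Z_a} - \frac{V_{za}}{Z_a}} && (25) \\
  &> 0 && (26)
\end{aligned}$$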
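
Relations (23)-(25) use only image positions, image flow, the relative depth $Z_o/Z_a$ and the ratios $V_{za}/Z_a$ and $V_{zo}/Z_o$. A minimal sketch of their evaluation follows; the function and parameter names (`alpha`, `beta`, `rho`) are introduced here for illustration and do not appear in the original.

```python
from typing import Tuple


def times_to_collision(xa: float, ya: float, xo: float, yo: float,
                       vxa: float, vya: float, vxo: float, vyo: float,
                       alpha: float, beta: float, rho: float
                       ) -> Tuple[float, float, float]:
    """Evaluate the right-hand sides of (23), (24) and (25).

    alpha = V_za / Z_a, beta = V_zo / Z_o, rho = Z_o / Z_a.
    A collision is predicted when the three values agree and are positive.
    No guard against zero denominators is included in this sketch.
    """
    tx = (xa - xo * rho) / ((vxo + xo * beta) * rho - (vxa + xa * alpha))
    ty = (ya - yo * rho) / ((vyo + yo * beta) * rho - (vya + ya * alpha))
    tz = (1.0 - rho) / (beta * rho - alpha)
    return tx, ty, tz
```

As a check, take an agent at $(0, 0, 5)$ with velocity $(0.35, 0.5, 0.75)$ and an object at $(1, 2, 10)$ with velocity $(0.1, 0, -0.5)$, which meet after 4 time units. The corresponding image quantities make all three expressions evaluate to 4.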


If we can find a point on the agent and a point on the
object which have the same normal direction $(n_x, n_y)$,
from (23-24) we find the time to collision as follows:

$$t = \frac{n_x x_a + n_y y_a - (n_x x_o + n_y y_o)\frac{Z_o}{Z_a}}{\left(v_{no} + (n_x x_o + n_y y_o)\frac{V_{zo}}{Z_o}\right)\frac{Z_o}{Z_a} - \left(v_{na} + (n_x x_a + n_y y_a)\frac{V_{za}}{Z_a}\right)} \qquad (27)$$

where $v_{na}$ and $v_{no}$ are the normal flows of the agent and the
object along $(n_x, n_y)$. Similar equations can be obtained if we can find two
normal directions from the agent and the object which
are perpendicular. Combining with (25), we have

$$\frac{V_{za}}{Z_a} = \frac{V_{zo}}{Z_o} + \frac{\left(\frac{Z_o}{Z_a} - 1\right)\left(v_{no} - \frac{Z_a}{Z_o} v_{na}\right)}{n_x (x_a - x_o) + n_y (y_a - y_o)} \qquad (28)$$

Thus, control of the agent is achieved by varying
$V_{xa}/V_{ya}$ in order to satisfy (22) and $V_{za}$ in order to satisfy
(26) and (28). According to these equations, we can
devise a system for qualitative control of the motion of
the agent, so that the agent will collide with the object.
This scheme can be accomplished through six sequential
phases as follows. The first three phases are devised to
satisfy (22). The next two phases are devised to satisfy
(26) and (28). The last phase tests to see if the agent
will collide with the object without further control of the
agent.

1. Rotating the camera. Through the point $(x_o, y_o)$
draw a line in the image plane with direction
$(V_{xo}/V_{zo}, V_{yo}/V_{zo})$.

• If the line goes through the origin, proceed to
the next phase;

• If the origin is on the lower left portion of the
image plane, rotate the camera up and to the
right;

• If the origin is on the upper right portion of the
image plane, rotate the camera downward and
to the left;

2. Position the agent.

• If the line drawn above goes through the agent,
proceed to the next phase;

• If the agent is on the lower left portion of the
image plane, move the agent up and to the
right;

• If the agent is on the upper right portion of the
image plane, move the agent downward and to
the left;

3. Move the agent parallel to the image plane. Change
the velocity of the agent, such that $V_{xa}/V_{ya} = x_a/y_a$.

• If the agent is on the left of the object, $V_{xa}$ and
$V_{ya}$ should be increased;

• If the agent is on the right of the object, $V_{xa}$
and $V_{ya}$ should be decreased;

• If the agent and the object collide on the image
plane, do not change $V_{xa}$ and $V_{ya}$.

4. Positive time to collision. Proceed to the next phase
after:

• If $Z_o/Z_a = 1$, do not change $V_{za}$ at present;

• If $Z_o/Z_a > 1$, adjust $V_{za}$ such that
$\frac{V_{za}}{Z_a} > \frac{V_{zo}}{Z_o}\frac{Z_o}{Z_a}$;

• If $Z_o/Z_a < 1$, adjust $V_{za}$ such that
$\frac{V_{za}}{Z_a} < \frac{V_{zo}}{Z_o}\frac{Z_o}{Z_a}$.

5. Move the agent perpendicular to the image plane.

• If (28) is satisfied, proceed to the next phase.

• If $V_{za}/Z_a$ is larger in (28), decrease $V_{za}$;

• If $V_{za}/Z_a$ is smaller in (28), increase $V_{za}$;

6. Predicting collision. The agent will collide with the
object if the following conditions are met, or otherwise
repeat from phase 1:

$$\frac{y_o}{x_o} = \frac{y_a}{x_a} = \frac{V_{yo}}{V_{xo}} = \frac{V_{ya}}{V_{xa}}$$

and

$$\frac{1 - \frac{Z_o}{Z_a}}{\frac{V_{zo}}{Z_o}\frac{Z_o}{Z_a} - \frac{V_{za}}{Z_a}} > 0$$

and

$$\frac{V_{za}}{Z_a} = \frac{V_{zo}}{Z_o} + \frac{\left(\frac{Z_o}{Z_a} - 1\right)\left(v_{no} - \frac{Z_a}{Z_o} v_{na}\right)}{n_x (x_a - x_o) + n_y (y_a - y_o)}$$
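
The normal flow form of the time to collision and the matching condition on $V_{za}/Z_a$ can be checked numerically. The sketch below assumes the forms obtained by projecting (23) and (24) onto a common normal direction $(n_x, n_y)$ and equating the result with (25). The names `sa`, `so`, `alpha`, `beta` and `rho` are shorthand introduced for this illustration only.

```python
def ttc_from_normal_flow(sa: float, so: float, vna: float, vno: float,
                         alpha: float, beta: float, rho: float) -> float:
    """Time to collision from normal flow along a shared direction (n_x, n_y).

    sa = n_x*x_a + n_y*y_a, so = n_x*x_o + n_y*y_o,
    vna, vno = normal flows of agent and object,
    alpha = V_za / Z_a, beta = V_zo / Z_o, rho = Z_o / Z_a.
    """
    return (sa - so * rho) / ((vno + so * beta) * rho - (vna + sa * alpha))


def required_alpha(sa: float, so: float, vna: float, vno: float,
                   beta: float, rho: float) -> float:
    """Value of V_za / Z_a for which the normal-flow estimate agrees with (25)."""
    return beta + (rho - 1.0) * (vno - vna / rho) / (sa - so)
```

In phase 5 the agent's current $V_{za}/Z_a$ would be compared with `required_alpha`: if it is larger, $V_{za}$ is decreased, and if it is smaller, $V_{za}$ is increased. The expression is undefined when `sa` equals `so`, that is when the two image points have the same projection onto the chosen normal direction.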

In summary, in the case when the camera is mounted
on the object or the agent, we have devised a qualitative
strategy to predict and guide the collision of the
agent  and  the  object.  We  only  used  the  signs  of  the  three 
locomotive  intrinsics  (FOE  and  time  to  contact)  to  qual¬ 
itatively  control  the  velocity  of  the  agent  such  that  the 
visual  constraints  of  collision  will  be  satisfied. 

In  the  case  when  the  camera  is  mounted  separately  to 
supervise  the  agent  to  collide  with  the  object,  we  have 
devised  a  set  of  sufficient  conditions  to  satisfy  the  visual 
constraints  of  collision.  This  set  of  sufficient  conditions 
can  be  reached  by  a  qualitative  scheme  of  control  without 
any  exact  3D  depth  or  velocity  information  of  the  agent 
and  the  object. 

Our  method  can  also  be  used  to  control  the  collision 
even  when  the  object  is  rotating  in  addition  to  having 
instantaneous  translational  motion. 

10  Recapitulation  and  Experiments 

We  have  presented  solutions  to  several  problems  related 
to  visual  motion  using  normal  flow  as  the  input.  Al¬ 
though  we  have  not  solved  the  general  structure  from 
motion  (sfm)  problem  using  normal  flow,  we  have  pre¬ 
sented  solutions  to  various  important  problems  that  are 
simple  applications  of  the  sfm  module.  The  robustness  of 
the  proposed  algorithms  relies  heavily  on  the  robustness 
of  the  computation  of  normal  flow,  i.e.  spatiotemporal 
derivatives  of  the  image  intensity  function.  But  even 
without  using  any  elaborate  schemes  for  computing  the 
normal  flow  (after  all,  some  of  the  techniques  presented 
only  require  its  sign)  we  have  performed  several  experi¬ 
ments.  We  report  here  a  few  of  them: 

(a)  Egomotion  estimation 

We  have  performed  several  experiments  with  both  syn¬ 
thetic  and  real  image  sequences  in  order  to  demonstrate 
the  stability  of  our  method.  From  experiments  on  real 
images  it  was  found  that  in  the  case  of  pure  transla¬ 
tion  or  pure  rotation  the  method  computes  the  focus  of 
expansion  or  the  axis  of  rotation  very  robustly.  In  the 
case  of  general  motion  it  was  found  from  experiments 
on  synthetic  data  that  the  behavior  of  the  method  is  as 
predicted  by  our  theoretical  analysis  (see  [16]). 

Figure 14 shows one of the images from a dense sequence
collected in our laboratory using a Merlin American
Robot arm that translated while acquiring images
with the camera it carried (a Sony miniature TV camera).
Figure 15 shows the last frame in the sequence
and  Figure  16  shows  the  first  frame  with  the  solution 
area  (where  the  FOE  lies),  which  agrees  with  the  ground 
truth.  Figures  17  and  18  show  the  first  and  last  frames  in 
a  sequence  of  images  collected  through  a  rotation  of  the 
sensor  and  provided  by  the  University  of  Massachusetts 
for  the  Workshop.  Figure  19  shows  the  first  frame  of  the 
sequence  with  the  solution  area  for  the  AOR. 

(b)  Detection  of  independent  motion 

Figure 20 shows the experimental setting for testing
the algorithm for motion detection from a translating,
active camera. The CCD camera is mounted on a slide
to simulate pure translation, and can be given small
rotations around a revolving platform to simulate the
exploratory activity. The model board simulates an
outdoor scene. The image sequence is captured by Data
Translation QuickCapture on a Macintosh IIci. Figures
21(a)-21(d) show the results of the experiment: (a) is a
sequence of closely sampled images taken from a moving
and active camera; (b) shows the output of the motion
detection algorithm without any exploratory activity; (c)
shows the output after four shifts of the FOE as part of
a simple exploratory activity; and (d) shows the motion
detection output (dark) overlaid on the image (light).

[Figures 14-19]

(c)  Relative  Depth 

We have performed several experiments on both synthetic
and real data in order to test the feasibility and
stability of our approach. We report here some experiments
on real data. The setup for our experimental work
with  real  images  consists  of  a  CCD  camera  mounted  on  a 
slide  so  that  it  can  purely  translate  along  its  optical  axis. 
The  camera  is  viewing  a  scene  consisting  of  a  toy  (“Mrs. 
Potatohead”)  and  a  toy  robot  arm  (Radio  Shack).  The 
arm  (carrying  the  “vision”  of  Mrs.  Potatohead)  is  ini¬ 
tially  placed  closer  to  the  camera.  Figures  22  and  23  are 
taken  with  the  camera  stationary  and  the  arm  moving 
toward  Mrs.  Potatohead.  Figure  24  shows  the  nor¬ 
mal  flow  produced  from  the  motion  of  the  arm,  using 
the  straightforward  gradient  technique  [19].  Figure  25  is 
taken  after  the  camera  has  moved  forward  and  Figure  26 
shows  the  normal  flow  produced. 

Using the algorithms in Sections 5 and 7, we estimated
the relative depth of the toy and the arm. We computed
the quantity $Z/V_c$, where $Z$ is the depth of a point and
$V_c$ is the speed of the camera, and considered the median
value for the arm and the toy. It was found that this
value was 7.553544 for the arm and 9.118339 for the toy,
which agrees with the ground truth.

We performed the same experiment with the arm at
the same distance as the toy. We found that the value of
the median relative depth ($Z/V_c$) was 10.230856 for the
arm and 10.145772 for the toy, which again agrees with
the ground truth.
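
As a rough illustration of the quantity $Z/V_c$, the sketch below assumes a camera translating along its optical axis with the FOE at the image origin. Under that assumption the normal flow at image point (x, y) along the unit gradient direction (nx, ny) is un = (x nx + y ny) V_c / Z, so Z/V_c can be read off each measurement and summarized by its median. The function names, the threshold `eps` and the synthetic samples are assumptions made for illustration; this is not the implementation of Sections 5 and 7.

```python
from statistics import median

def relative_depths(samples, eps=1e-6):
    """Return Z/V_c for each usable normal-flow sample.

    Each sample is (x, y, nx, ny, un): image position relative to the FOE,
    unit gradient direction, and normal flow measured along that direction.
    Assumes pure translation along the optical axis (hypothetical setup).
    """
    out = []
    for x, y, nx, ny, un in samples:
        radial = x * nx + y * ny      # component of (x, y) along the gradient
        if abs(un) < eps or abs(radial) < eps:
            continue                  # measurement carries no depth information
        out.append(radial / un)       # Z / V_c
    return out

def synthetic_samples(points, z_over_v):
    """Noise-free samples for points that all share the same Z/V_c."""
    return [(x, y, nx, ny, (x * nx + y * ny) / z_over_v)
            for x, y, nx, ny in points]

# Made-up image points and gradient directions (not data from the paper).
points = [(10.0, 5.0, 1.0, 0.0), (-8.0, 12.0, 0.0, 1.0), (6.0, -9.0, 0.6, 0.8)]
arm_depth = median(relative_depths(synthetic_samples(points, 7.5)))
toy_depth = median(relative_depths(synthetic_samples(points, 9.0)))
print("arm closer than toy:", arm_depth < toy_depth)
```

Using the median rather than the mean keeps a few bad normal-flow measurements from dominating the estimate for each object.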

11  Purposive,  Behavioral,  Active  Vision 

Vision  has  been  studied,  for  the  most  part,  as  a  general 
recovery  problem,  i.e.  its  goal  has  been  to  reconstruct  an 
accurate  representation  of  the  visible  world  and  its  prop¬ 
erties,  for  example,  to  recover  boundaries,  shape  from 
texture,  shading,  motion,  etc.  Following  this  point  of 
view,  we  consider  the  “brain” — or  any  intelligent  sys¬ 
tem  possessing  vision — as  consisting  of  vision  and  every¬ 
thing  else  (planning,  reasoning,  memory,  etc.).  In  other 
words,  we  view  the  role  of  vision  as  that  of  creating  a 
central  database  which  stores  accurate  3-D  information 
about  the  scene.  Then  other  cognitive  processes  (such 
as planning, for example) can access this database, extract
whatever information they need and modify it to
suit  their  needs.  This  central  database  is  created  by  vi¬ 
sual  modules — such  as  the  sfm  module — that  have  been 
integrated  in  some  way  [3]. 

But  if  the  analysis  in  this  paper  is  valid,  it  demon¬ 
strates  that  we  can  solve  many  interesting  problems, 
without  creating  a  very  accurate  or  full  representation 
of  the  scene  and  its  properties.  Clearly,  when  a  problem 
is  simpler  and  more  restricted,  it  is  easier  to  solve.  How¬ 
ever,  these  simpler  problems  (in  our  case,  simpler  than 
the  general  sfm  problem) — namely,  passive  navigation, 
motion  detection,  3-D  translation  estimation,  obstacle 
avoidance,  relative  depth,  visual  interception — are  quite 
important  and  not  very  specific.  They  are  generic  in  the 
sense  that  they  have  environmental  invariance.  In  other 
words, developing such visual motion capabilities constitutes
theoretical research. The fact that we may be able
to robustly solve many less general problems (which,
of course, cannot replace the reconstructive modules)
demonstrates that we are capable of building machines
that  robustly  achieve  various  behaviors.  By  putting  such 
behaviors  together,  can  we  achieve  “intelligent  systems”? 
If  this  is  possible,  it  provides  an  alternative  way  to  study 
perception.  A  few  publications  over  the  past  few  years 
[6,  8,  11,  12,  15,  26,  35]  have  supported  such  an  ap¬ 
proach,  which  has  acquired  various  names  such  as  purpo¬ 
sive,  task-based,  behavioral,  active,  animate,  utilitarian, 
etc.  In  this  section  we  attempt  to  describe  the  paradigm 
in  more  detail  and  we  point  out  its  drawbacks  as  well  as 
its  potential  usefulness. 

11.1  An  attempt  to  formalize 

With  the  realization  that  behavioral  vision  has  as  its 
goal  the  development  of  robust,  non-primitive  behaviors 
displayed  by  a  robotic  agent,  we  should  be  able  to  for¬ 
malize  the  concept  of  behavior  and  the  concept  of  an 
agent.  At  the  same  time  we  need  to  be  able  to  provide  a 
formal  way  of  generating  new  behaviors  and  a  calculus 
of  behaviors  or  purposes. 


If there is a similarity of this approach to old ideas of
goal-based vision (where systems using knowledge at all
levels, including domain-specific knowledge, were built
and it turned out that many corners had to be cut and
many oversimplified assumptions had to be made), it
exists only in spirit. An intelligent agent (observer) is a
system that has a set of goals or purposes, at all times.
To pursue these goals, it has to exhibit a set of behaviors.
Not all agents have the same purposes; some are
more sophisticated than others and they display different
behaviors.

It  would  be  hard  to  give  a  general  definition  for  an 
agent  (or  such  a  definition  would  be  so  general  that  it 
wouldn’t  be  useful  at  an  engineering  level).  We  are  sur¬ 
rounded  by  agents.  They  are  basically  entities  that  in¬ 
teract  with  the  world  around  them  and  act  appropriately 
in  each  situation.  As  they  act  and  sense,  they  display 
behaviors  and  fulfill  purposes. 

Coming  back  to  the  basic  question  of  formalizing  be¬ 
haviors,  we  realize  that  there  is  a  very  rich  set  of  them. 
Some  are  primitive,  others  more  sophisticated  and  oth¬ 
ers  quite  complex.  In  such  situations,  it  is  nice  to  be 
able  to  start  from  primitives,  that  is  a  set  of  behaviors 
from  which  all  others  can  be  constructed.  But  it  is  not 
at  all  clear  which  behaviors  are  the  primitive  ones. 

To  avoid  a  potential  philosophical  snare,  we  sidestep 
the  question  and  we  ask:  how  can  we  formalize  behav¬ 
iors,  and  then  generate  new  and  more  complex  ones  from 
old  ones  and  from  learning? 

A  behavior  is  a  sequence  of  perceptual  events  and  ac¬ 
tions  whose  task  is  to  accomplish  a  goal.  Visual  input  is 
received  in  a  continuous  manner  and  various  processes 
(such  as  those  described  here  and  others)  work  together 
in  order  to  recognize  perceptual  events  and  take  appro¬ 
priate  action  (an  action  could  be  a  motion  (navigation, 
manipulation),  or  a  change  of  an  internal  state  of  the 
agent  displaying  the  behavior).  The  problem  is  then  to 
control  such  a  system.  It  must  be  emphasized  that  the 
processes  performing  the  visual  analysis  in  order  to  rec¬ 
ognize  the  perceptual  events  perform  only  partial  recov¬ 
ery  of  the  world,  i.e.  to  accomplish  some  behaviors  we 
do  not  need  an  accurate  and  full  scene  representation. 

In abstract terms, a behavior of an agent is a system
broadly known as a discrete event process [7]. However,
despite numerous results in the literature, there is
at the present time apparently no unifying theory for
the control of discrete event processes. Nor is it very
clear what such a theory should accomplish. Numerous
approaches to the modeling of discrete processes have
appeared in the literature (Boolean models, Petri nets,
formal languages, temporal logic, port automata, and
flow networks).

An interesting model proposed recently [37] treats the
controlled set of processes as the generator of a formal
language (an automaton taking various actions) and
studies how the recognizer of a specific (target) language
(another machine recognizing perceptual events)
may be employed as a controller, incorporating the desired
closed-loop system behavior. It is shown how to
construct such a controller under some assumptions. It
is also shown how, given two such controllable systems
(let’s say behaviors $B_1$ and $B_2$), to create the shuffle
operation $B_1 \| B_2$, so that we can create more complex
behaviors from existing ones. However, the main conclusion
of such control theoretic work may be paraphrased
by saying that “supervisors must be modeled on the task
to be accomplished”. In other words, there does not appear
to be a general universal way to accomplish behaviors
(control them) or make new ones from old ones; it
appears that the problem depends on what has to be
accomplished.
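
To make the composition of behaviors concrete, here is a minimal sketch of a shuffle (interleaving) product, assuming each behavior is modeled as a deterministic finite automaton over its own event alphabet. The `Automaton` representation, the `shuffle` and `accepts` helpers, and the two toy behaviors are hypothetical and are not taken from [37].

```python
from dataclasses import dataclass

@dataclass
class Automaton:
    start: object
    finals: frozenset
    delta: dict  # maps (state, event) -> next state

def states_of(m):
    """All states mentioned by an automaton."""
    return {m.start} | {s for (s, _) in m.delta} | set(m.delta.values())

def shuffle(a, b):
    """Interleaving product of two automata with disjoint event alphabets."""
    events_a = {e for (_, e) in a.delta}
    events_b = {e for (_, e) in b.delta}
    if events_a & events_b:
        raise ValueError("this sketch assumes disjoint alphabets")
    delta = {}
    for (s, e), t in a.delta.items():      # a moves, b stays where it is
        for q in states_of(b):
            delta[((s, q), e)] = (t, q)
    for (q, e), r in b.delta.items():      # b moves, a stays where it is
        for s in states_of(a):
            delta[((s, q), e)] = (s, r)
    finals = frozenset((s, q) for s in a.finals for q in b.finals)
    return Automaton((a.start, b.start), finals, delta)

def accepts(m, word):
    """True if the event sequence drives m from its start to a final state."""
    state = m.start
    for event in word:
        if (state, event) not in m.delta:
            return False
        state = m.delta[(state, event)]
    return state in m.finals

# Two toy behaviors (illustrative event names only).
b1 = Automaton(0, frozenset({2}), {(0, "see"): 1, (1, "reach"): 2})
b2 = Automaton(0, frozenset({2}), {(0, "detect"): 1, (1, "turn"): 2})
b12 = shuffle(b1, b2)
print(accepts(b12, ["see", "detect", "turn", "reach"]))
```

The product accepts any interleaving that preserves the event order inside each behavior, which is one simple way to read "more complex behaviors from existing ones"; it says nothing about how a supervisor for the combined behavior should be designed.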

11.2  Object  recognition 

Although  it  is  not  hard  to  see  how  to  study  navigation  in 
this  paradigm  of  behavioral  vision,  it  might  seem  hard 
to  apply  this  point  of  view  to  recognizing  objects.  What 
would  it  mean  to  have  behaviors  that  recognize  objects? 

This  difficulty  can  be  easily  avoided  by  attempting  to 
solve an easier problem, namely that of recognizing the
function  of  an  object®  (there  may  exist  many  functions 
for  a  single  object)  that  is  required  to  accomplish  the 
behavior  under  consideration  (an  agent  always  executes 
a  behavior).  So,  recognition  can  be  considered  in  the 
context  of  an  agent  performing  it  in  an  environment, 
while  executing  a  behavior. 

An  object  can  fulfill  a  function,  suit  a  purpose.  If  the 
agent  recognizes  this,  it  has  recognized  the  object.  In 
fact,  it  has  not  recognized  an  object  in  the  sense  that  it 
can  name  it  as  a  human  would,  but  it  has  recognized  it 
“well  enough”  to  act  on  it  (for  example,  use  it,  avoid  it, 
eat  it,  mate  with  it,  etc.).  But  in  most  cases,  deducing 
an  object’s  purpose  with  regard  to  the  current  behavior 
can  be  done  by  testing  the  existence  of  some  perceptual 
properties  of  the  image  of  the  object.  Usually,  to  find  out 
if  an  object  can  fulfill  a  function  we  need  to  perform  vari¬ 
ous  partial  recovery  tasks.  Thus,  without  reconstructing 
the  world  fully,  we  can  recognize  many  objects  to  the  de¬ 
gree  that  we  can  utilize  them  (examples:  big  and  moving 
closer  (danger),  man-made,  graspable,  movable,  of  cer¬ 
tain  size,  with  a  concavity  (cup),  etc.). 

Although  such  an  approach  does  not  address  all  as¬ 
pects  of  object  recognition,  it  seems  to  be  well  suited  to 
the  design  of  robots. 

12  Conclusions 

We  have  presented  the  foundations  behind  a  set  of  pro¬ 
cesses  that  interpret  visual  motion  in  a  purposive  man¬ 
ner.  We  showed  that  an  active  observer  can  solve  a  series 
of  important  problems  through  the  use  of  the  deriva¬ 
tives  of  the  image  intensity  function.  In  particular,  we 
presented  direct  solutions  for  the  problems  of  kinetic  sta¬ 
bilization  (passive  navigation),  detection  of  independent 
motion,  obstacle  avoidance,  relative  depth  and  3-D  mo¬ 
tion  (translation)  computation  and  visual  interception. 
Although  the  abovementioned  problems  are  applications 
of  the  general  structure  from  motion  problem,  we  ad¬ 
dressed  them  as  independent  problems  in  their  own  right 
and  produced  solutions  that  depend  on  data  which  can 
be  measured. 

®I.e.  not  recognizing  the  object  but  finding  out  enough 
information  about  it  to  utilize  it. 


The  possibility  that  important  behaviors  can  be  real¬ 
ized  by  the  cooperation  of  processes  that  recognize  per¬ 
ceptual  events  without  having  to  create  a  full  represen¬ 
tation  of  the  outside  world  suggests  that  vision  can  be 
studied  as  a  part  of  a  system  that  has  purposes  which 
translate  into  behaviors.  This  point  of  view  opens  several 
interesting  research  areas,  all  related  to  the  development 
of  intelligent  visual  behaviors.  We  have  pointed  out  var¬ 
ious  possible  formalizations  for  this  approach,  as  well  as 
the  associated  problems. 

Research  in  this  paradigm  will  become  more  interdis¬ 
ciplinary  with  time,  since  the  basic  premise  is  that  vision 
should  not  be  studied  in  isolation  but  as  a  part  of  an  in¬ 
telligent  system.  New  questions  about  control  arise,  and 
the  integration  of  vision  with  planning,  manipulation, 
memory  and  learning  will  provide  interesting  research 
avenues. 

Whether  this  behavioral  vision  paradigm  is  the  nat¬ 
ural  evolution  of  the  field  is  still  questionable.  This 
will  certainly  depend  on  the  results  that  are  generated. 
Behavioral  vision  addresses  a  normative  question  (what 
should  be),  i.e.  how  should  we  best  design  robots  for  a 
set  of  tasks.  Reconstructive  vision  addresses  a  theoreti¬ 
cal  question  (what  could  be),  i.e.  what  range  of  possible 
mechanisms  could  exist  in  vision  systems.  The  empirical 
question (what is), i.e. how actual biological systems are
designed, is addressed by other communities (psychology,
neuroanatomy, etc.), while the normative and theoretical
questions are studied by computer vision. And
although these three questions do not necessarily have
the same answers, they are closely related.

Abstract

This paper presents a computer algorithm which,
given a dense temporal sequence of intensity images
of multiple moving objects, will separate the images
into regions showing distinct objects, and for
those objects which are rotating, will calculate the
three-dimensional structure and motion. The method
integrates the segmentation of trajectories into
subsets corresponding to different objects with the
determination of the motion and structure of the objects.
Trajectories are partitioned into groups corresponding
to the different objects by fitting the
trajectories from each group to a hierarchy of increasingly
complex motion models. This grouping
algorithm uses an efficient motion estimation algorithm
based on the factorization of a measurement
matrix into motion and structure components.
Experiments are reported using two real image
sequences of 50 frames each to test the algorithm.

1  Introduction 

This paper is concerned with three-dimensional
structure and motion estimation for scenes containing
multiple independently moving rigid objects.
Our algorithm uses the image motion to separate the
multiple objects from the background and from
each other, and to calculate the three-dimensional
structure and motion of each such object. The
two-dimensional motion in the image sequence is
represented by the image plane trajectories of feature
points. The motion of each object, which describes
the rotation and translation of the object between the
images of the sequence, is computed from the
object’s feature trajectories. If the object on which a
particular group of feature points lies is rotating, the
relative three-dimensional positions of the feature
points, called the structure of the object, can also be
calculated.

Our  algorithm  is  based  on  the  following  assump¬ 
tions:  (1)  the  objects  in  the  scene  are  rigid,  i.e.,  the 
three-dimensional  distance  between  any  pair  of  fea¬ 
ture  points  on  a  particular  object  is  constant  over 
time,  (2)  the  feature  points  are  orthographically  pro¬ 
jected  onto  the  image  plane,  and  (3)  the  objects 
move  with  constant  rotation  per  frame.  This  algo¬ 
rithm  integrates  the  task  of  segmenting  the  images 
into  distinctly  moving  objects  with  the  task  of  esti¬ 
mating  the  motion  and  structure  for  each  object. 
These tasks are performed using a hierarchy of increasingly
complex motion models, and using an efficient
and accurate factorization-based motion and
structure estimation algorithm.

This paper makes use of an algorithm for factorization
of a measurement matrix into separate motion
and structure matrices as reported in [1]. In [2],
Tomasi and Kanade present a similar factorization-based
method which allows arbitrary rotations, but
does not have the capability to process trajectories
starting and ending at arbitrary frames. Furthermore,
it appears that some assumptions about
the magnitude or smoothness of motion are still
necessary to obtain feature trajectories. Kanade
points out [3] that with our assumption of constant
rotation we are absorbing the trajectory
noise primarily in the structure parameters,
whereas their algorithm absorbs it in both
the motion and structure parameters.

Most previous motion-based image sequence segmentation
algorithms use optical flow to segment
the images based on consistency of image plane motion.
Adiv in [4] and Bergen et al. in [5] instead segment
on the basis of a fit to an affine model. Adiv
further groups the resulting regions to fit a model of
a planar surface undergoing 3-D motions in perspective
projection. In [6] Boult and Brown show
how Tomasi and Kanade’s motion factorization
method can be used to split the measurement matrix
into parts consisting of independently moving rigid
objects.

2  Structure  and  Motion  Estimation 

In this section we describe the motion and structure
parameters which are used in Section 4, and we
briefly review our rotational motion and structure
estimation algorithm, the details of which are presented
in [1], [7], and [8]. The input to the structure
and motion estimation algorithm is a set of trajectories
of orthographically projected feature points lying
on a single rigid object rotating around a fixed-direction
axis and translating along an arbitrary
path. If these constraints do not hold exactly, the
algorithm will produce structure and motion parameters
which only approximately predict the input trajectories.
Given a collection of trajectories (possibly
all beginning and ending at different frames) for
which the constraints do hold, our algorithm finds
accurate estimates of the relative three-dimensional
positions of the feature points at the start of the
sequence and the angular and translational velocities
of the object. The algorithm also produces a confidence
number, in the form of an error between the
predicted and the actual feature point image positions.
Aside from an SVD, the algorithm is closed
form and requires no iterative optimization.


2.1  The Motion and Structure Models

We define the world coordinate system such that its
x-y axes are the image plane x-y axes. The object on
which the feature points are located rotates with a
constant angular velocity of $\Omega$ and an arbitrary
translation described by the sequence of locations
$c_o(f)$. We define an object centered coordinate system
whose z axis is parallel or anti-parallel to $\Omega$
(such that it lies in the positive z halfplane) and
which translates but does not rotate with the object.
We also let $\hat{n}$ be the object coordinate system vector
$[0\ 0\ 1]^T$ expressed in world coordinates. Then, letting
$O_p(f)$ be the position of point p in object centered
coordinates at the time frame f is recorded, the
positions $O_p(f)$ change over time due only to the
rotation of the object and can be expressed in terms of
$O_p(0)$ as $O_p(f) = G^f O_p(0)$, where $\omega = \hat{n} \cdot \Omega$ and

$$G = \begin{bmatrix} \cos\omega & -\sin\omega & 0 \\ \sin\omega & \cos\omega & 0 \\ 0 & 0 & 1 \end{bmatrix} \tag{1}$$

Since $\hat{n}$ contains only two degrees of freedom, we
replace it with the two dimensional foreshortening
vector $F$ defined as 0 if $\sqrt{n_x^2 + n_y^2} = 0$, or
[...] otherwise (2), where the sign is chosen such that
[...] $\geq 0$. Then if [...] (3)
and $c_o(f) = [c_{ox}(f)\ c_{oy}(f)]^T$, the image plane
coordinates of point p in frame f are (see [8])

$$P_p(f) = H G^f O_p(0) + c_o(f) \tag{4}$$

Equation (4) is the starting point for the next section
which briefly describes our algorithm for finding
the motion and structure parameters.

2.2  The Algorithm


Our algorithm uses sinusoidal retrieval methods [9]
which extract multi-dimensional sinusoidal signals
from noisy data. In our problem, the multi-dimensional
sinusoidal signal is the rotational component
of the motion, and the frequency of the recovered
sinusoid is $\omega$, while the amplitudes and phases
determine the structure and orientation of the points. We
start by taking the differences in point positions
$y_{pq}(f) = P_q(f) - P_p(f)$ to eliminate $c_o(f)$ in
Equation (4). The resulting purely rotational motion can
be expressed as a system of state equations

$$x_{pq}(f+1) = G\,x_{pq}(f) = G^{f+1} x_{pq}(0), \qquad y_{pq}(f) = H\,x_{pq}(f) \tag{5}$$

where

$$x_{pq}(f) = \begin{bmatrix} O_{qx}(f) - O_{px}(f) \\ O_{qy}(f) - O_{py}(f) \\ O_{qz}(f) - O_{pz}(f) \end{bmatrix} \tag{6}$$

The $y_{pq}(f)$ are arranged in a data matrix D which
can be factored using an SVD into two matrices, one
depending only on the rotational parameters, and
the other depending only on the structure parameters.
The construction of D and the method of finding
the structure and motion from the factors of D is
detailed in [1], [7], and [8].
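
The sketch below illustrates why such a factorization is possible: if the position differences obey a three-state linear system, a matrix that stacks the projected differences over frames has rank at most three, and a truncated SVD splits it into a frame-dependent (motion) factor and a point-dependent (structure) factor. The stacking order, the particular 2x3 matrix `H`, the rotation rate and the random structure are all assumptions chosen for illustration; the actual construction of D and the recovery of the parameters are given in [1], [7], and [8], not here.

```python
import numpy as np

def rotation_z(omega):
    """Per-frame rotation by omega about the object z axis."""
    c, s = np.cos(omega), np.sin(omega)
    return np.array([[c, -s, 0.0], [s, c, 0.0], [0.0, 0.0, 1.0]])

def difference_matrix(H, G, X0, n_frames):
    """Stack y(f) = H G^f x(0), f = 0..n_frames-1, into a (2*n_frames, n_pairs) array."""
    rows, X = [], X0.copy()
    for _ in range(n_frames):
        rows.append(H @ X)   # projected differences in this frame
        X = G @ X            # advance the state by one frame
    return np.vstack(rows)

rng = np.random.default_rng(0)
omega, tilt = 0.05, 0.4                      # hypothetical rotation rate and axis tilt
H = np.array([[1.0, 0.0, 0.0],
              [0.0, np.cos(tilt), -np.sin(tilt)]])  # assumed projection, not the paper's H
X0 = rng.normal(size=(3, 20))                # made-up 3-D position differences at frame 0

D = difference_matrix(H, rotation_z(omega), X0, n_frames=30)
U, S, Vt = np.linalg.svd(D, full_matrices=False)
rank = int(np.sum(S > 1e-9 * S[0]))          # numerical rank of the noise-free data
motion_factor = U[:, :rank] * S[:rank]       # depends on frames only
structure_factor = Vt[:rank]                 # depends on point pairs only
print("numerical rank:", rank)
```

The two factors are determined only up to an invertible 3x3 matrix, so additional constraints are needed to turn them into actual motion and structure estimates; with noisy trajectories the smaller singular values would no longer vanish and the rank would have to be chosen by thresholding.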

The  algorithm  described  in  this  section  provides  an 
efficient  way  to  incorporate  information  from  many 
frames  and  points  into  the  calculation  of  the  rota¬ 
tional  motion  and  structure  parameters.  When  there 
is  no  rotational  motion,  we  use  a  simple  linear  least 
squares  fit  to  a  translational  motion  model  (de¬ 
scribed  further  in  [7]  and  [8])  to  determine  the  mo¬ 
tion.  The  following  section  describes  how  this  trans¬ 
lational  motion  estimation  algorithm  and  the  rota¬ 
tional  motion  estimation  algorithm  are  used  to 
segment  the  input  trajectories  into  groups  corre¬ 
sponding  to  the  different  rigid  objects  and  to  deter¬ 
mine  the  structure  and  motion  parameters  for  each 
object. 


3  Image  Sequence  Segmentation  and 
Motion  and  Structure  Estimation 

The  segmentation  of  the  feature  point  trajectories 
into  groups  corresponding  to  the  differently  moving 
3D  objects  and  the  estimation  of  the  structure  and 
motion  of  these  objects  are  highly  interrelated  pro¬ 
cesses:  if  the  correct  segmentation  is  not  known,  the 
motion  and  structure  of  each  object  cannot  be  accu¬ 
rately  computed,  and  if  the  3D  motion  of  each  ob¬ 
ject  is  not  accurately  known,  the  trajectories  cannot 
be  segmented  on  the  basis  of  their  3D  motion.  To 
circumvent  this  circular  dependency,  we  integrate 
the  segmentation  and  the  motion  and  structure  esti¬ 
mation  steps  into  a  single  step,  and  we  incremental¬ 
ly  improve  the  segmentation  and  the  motion  and 
structure  estimates  as  each  new  frame  is  received. 

The general segmentation paradigm is split and
merge. Each group of trajectories (or region) in the
segmentation has associated with it one of three region
motion models, two of which describe rigid
motion (the translational and rotational motion
models), and the third (unmodeled motion) which
accounts for all motions which do not fit the two rigid
motion models and do not contain any local motion
discontinuities. When none of these motion
models accurately accounts for the motion in the region,
the region is split using a region growing technique.
When splitting a region, a measure of motion
consistency is computed in a small neighborhood
around each trajectory in the region. If the motion is
consistent for a particular trajectory, we assume that
the trajectories in the neighborhood all arise from
points on a single object. Thus the initial subregions
for the split consist of groups of trajectories with locally
consistent motion, and these are grown out to
include the remaining trajectories.

Initially  all  the  trajectories  are  in  a  single  region. 
Processing  then  continues  in  a  uniform  fashion:  the 
new  point  positions  in  each  new  frame  are  added  to 
the  trajectories  of  the  existing  regions,  and  then  the 
regions  are  processed  to  make  them  compatible 
with  the  new  data.  The  processing  of  the  regions  is 
broken  into  four  steps:  (1)  if  the  new  data  does  not 
fit  the  old  region  motion  model,  find  a  model  which 
does  fit  the  data  or  split  the  region,  (2)  add  any  new¬ 
ly  visible  points  or  ungrouped  points  to  a  compati¬ 
ble  region,  (3)  merge  adjacent  regions  with  compat¬ 
ible  motions,  (4)  remove  outliers  from  the  regions. 

Compatibility among feature points is checked using
the structure and rotational motion estimation
algorithm described in Section 2 or the translational
motion algorithm described in [7]. A region’s feature
points are considered incompatible if the fit error
returned by the appropriate motion estimation
algorithm is above a threshold. We assume that the
trajectory detection algorithm can produce trajectories
accurate to the nearest pixel, and therefore we
use a threshold (which we call the error threshold)
of one half of a pixel per visible trajectory point per
frame. The details of the four steps listed above may
be found in [7] or [10].
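
A compatibility test of this kind might be sketched as follows for the translational case. The sketch assumes that every trajectory is visible in every frame and that the translational model is p_i(f) = a_i + t(f), i.e. each point has its own offset and all points share one image translation per frame; the authors' model is described in [7] and may differ. The function names and the synthetic trajectories are hypothetical.

```python
import numpy as np

ERROR_THRESHOLD = 0.5  # pixels per visible trajectory point per frame

def translational_fit_error(tracks):
    """Mean residual (pixels) of a least-squares fit of p_i(f) = a_i + t(f).

    tracks has shape (n_points, n_frames, 2). With complete data the
    least-squares residual of this additive model is obtained by removing
    the per-point mean and the per-frame mean and adding back the grand mean.
    """
    tracks = np.asarray(tracks, dtype=float)
    point_mean = tracks.mean(axis=1, keepdims=True)
    frame_mean = tracks.mean(axis=0, keepdims=True)
    grand_mean = tracks.mean(axis=(0, 1), keepdims=True)
    residual = tracks - point_mean - frame_mean + grand_mean
    return float(np.linalg.norm(residual, axis=2).mean())

def is_compatible(tracks, threshold=ERROR_THRESHOLD):
    """A region is compatible if its fit error does not exceed the threshold."""
    return translational_fit_error(tracks) <= threshold

# Made-up example: 4 points over 10 frames.
frames = np.arange(10.0)
starts = np.array([[5.0, 5.0], [20.0, 8.0], [12.0, 30.0], [40.0, 22.0]])

def make_tracks(velocities):
    """Trajectories start + velocity * frame, shape (n_points, n_frames, 2)."""
    v = np.asarray(velocities, dtype=float)
    return starts[:, None, :] + v[:, None, :] * frames[None, :, None]

rigid = make_tracks([[1.0, 0.5]] * 4)                                   # one common translation
mixed = make_tracks([[2.0, 0.0], [2.0, 0.0], [-2.0, 0.0], [-2.0, 0.0]])  # two motions
print(is_compatible(rigid), is_compatible(mixed))
```

A region that fails this kind of test would then be tried against a more complex model or split, as described in the processing steps above.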

4  Experiments 

Our  algorithm  was  tested  on  two  real  image  se¬ 
quences  of  50  frames:  (1)  the  cylinder  sequence, 
consisting  of  images  of  a  cylinder  rotating  around  a 
nearly  vertical  axis  and  a  box  moving  right  with  re¬ 
spect  to  the  cylinder  and  the  background,  and  (2)  the 
robot arm sequence, consisting of images of a
Unimate® PUMA® Mark III robot arm with its second
and third joints rotating in opposite directions.
These  sequences  show  the  capabilities  of  the  ap¬ 
proach,  and  also  demonstrate  some  inherent  limita¬ 
tions  of  motion  based  segmentation  and  of  monocu¬ 
lar  image  sequence  based  motion  estimation. 

Figure 1 The image sequence segmentation found
for  the  cylinder  sequence  (the  segmentation  is 
superimposed  on  the  last  frame  of  the  sequence). 


Table 1 Comparison of the parameters estimated by
the algorithm and the true parameters for the cylinder
image sequence experiment.

Parameters  Estimated      Actual

w           -0.022         -0.017

F           (0, 0.94)      (0, 0.90)

v           (0.29, -0.19)  (0.14, 0)

Trajectories  were  detected  using  the  algorithm  de¬ 
scribed  in  [7]  (using  a  method  described  in  [11]), 
which  found  2598  trajectories  in  the  cylinder  se¬ 
quence  and  202  trajectories  in  the  robot  arm  se¬ 
quence.  These  trajectories  were  input  to  the  image 
sequence  segmentation  algorithm  described  in  Sec¬ 
tion  3,  which  partitioned  the  trajectories  into  groups 
corresponding  to  different  rigid  objects  and  estimat¬ 
ed  the  motion  and  structure  parameters. 

The  segmentation  for  the  cylinder  sequence  is 
shown  in  Figure  1.  The  algorithm  separated  out  the 
three  image  regions:  the  cylinder,  the  box,  and  the 
background.  The  cylinder  is  rotating,  and  thus  its 
structure can be recovered from the image sequence.
Figure 2 shows a projection along the cylinder
axis of the 3D point positions calculated from
the 1456 points on the cylinder. The points lie very
nearly on a cylindrical surface. Table 1 shows the
estimated and the actual motion parameters for the
cylinder. The error in the w estimate is large because
the  cylinder  is  rotating  around  an  axis  nearly  paral¬ 
lel  to  the  image  plane  and,  as  pointed  out  in  [12],  a 
rotation  about  an  axis  parallel  to  the  image  plane  is 


Figure  2  An  end-on  view  of  the  three-dimensional 
point  positions  calculated  by  our  structure  and 
motion  estimation  algorithm  from  point  trajectories 
derived from the cylinder image sequence.


Figure 3 The image sequence segmentation found
for the robot arm sequence (the segmentation is
superimposed on the last frame of the sequence).


inherently difficult to distinguish from translation
parallel to the image plane and perpendicular to the
rotation axis (this also explains the error in v). Note
that  the  predicted  trajectory  point  positions  still  dif¬ 
fer  from  the  actual  positions  by  an  average  of  less 
than  the  error  threshold  of  0.5  pixel.  The  accuracy 
of  the  motion  and  structure  estimation  algorithm  for 
less  ambiguous  motion  is  illustrated  in  the  experi¬ 
ments  on  the  robot  arm  sequence. 

The image sequence segmentation for the robot arm
sequence is shown in Figure 3. Note that several
stationary feature points (only two visible in Figure 3)
in the background are grouped with the second
segment of the arm. This occurs because any stationary

Table 2 Comparison of the estimated and the true
parameter values for the second (larger) segment of
the robot arm.


Table 3 Comparison of the estimated and the true
parameter values for the third (smaller) segment of
the robot arm.

Parameters  Estimated  Actual 

w  -0.0127  -0.0131 

F  (-0.43, 0.045)  (-0.46, 0.016) 

point  lying  on  the  projection  of  a  rotation  axis  with 
no  translational  motion  will  fit  the  motion  parame¬ 
ters  of  the  rotating  object.  Thus  these  points  are 
grouped  incorrectly  due  to  an  inherent  limitation  of 
segmenting  an  image  sequence  on  the  basis  of  mo¬ 
tion  alone.  The  remaining  points  are  grouped  cor¬ 
rectly  into  three  image  regions:  the  second  and  the 
third  segments  of  the  robot  arm,  and  the  back¬ 
ground.  The  two  robot  arm  segments  are  rotating 
and  their  three-dimensional  structure  was  recovered 
by the motion and structure estimation algorithm.
Only a small number of feature points were
associated with the robot arm segments, making it
difficult to illustrate the structure on paper, but the
estimated motion parameters of the second and third
robot  arm  segments  are  shown  in  Table  2  and  Table 
3,  respectively.  Note  that  all  the  motion  parameters 
were  very  accurately  determined. 

5  Conclusions 

The main features of our method are: (1) motion and
structure estimation and segmentation processes are
integrated, (2) frames are processed sequentially
with continual update of motion and structure
estimates and segmentation, (3) the motion and
structure estimation algorithm factors the trajectory data
into separate motion and structure matrices, (4)
aside from an SVD, the motion and structure
estimation algorithm is closed form with no nonlinear
iterative optimization required, (5) the motion and
structure estimation algorithm provides a confidence
measure for evaluating any particular segmentation.

Abstract

Understanding the visual motion of articulated
objects has broad applications. In this paper, we present
new algorithms for the recovery of 3D structure from the
motion of certain types of joints of articulated objects
under the assumption of perspective projection. Then,
under the assumption that a human body can be considered
as an articulated object and each part of the articulated
object is rigid, we apply these algorithms in combination
to motion analysis of human ambulatory patterns. The
algorithms have been tested in terms of uniqueness of the
real solution and speed of convergence with synthesized data.

1.  Introduction 

An articulated object is a collection of rigid or non-rigid
bodies which are connected by joints. Human vision
systems have significant ability to understand the visual
motion of articulated objects. The further study of how to
understand the visual motion of articulated objects by
means of computer vision may help us to gain insight into
the way biological visual systems work.
Despite a large body of motion analysis work on purely
rigid or even purely nonrigid objects [Huang 87], we
found that only a few published papers dealt with
articulated objects, which are neither purely rigid nor purely
non-rigid. Hoffman and Flinchbaugh presented an
algorithm for the recovery of 3D structure from biological
motion in 1982. Because they aimed at computing the 3D
structure and motion of animal limbs, a "planarity
assumption" is introduced in their paper, i.e. the part(s) around a
joint of an articulated object always move inside a single
plane. Orthographic projection is also assumed. Their
results are: 1) given three distinct orthographic projections
of the two endpoints of a rigid rod which is constrained to
rotate in a plane, the structure and motion compatible
with the three views are uniquely determined; 2) given two
distinct orthographic projections of the three endpoints of
two rigid rods linked in a hinge joint to form a pairwise-rigid
structure which is constrained to move in one plane,
the structure and motion compatible with the two views
are uniquely determined. No experimental results are
reported. Webb and Aggarwal presented their algorithm
for the recovery of 3D structure from fixed-axis motion in
1981. Orthographic projection is also assumed in their
paper. Their result is that the 3D structure of two points
which execute fixed-axis rotation can be uniquely
determined with four consecutive image frames.
Experimental results with real image data have been reported, but the
accuracy of the results is considered poor by the
authors.

Understanding the visual motion of articulated objects
also has broad applications. One application is to
apply the developed algorithms for 3D structure recovery
from the motion of articulated objects to motion analysis
of human ambulatory patterns. Studying how to
understand the visual motion of human ambulatory patterns is of
great importance for both vision research and applications.
For example, it can be used in model-based image coding
and transmission, where we may only need to transmit a
few estimated motion parameters instead of sending a
whole image sequence if certain human body models are
available at both the sender and receiver sides [Kimoto 91].
Another important application of understanding the visual
motion of articulated objects is in robotic vision, where
most robot arms are ideal articulated objects.

Following this section, we will first develop
algorithms for the recovery of 3D structure from the
motion of certain types of joints of articulated objects in
Section 2. Then in Section 3, we will study how to apply
these algorithms to motion analysis of human ambulatory
patterns. Finally in Section 4, we will give a brief
summary and discuss some future directions.

2.  3D  Structure  from  Motion  of  Articulated 
Objects 

Note that the articulated objects discussed in both
Hoffman's and Webb's papers are pairwise-rigid, i.e. each
part of the articulated object is rigid, and moreover each
joint has at most one degree of rotational freedom. In this
paper, we also focus on this kind of articulated object.
However, we will study these cases under the assumption of
perspective projection, because in practice camera
projection is always perspective and in some cases using
orthographic projection as an approximation cannot achieve the
desired accuracy.

Systematically, we can write equations which
describe the whole articulated object no matter how many
joints it has [Featherstone 87]. Due to the high order of the
nonlinear equations, however, it is very difficult to solve
such equations, and the uniqueness of the real solution of
the equations is also hard to study. Alternatively, here
we solve the whole system sequentially. Only
one joint is considered at a time. In the following
subsections, we will first develop algorithms for the recovery of
3D structure from fixed-axis motion. Then we will discuss
the more restricted cases, i.e. the cases of planar motion.
For the sake of simplicity, we will use identical characters
to denote points and their associated coordinate vectors in
3D space. For example, if there is a point p associated
with a 3D coordinate vector of $(x, y, z)^T$, then we will still
use character p to denote that vector, i.e. we let
$p = (x, y, z)^T$. We will use $\overrightarrow{p_0 p_1}$ to denote the vector from
point $p_0$ to point $p_1$, i.e. $\overrightarrow{p_0 p_1} = p_1 - p_0$, and use
$|\overrightarrow{p_0 p_1}|$ to denote its length. Because perspective projection is
assumed, we have image coordinates and their
corresponding 3D coordinates related as follows:

$X = \frac{x}{z}$  (2-1)

$Y = \frac{y}{z}$  (2-2)

where (X, Y) are image coordinates, (x, y, z) are the
corresponding 3D coordinates, and z is the depth. Then for
any point p, we have:

$p = (X, Y, 1)^T \cdot z$  (2-3)

In the following sections, we will write equations in
terms of the related 3D vectors. It is equivalent to write
those equations in terms of image coordinates, which are
usually supposed to be known, and 3D coordinates, which
are unknown. In fact, we can convert each equation from
either of the two forms to the other by using formula (2-3).
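
As a small illustration of formulas (2-1) to (2-3), the sketch below projects a 3D point to image coordinates and recovers it again from its image coordinates and depth. It assumes a unit focal length, as the formulas do; the function names are hypothetical.

```python
def project(p):
    """Perspective projection, formulas (2-1) and (2-2): X = x/z, Y = y/z."""
    x, y, z = p
    if z <= 0:
        raise ValueError("the point must lie in front of the camera (z > 0)")
    return (x / z, y / z)


def back_project(image_point, z):
    """Formula (2-3): p = (X, Y, 1)^T * z, given image coordinates and depth."""
    X, Y = image_point
    return (X * z, Y * z, z)
```

With the image coordinates known, the depth z is the only unknown per point, which is why the algorithms below count one unknown for each end point in each frame.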

2.1.  Algorithm  for  3D  Structure  from  Fixed-Axis 
Motion  (Algorithm  2.1) 

Suppose we are given an image sequence of one part of a
moving articulated object, see Figure 2.1-1. The only
allowed motion is that rod $p_0 p_1$ can rotate around an
unknown but fixed axis through joint $p_0$. The 3D
coordinates (or equivalently the depth only) of the joint $p_0$ are
given. The goal is to recover the 3D coordinates of point
$p_1$ in 3D space from the given image coordinates.


We attack this problem by solving a set of nonlinear
equations. Using four consecutive frames, we can first
write three degree-2 equations based on the rigidity
constraint:


$|\overrightarrow{p_0 p_1}| = |\overrightarrow{p_0' p_1'}| = |\overrightarrow{p_0'' p_1''}| = |\overrightarrow{p_0''' p_1'''}|$  (2.1-1)




Because the rod rotates around a fixed axis through
the joint, the end point $p_1$ remains in a single plane
during the rotation. This gives another equation, of degree 3:


$\overrightarrow{p_1'' p_1'''} \cdot (\overrightarrow{p_1 p_1'} \times \overrightarrow{p_1' p_1''}) = 0$  (2.1-2)

Note that we have four unknowns, and we have set up four
nonlinear equations. From Bezout's Theorem, the number
of possible solutions is no larger than $2^3 \cdot 3 = 24$.
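
The four equations of Algorithm 2.1 can be written as residuals that vanish at a valid solution. The sketch below is only an illustration under stated assumptions: it uses squared rod lengths for the rigidity constraint (which keeps each equation at degree 2 in the unknown depths), reads the planarity constraint as a scalar triple product of the end point displacements, and does not attempt the solving step itself. All function names and the synthetic data are hypothetical.

```python
import math


def sub(a, b):
    return (a[0] - b[0], a[1] - b[1], a[2] - b[2])


def dot(a, b):
    return a[0] * b[0] + a[1] * b[1] + a[2] * b[2]


def cross(a, b):
    return (a[1] * b[2] - a[2] * b[1],
            a[2] * b[0] - a[0] * b[2],
            a[0] * b[1] - a[1] * b[0])


def back_project(image_point, z):
    """Formula (2-3): 3D point from image coordinates and depth."""
    X, Y = image_point
    return (X * z, Y * z, z)


def fixed_axis_residuals(joints, image_points, depths):
    """Residuals of the rigidity and planarity constraints over four frames.

    joints: known 3D positions of p0 in the four frames.
    image_points: image coordinates of p1 in the four frames.
    depths: candidate depths of p1 (the four unknowns).
    """
    ends = [back_project(q, z) for q, z in zip(image_points, depths)]
    sq_len = [dot(sub(e, j), sub(e, j)) for e, j in zip(ends, joints)]
    # Three degree-2 equations: equal rod length in consecutive frames.
    rigidity = [sq_len[k + 1] - sq_len[k] for k in range(3)]
    # One degree-3 equation: the four end point positions are coplanar.
    planarity = dot(sub(ends[3], ends[2]),
                    cross(sub(ends[1], ends[0]), sub(ends[2], ends[1])))
    return rigidity + [planarity]


def synthetic_rod(angles=(0.0, 0.2, 0.4, 0.6)):
    """Rod end point rotating about a vertical axis through a fixed joint."""
    joint = (0.0, 0.0, 10.0)
    ends = [(2.0 * math.cos(a), 1.0, 10.0 + 2.0 * math.sin(a)) for a in angles]
    image_points = [(x / z, y / z) for x, y, z in ends]
    depths = [z for _, _, z in ends]
    return [joint] * 4, image_points, depths
```

For the synthetic rod, the true depths make all four residuals vanish up to rounding, while a perturbed depth does not; a solver would search for the depths at which the residuals are zero.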

2.2.  Algorithms  for  3D  Structure  from  Planar 
Motion 

A.  Case  1  (Algorithm  2.2-A) 

Here again we are given an image sequence of one part of
a moving articulated object, see Figure 2.2-A. But rod
$p_0 p_1$ now is restricted to remain in a plane during its
rotation. The 3D coordinates (or equivalently the depth only)
of the joint $p_0$ are still given. The goal is to recover the 3D
coordinates of point $p_1$ in 3D space from the given image
coordinates.


Using only three consecutive frames this time, we can
write two degree-2 equations based on the rigidity
constraint:


$|\overrightarrow{p_0 p_1}| = |\overrightarrow{p_0' p_1'}| = |\overrightarrow{p_0'' p_1''}|$  (2.2-A1)


Then based on the planarity assumption, we can write
another degree-3 equation:


$\overrightarrow{p_0 p_1} \cdot (\overrightarrow{p_0' p_1'} \times \overrightarrow{p_0'' p_1''}) = 0$  (2.2-A2)


In this case, we have three unknowns in total. Three
nonlinear equations are available. From Bezout's
Theorem, the number of solutions is no larger than
$2^2 \cdot 3 = 12$.

B.  Case  2  (Algorithm  2.2-B) 

Fewer  frames  are  needed  for  3D  structure  recovery,  if  we 
consider  two  rods  at  a  time,  and  if  the  motion  is  planar,  i.e. 
the  two  rods  remain  in  a  single  plane  during  their  rotation. 
See  Figure  2.2-B. 

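With only two consecutive frames, our first two
degree-2 equations still come from the rigidity constraints:


$|\overrightarrow{p_0 p_1}| = |\overrightarrow{p_0' p_1'}|$  (2.2-B1)


$|\overrightarrow{p_1 p_2}| = |\overrightarrow{p_1' p_2'}|$  (2.2-B2)


Then based on the planarity assumption, we have two
degree-3 equations:


$\overrightarrow{p_1' p_2'} \cdot (\overrightarrow{p_0 p_1} \times \overrightarrow{p_1 p_2}) = 0$  (2.2-B3)


$\overrightarrow{p_1' p_0'} \cdot (\overrightarrow{p_0 p_1} \times \overrightarrow{p_1 p_2}) = 0$  (2.2-B4)

There are four unknowns and four nonlinear
equations are available. From Bezout's Theorem, the number
of solutions is no larger than $2^2 \cdot 3^2 = 36$.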
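
The three solution-count bounds quoted in this section follow from the same arithmetic: for a square polynomial system, the Bezout bound is the product of the degrees of the equations. A minimal check, with a hypothetical helper name:

```python
from math import prod


def bezout_bound(degrees):
    """Upper bound on the number of isolated solutions of a square
    polynomial system: the product of the equation degrees."""
    return prod(degrees)


# Degrees of the equations in each algorithm, as stated in the text.
ALGORITHM_DEGREES = {
    "2.1": [2, 2, 2, 3],    # three rigidity equations, one planarity equation
    "2.2-A": [2, 2, 3],     # two rigidity equations, one planarity equation
    "2.2-B": [2, 2, 3, 3],  # two rigidity equations, two planarity equations
}

BOUNDS = {name: bezout_bound(d) for name, d in ALGORITHM_DEGREES.items()}
```

The bound counts complex solutions, so the number of real solutions that are physically meaningful can be much smaller.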


To analyze the uniqueness of the real-valued solution of
the equations in the above cases, where perspective
projection is assumed, is much more difficult than in the
corresponding cases under the assumption of orthographic
projection. At this point, we have only done a number of
numerical experiments. We solve the above nonlinear
equations by using a continuation method. We found that in
all cases the real solution was unique up to a reflection.
The computing time is reasonable. For example, it takes
about five minutes on average for the algorithm to
converge for the equations in 2.2-B on a SUN-3 workstation.

3.  Motion  Analysis  of  Human  Ambulatory 
Patterns 

Up to now, few papers in computer vision have discussed
motion analysis of human ambulatory patterns, and still fewer
have done it in a quantitative way. Johansson published the
first related paper in 1973. Besides the sophisticated
psychological experiments, only a rough qualitative
explanation is given in his papers [Johansson 73, 76]. Rashid's
paper published in 1980 describes his work on the
segmentation of moving light patterns which are produced by
moving articulated objects like a human being or a dog
against a stationary background. The segmentation is done
based on 2D information in images. No further 3D
information recovery or motion estimation has been reported.
In fact, the paper claims that the segmentation results can
be a useful starting point for further analysis [Rashid 80].
O'Rourke and Badler's work in 1980 is model-based.
Unlike the others, the proposed algorithm employs a
complete human body model to track input human body
motion. In the paper, the algorithm has been tested only
with synthesized input data. Because the tracking results
depend heavily on the segmentation results from images, it
is not clear how well the algorithm will work with real
images [O'Rourke 80]. Recently, Terzopoulos and
Metaxas suggested using parameterized superquadrics to
model human bodies and then doing motion analysis based on
that model. Unfortunately, no further details have been
reported in their paper [Terzopolous 90].


In this section, we apply the proposed algorithms for
3D structure recovery from the motion of articulated
objects to motion analysis of human ambulatory patterns.
We assume that the segmentation of human bodies and the
extraction of the joints on human bodies from images have
already been completed. In other words, we start with the
obtained skeletons of human bodies in image sequences. In
each image frame, typically such a skeleton may look like
the one in Figure 3-1.

Based on anatomical reasons, it is suggested that the
four limb joints, i.e. $p_{R1}$, $p_{L1}$, $p_{R4}$, and $p_{L4}$ in the figure,
allow only planar motion during ambulatory motion
[Hoffman 82], and that the other joints allow only fixed-axis motion
[Webb 81]. So we can apply the algorithms developed in
the previous section to them correspondingly. It is
observed that a human body moves periodically during
ambulatory motion. Each period consists of two halves
corresponding to the right and the left, respectively.
During each half of a period, one foot remains stationary
on the ground. So we can use that foot as a reference point
when we want to do motion analysis during that time
interval. Because we use image sequences from a single
camera, we can only recover the relative depth information
[Huang 87]. Hence, without loss of generality, we can
assume that the 3D coordinates of the reference point are
given. Then for a time instant t which belongs to a selected
half of a period, we can start our analysis from the bottom up
as follows. We will use $\bar{p}$ to denote a point $p$ whose 3D
coordinates are given or have been determined.

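Step 1.

Algorithm 2.2-B ($\bar{p}_{R0}(t)$, $p_{R1}(t)$, $p_{R2}(t)$) ->
($\bar{p}_{R1}(t)$, $\bar{p}_{R2}(t)$), rotation $R_{R0R1}$;
transform $p_{R2}$, $p_{R3}$, $p_{R4}$, $p_{R5}$, $p_{L2}$, $p_{L1}$, $p_{L0}$, $p_{L3}$,
$p_{L4}$, and $p_{L5}$ based on $R_{R0R1}$;

Step 2.

Algorithm 2.1 ($\bar{p}_{R2}(t)$, $p_{R3}(t)$) -> ($\bar{p}_{R3}(t)$),
rotation $R_{R2R3}$;
transform $p_{R3}$, $p_{R4}$, and $p_{R5}$ based on $R_{R2R3}$;
Algorithm 2.1 ($\bar{p}_{R2}(t)$, $p_{L2}(t)$) -> ($\bar{p}_{L2}(t)$),
rotation $R_{R2L2}$;
transform $p_{L2}$, $p_{L1}$, $p_{L0}$, $p_{L3}$, $p_{L4}$, and $p_{L5}$ based
on $R_{R2L2}$;

Step 3.

Algorithm 2.2-B ($\bar{p}_{R3}(t)$, $p_{R4}(t)$, $p_{R5}(t)$) ->
($\bar{p}_{R4}(t)$, $\bar{p}_{R5}(t)$);
Algorithm 2.2-B ($\bar{p}_{L2}(t)$, $p_{L1}(t)$, $p_{L0}(t)$) ->
($\bar{p}_{L1}(t)$, $\bar{p}_{L0}(t)$);
Algorithm 2.1 ($\bar{p}_{L2}(t)$, $p_{L3}(t)$) -> ($\bar{p}_{L3}(t)$),
rotation $R_{L2L3}$;
transform $p_{L3}$, $p_{L4}$, and $p_{L5}$ based on $R_{L2L3}$;

Step 4.

Algorithm 2.2-B ($\bar{p}_{L3}(t)$, $p_{L4}(t)$, $p_{L5}(t)$) ->
($\bar{p}_{L4}(t)$, $\bar{p}_{L5}(t)$);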

It is obvious that the way to combine the algorithms
to recover the 3D information of all the joints is not
unique. For example, we can employ Algorithm 2.2-A
instead of Algorithm 2.2-B. But as we pointed out in
Section 2.2, the former algorithm uses more frames than the
latter.
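
The bottom-up ordering of the four steps can be checked mechanically: each algorithm call must start from a joint whose 3D position is already known. The sketch below encodes one reading of the step listing as data. The joint labels R0 to R5 and L0 to L5 (foot up to hand on each side) are an assumption about the skeleton figure, and `run_schedule` is a hypothetical helper that only tracks which joints are determined; it does not solve any equations.

```python
# Each entry: (algorithm, joints involved with the known joint first,
# joints recovered by the call). Labels are assumed, not from the figure.
SCHEDULE = [
    ("2.2-B", ["R0", "R1", "R2"], ["R1", "R2"]),  # Step 1
    ("2.1", ["R2", "R3"], ["R3"]),                # Step 2
    ("2.1", ["R2", "L2"], ["L2"]),
    ("2.2-B", ["R3", "R4", "R5"], ["R4", "R5"]),  # Step 3
    ("2.2-B", ["L2", "L1", "L0"], ["L1", "L0"]),
    ("2.1", ["L2", "L3"], ["L3"]),
    ("2.2-B", ["L3", "L4", "L5"], ["L4", "L5"]),  # Step 4
]


def run_schedule(schedule, reference="R0"):
    """Return the set of determined joints, starting from the reference foot.

    Raises ValueError if a call would start from a joint that has not yet
    been determined, i.e. if the ordering is not bottom-up.
    """
    known = {reference}
    for algorithm, joints, recovered in schedule:
        anchor = joints[0]
        if anchor not in known:
            raise ValueError(
                f"Algorithm {algorithm} needs joint {anchor}, "
                "which is not yet determined")
        known.update(recovered)
    return known
```

Replacing a 2.2-B entry by the single-rod Algorithm 2.2-A would keep the same dependency structure but, as noted above, would require more frames.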

4.  Summary 

In this paper, we have developed three algorithms under
the assumption of perspective projection for the recovery
of 3D structure from the motion of several types of joints
of articulated objects, viz. joints which allow only
fixed-axis rotation and joints which allow only planar rotation.
Then we apply these algorithms to analysis of human
ambulatory motion in a combined way. Our numerical
experimental results with synthesized data have indicated
that the real solution from the proposed algorithms was
always unique up to a reflection and the time of
convergence was reasonable.

Understanding the visual motion of the human body is an
important issue whose study has started quite recently.
There is a long way to go and many sub-topics are now
totally open. Even in the special case of understanding
human ambulatory motion, there are still a number of
topics which need to be studied. Among them, the
segmentation of human bodies from images is the first and
maybe also the hardest. Gait analysis is another important
and interesting topic to be studied. And in some
applications, more sophisticated human models and/or motion
models may be needed. In the meantime, testing the
present algorithms with real image data and finding ways
to improve them are our near-term research goals.

Abstract

We  describe  a  method  of  visual  motion  recog¬ 
nition  applicable  to  a  range  of  naturally  occur¬ 
ring  motions  that  are  characterized  by  spatial 
and  temporal  uniformity.  The  underlying  mo¬ 
tivation  is  the  observation  that,  for  objects  that 
typically  move,  it  is  frequently  easier  to  identify 
them  when  they  are  moving  than  when  they  are 
stationary.  Specifically,  we  show  that  certain 
statistical  spatial  and  temporal  features  that 
can be derived from approximations to the
motion  field  have  invariant  properties,  and  can 
be  used  to  classify  regional  activities  such  as 
windblown  trees,  ripples  on  water,  or  chaotic 
fluid  flow,  that  are  characterized  by  complex, 
non-rigid  motion.  We  refer  to  the  technique 
as  temporal  texture  analysis  in  analogy  to  the 
techniques  developed  to  classify  gray-scale  tex¬ 
tures.  This  recognition  approach  contrasts  with 
the  reconstructive  approach  that  has  typified 
most  prior  work  on  motion.  We  demonstrate 
the  technique  on  a  number  of  real-world  im¬ 
age  sequences  containing  complex  movement. 

The  work  has  practical  application  in  monitor¬ 
ing  and  surveillance,  and  as  a  component  of  a 
sophisticated  visual  system. 

1  Motion  Recognition 

The  use  of  visual  motion  for  the  quantitative  reconstruc¬ 
tion  of  world  geometry  has  been  extensively  studied. 
This  has  sometimes  obscured  the  fact  that  motion  can 
also  be  used  for  recognition.  In  fact,  in  biological  sys¬ 
tems,  the  use  of  motion  information  for  recognition  is 
often  more  evident  than  its  use  in  reconstruction.  A 
simple  example  occurs  in  the  case  of  the  common  toad 
Bufo  bufo  for  which  any  elongated  object  within  a  cer¬ 
tain  size  range  that  exhibits  motion  along  the  long  axis 
is  identified  as  a  potential  food  item,  and  elicits  an  ori¬ 
enting  response  [Ewart,  1987].  Birds  ignore  the  natural 
movement  of  trees  in  the  wind,  but  respond  immediately 
to  the  approach  of  a  predator.  More  generally,  stylized 
movements  seem  to  be  a  universal  form  of  communi¬ 
cation  between  animals  with  eyes,  from  the  aggressive 
posturing of various fiddler crabs (Uca species), to the
mating dance of the blue footed booby (Sula Nebouxi),
to  the  expressive  facial  movements  of  baboons. 

Humans  have  a  remarkable  ability  to  recognize  dif¬ 
ferent  kinds  of  motion,  both  of  discrete  objects,  such  as 
animals  or  people,  and  in  distributed  patterns  as  in  wind¬ 
blown  leaves  or  waves  on  a  pond.  A  classic  illustration 
of  motion  recognition  by  humans  is  provided  by  Moving 
Light  Display  experiments  where  the  sole  source  of  in¬ 
formation  about  a  moving  actor  is  provided  by  lighted 
points  attached  to  a  few  joints  [Johansson,  1973].  People 
shown  these  images  dismiss  single  frames  as  meaningless 
dot  patterns  but  can  recognize  characteristic  gaits  such 
as running or walking, and even gender and familiar
individuals from the sequential presentation.

Such  abilities  suggest  that,  in  the  case  of  machine  vi¬ 
sion,  it  might  be  possible  to  use  motion  directly  as  a 
means  of  recognition  rather  than  indirectly  through  a 
geometric  reconstruction.  In  addition  to  the  biological 
motivations,  there  are  computational  reasons  for  consid¬ 
ering  motion  as  a  recognition  modality.  One  advantage 
is  that  the  motion  field,  insomuch  as  it  can  be  extracted 
at  all,  is  robust  with  respect  to  lighting  changes,  and 
much  more  simply  related  to  shape  than  is  image  lumi¬ 
nance.  Furthermore,  if  the  task  is  to  find  an  object  that 
is  known  to  be  moving,  motion  can  be  used  to  efficiently 
presegment  the  scene  into  regions  of  high  and  low  inter¬ 
est.  This  can  frequently  be  done  even  if  the  observer  is 
itself moving [Nelson, 1991].

Motion recognition has seen limited attention in the
literature. Most of the work that has been done involves
the analysis of moving light displays. A domain-independent
approach to this problem is given by Rashid
[1980]. Goddard [1988] considers the representation and
recognition  of  event  sequences  from  moving  light  display 
images  of  human  beings.  He  uses  the  joint  angles  and 
angular  velocities  computed  from  the  motion  of  the  dots 
in  the  light  displays.  The  joint  angles  and  angular  veloc¬ 
ities  are  invariant  to  scale,  translation,  and  rotation  in 
the  image  plane.  A  challenging  part  in  computing  these 
invariants  is  to  recover  the  underlying  connectivity  of  the 
individual  dots  in  the  MLD  images. 

A number of potential applications for motion
recognition exist. One example is detecting malfunctions in
industrial automation processes. This could be done either
by detecting unusual motion or by noting the absence
of a familiar motion sequence that should be present in
normal  operation.  In  some  applications,  a  visual  sys¬ 
tem  could  monitor  a  process  more  effectively,  and  be 
retrofitted  more  easily  than  a  large  set  of  specialized 
sensors.  Motion  detection  can  also  be  used  in  surveil¬ 
lance  applications  where  the  system  must  distinguish  en¬ 
vironmental  motions  such  as  plants  waving  in  the  wind 
from  unfamiliar  motions  such  as  the  movement  of  an 
intruder  into  prohibited  areas.  Other  examples  include 
gesture  recognition  [Rhyne  and  Wolf,  1986]  and  hand¬ 
writing  analysis. 

2  Temporal  Texture 

Classical gray-level texture analysis is concerned with
the  identification  of  spatial  invariances  in  the  gray-level 
patterns  in  an  image  region.  These  invariances  may  be 
either  structurally  or  statistically  defined.  The  basic  idea 
is  to  characterize  different  sorts  of  “stuff”  of  indetermi¬ 
nate  spatial  extent  in  terms  of  such  invariances.  In  this 
article we extend this basic idea into the temporal
dimension with the idea of recognizing similar “stuff” in
dynamic  scenes.  This  is  motivated  in  part  by  the  exis¬ 
tence  of  a  large  class  of  natural  phenomena  that  seem 
to  have  characteristic  motions,  but  indeterminate  spa¬ 
tial  extent.  Examples  include  windblown  trees  or  grass, 
turbulent  flow  in  cloud  patterns,  ripples  on  water,  falling 
snow,  the  motion  of  a  flock  of  birds  or  a  crowd  of  people, 
etc.  The  motion  in  a  temporal  texture  is  distinct  from 
that  in  patterns  such  as  walking,  cycling  etc.  which  in¬ 
volve  structure  at  a  single  location. 

Temporal  texture  could  be  analyzed  directly  as  a  three 
dimensional  signal  using  generalizations  of  the  tech¬ 
niques  applied  to  two-dimensional  fields.  However,  since 
most  changes  along  the  time  dimension  are  due  to  mo¬ 
tion  in  the  image,  it  makes  sense  to  preprocess  the  time- 
varying  image  to  obtain  motion  information,  as  it  is  in 
object  motion  that  the  physical  invariances  lie.  In  this 
case,  a  natural  choice  is  the  optic  flow  field.  The  basic 
source  of  information  is  thus  a  time  varying  vector  field 
representing  an  approximation  to  the  two  dimensional 
motion  field  induced  by  movement  in  the  world.  Such 
a  field  contains  considerably  more  information  than  the 
scalar-valued field associated with gray-level texture
analysis. In addition, the direction and magnitude of motion
have a more direct relationship to typically salient events
in the world than the gray-level of a single pixel.
Consequently, certain types of recognition might be expected
to be easier. One problem with using optic flow is that it
is difficult to compute accurately. One solution is to devise
measures that are insensitive to inaccuracy. Another is
to utilize partial information. An example is the gradient
parallel component of the optic flow, which is simpler to
compute locally from an image sequence than the full
motion field.
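
The gradient parallel component mentioned above can be computed at a single pixel from the spatial and temporal image derivatives. The sketch below is a minimal illustration that assumes the standard brightness constancy constraint $I_x u + I_y v + I_t = 0$, which the text does not spell out; the function name and the `eps` cutoff are hypothetical.

```python
def normal_flow(ix, iy, it, eps=1e-9):
    """Gradient-parallel (normal) component of the optic flow at one pixel.

    ix, iy: spatial image derivatives; it: temporal derivative.
    Assumes brightness constancy, ix*u + iy*v + it = 0, which determines
    only the flow component along the gradient direction.
    Returns None where the gradient is too small to define a direction.
    """
    mag2 = ix * ix + iy * iy
    if mag2 < eps:
        return None
    scale = -it / mag2
    return (scale * ix, scale * iy)
```

Because only local derivatives are needed, this component avoids the neighborhood-wide estimation that the full motion field requires, at the cost of losing the flow component perpendicular to the gradient.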

Despite  the  differences  in  domain,  some  techniques  of 
spatial  texture  analysis  are  applicable  to  temporal  tex¬ 
tures.  Spatial  texture  analysis  is  traditionally  performed 
using  either  statistical  or  syntactic  methods.  Statistical 
methods  utilize  measures  of  local  features  that  are  ex¬ 
pected  to  be  similar  within  patches  of  the  same  texture. 
Examples include gray-level co-occurrence matrices, gradient
uniformity measures, local Fourier power spectra,
average response of oriented masks, and estimates of parameters
for Markov generating processes. Syntactic approaches
are most appropriate for highly regular textures
and involve analyzing the geometric arrangement
of  primitive  structural  elements.  In  the  case  of  natural 
temporal  textures,  techniques  similar  to  the  statistical 
gray-level  methods  seem  most  appropriate,  and  most  of 
the  features  described  in  this  article  are  of  this  type. 
As  with  spatial  textures,  the  main  criteria  for  selecting 
features  are  that  they  change  little  within  a  given  tex¬ 
ture  (i.e.  an  area  of  the  same  stuff),  and  that  they  vary 
significantly  between  different  textures. 

The  dimensionality  of  the  vector-valued  flow  field  and 
the  fact  that  measures  can  be  made  in  both  space  and 
time  allows  considerable  latitude  in  designing  features. 
Since  textures  are  characterized  by  statistical  regularities 
in  the  occurrence  of  local  structure,  extraction  of  fea¬ 
tures  useful  for  classification  generally  involves  at  least 
two  tiers  of  processing:  A  local  feature  extraction  stage, 
and  (at  least  one)  spatially  or  temporally  extended  in¬ 
tegration  stage.  Local  features  can  be  any  useful  quan¬ 
tity  that  can  be  associated  with  a  point  in  the  image. 
Examples  include  flow  magnitude  and  direction,  differ¬ 
ential  measures  such  as  divergence  and  curl,  and  local 
uniformity  measures.  The  spatio-temporal  motion  en¬ 
ergy  filters  introduced  by  Heeger  [1987]  could  also  pro¬ 
vide  useful  measures  in  this  context.  Typically  these 
are  expected  to  vary  within  a  texture,  thus  necessitat¬ 
ing  the  integration  phase.  Extended  measures  are  most 
frequently  based  on  quantities  such  as  means  or  vari¬ 
ances,  but  other  extended  measures,  such  as  Fourier  co¬ 
efficients  and  co-occurrence  statistics  can  be  used.  The 
most  typical  structure  for  a  temporal  texture  feature  in¬ 
volves  extended  spatial  or  temporal  (or  both)  measures 
of  spatio-temporal  microfeatures.  Features  can  also  be 
derived  from  extended  spatial  measures  of  extended  tem¬ 
poral  features  and  vice  versa. 

In  order  to  simplify  the  motion  preprocessing,  we  con¬ 
sidered  features  based  on  the  gradient  parallel  compo¬ 
nent  of  the  motion  field,  also  referred  to  as  the  normal 
flow.  The  simplest  local  motion  measures  are  the  mag¬ 
nitude  and  direction  of  the  normal  flow.  We  examine 
several  statistical  features  based  on  the  distribution  of 
these  first-order  quantities.  The  direction  and  magni¬ 
tude  can  be  combined  locally,  both  spatially  and  tempo¬ 
rally  to  obtain  second  order  local  motion  measures.  We 
also  examine  features  based  on  the  distribution  of  some 
second  order  measures.  All  these  are  described  below. 

A  useful  first  order  statistic  can  be  derived  from  the 
distribution  of  flow  directions.  Intuitively,  what  is  being 
measured  is  the  non-uniformity  in  direction  of  motion. 
Our  non-uniformity  statistic  was  computed  by  discretiz¬ 
ing  the  direction  into  8  possible  values,  computing  a  his¬ 
togram  over  the  relevant  neighborhood  of  the  image,  and 
summing the absolute deviation from a uniform distribution.
It should be noted that the normal flow direction
at a pixel is parallel (or anti-parallel) to the gradient
direction. Thus measures based on the normal flow direction
alone depend on the underlying intensity texture.
To reduce this dependence, the normal flow directions in
the histogram are normalized by the 4-way histogram
of gradient directions. This feature is invariant under
translation, rotation, and temporal and spatial scaling.

G: flow of water in a river.

A useful statistic based on the distribution of the normal
flow magnitude is the average flow magnitude divided
by its standard deviation. The scaling by the
standard deviation makes the measure
robust under scaling changes. One way to think of
this statistic is as a measure of “peakiness” in the velocity
distribution. It is also invariant under translation,
rotation, and temporal and spatial scaling.
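
The two first-order statistics described above can be sketched as follows. This is a minimal illustration, not the authors' implementation: the function names are hypothetical, directions are assumed to be already quantized to integer codes 0..7, and the normalization by the 4-way gradient-direction histogram is omitted because the text does not specify its exact form.

```python
import numpy as np

def direction_nonuniformity(directions, n_bins=8):
    """Sum of absolute deviations of the direction histogram from uniform.

    `directions` holds integer direction codes in 0..n_bins-1 (assumed
    already quantized). The gradient-direction normalization mentioned in
    the text is NOT applied here.
    """
    d = np.asarray(directions, dtype=int).ravel()
    hist = np.bincount(d, minlength=n_bins).astype(float)
    hist /= hist.sum()                      # relative frequencies
    return float(np.abs(hist - 1.0 / n_bins).sum())

def magnitude_peakiness(magnitudes):
    """Mean normal-flow magnitude divided by its standard deviation."""
    m = np.asarray(magnitudes, dtype=float).ravel()
    return float(m.mean() / m.std())
```

Because both the mean and the standard deviation scale linearly with the flow magnitude, `magnitude_peakiness` is unchanged when all magnitudes are multiplied by a constant, which is the scaling robustness the text refers to.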

Second order statistics of the normal flow direction
distribution can be derived from the difference statistics,
which give the number of pixel pairs at a given offset
that differ in their values by a given amount. These difference
statistics can be represented by a co-occurrence
matrix of the normal flow direction surrounding a pixel.
Co-occurrence matrices are computed for four directions
(horizontal, vertical, positive diagonal and negative diagonal),
at a distance proportional to the average flow
magnitude. This yields invariance with respect to scaling.
In each direction the ratio of the number of pixel
pairs differing in direction by at most one to the number
of pixel pairs differing by more than one is computed.
With eight quantized directions the circular difference takes the values 0 through 4,
so this is the ratio of the sum of the first two difference statistics
to the sum of the last three difference statistics. Logarithms
of the resulting ratios are used as a feature in
each of the four directions, and represent a measure of
the spatial homogeneity of the flow. These features are
invariant under translation, rotation and scaling.
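
The directional difference features can be sketched as below. This is an illustrative reading of the description, with several assumptions: the names `OFFSETS` and `difference_features` are hypothetical, the `distance` argument stands in for the offset "proportional to the average flow magnitude", and one is added to both counts so that the logarithm stays finite when a count is zero (the text does not say how that case was handled).

```python
import numpy as np

# (row step, column step) for the four pair directions
OFFSETS = {
    "horizontal": (0, 1),
    "vertical": (1, 0),
    "pos_diagonal": (-1, 1),
    "neg_diagonal": (1, 1),
}

def difference_features(directions, distance=1, n_dirs=8):
    """Log ratio of similar to dissimilar direction pairs, per pair direction.

    `directions` is a 2-D array of integer direction codes in 0..n_dirs-1.
    Pairs whose circular difference is 0 or 1 count as similar; pairs
    differing by 2, 3 or 4 count as dissimilar.
    """
    d = np.asarray(directions, dtype=int)
    rows, cols = d.shape
    feats = {}
    for name, (dr, dc) in OFFSETS.items():
        dr, dc = dr * distance, dc * distance
        r0, r1 = max(0, -dr), rows - max(0, dr)
        c0, c1 = max(0, -dc), cols - max(0, dc)
        if r1 <= r0 or c1 <= c0:            # offset larger than the region
            feats[name] = 0.0
            continue
        a = d[r0:r1, c0:c1]
        b = d[r0 + dr:r1 + dr, c0 + dc:c1 + dc]
        diff = np.abs(a - b) % n_dirs
        diff = np.minimum(diff, n_dirs - diff)   # circular difference 0..4
        similar = np.count_nonzero(diff <= 1)
        dissimilar = np.count_nonzero(diff > 1)
        # add-one smoothing is an assumption, used only to avoid log(0)
        feats[name] = float(np.log((similar + 1.0) / (dissimilar + 1.0)))
    return feats
```

For a field made of vertical bands moving in opposite directions, the vertical feature comes out large and the horizontal one small, which matches the way the text says texture B is separated from the others.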

Finally,  we  considered  statistics  of  some  second  order 
flow  features,  namely,  estimates  of  the  divergence  and 
curl  of  the  motion  field  obtained  from  the  normal  flow. 
Positive  and  negative  divergence,  and  positive  and  neg¬ 
ative  curl  were  taken  as  separate  features  to  give  four 
different  second  order  features.  The  features  used  are 
the  mean  values  of  these  quantities  over  the  region  of  in¬ 
terest.  They  are  invariant  with  respect  to  rotation  and 
translation,  but  not  scaling.  If  scale  invariant  features 
are  desired,  ratios  of  the  differential  measures  can  be 
used. 
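
As a rough illustration of the four differential features, the sketch below computes mean positive and negative divergence and curl by finite differences. It assumes that an estimate of both flow components (`fx`, `fy`) is available on a regular grid; how the authors obtained divergence and curl estimates from the normal flow alone is not described in the text, so this is not their procedure. The function name is hypothetical.

```python
import numpy as np

def divergence_curl_features(fx, fy, spacing=1.0):
    """Mean positive/negative divergence and curl of a 2-D flow field.

    fx, fy: 2-D arrays of the x and y flow components, indexed [row, col]
    with rows along y and columns along x (an assumed convention).
    """
    fx = np.asarray(fx, dtype=float)
    fy = np.asarray(fy, dtype=float)
    dfx_dy, dfx_dx = np.gradient(fx, spacing)   # derivatives along rows, cols
    dfy_dy, dfy_dx = np.gradient(fy, spacing)
    div = dfx_dx + dfy_dy
    curl = dfy_dx - dfx_dy
    return {
        "pos_div": float(np.maximum(div, 0.0).mean()),
        "neg_div": float(np.maximum(-div, 0.0).mean()),
        "pos_curl": float(np.maximum(curl, 0.0).mean()),
        "neg_curl": float(np.maximum(-curl, 0.0).mean()),
    }
```

A uniformly expanding field (`fx = x`, `fy = y`) gives positive divergence and zero curl, while a rigid rotation (`fx = -y`, `fy = x`) gives the opposite, which is the behavior expected for textures A and E.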

3  Experimental  Results 

A  set  of  image  sequences  representing  both  oriented  tem¬ 
poral  textures  such  as  flowing  water  and  non-oriented 
textures  such  as  leaves  fluttering  in  the  wind  was  dig¬ 
itized.  In  addition,  sequences  representing  uniform  ex¬ 
pansion  and  rotation  of  a  textured  scene  were  obtained. 
These  were  used  in  classification  experiments  utilizing 
the  features  described  above.  Seven  different  texture 
samples,  listed  below,  were  used  for  the  experiments. 

A:  uniformly  expanding  image  produced  by  observer 
motion 

B:  fluttering  of  vertical  paper  bands 

C: cloth waving in the wind

D:  motion  of  tree  leaves  in  a  breeze 

E:  uniformly  rotating  image  produced  by  observer  roll 

F: turbulent motion of water


Representative  examples  are  illustrated  in  Figure  1. 

For  each  sample  texture,  two  image  sequences  consist¬ 
ing  of  16  256x256  pixel  frames  taken  at  30  Hertz  were 
split  into  quadrants  to  obtain  eight  independent  sam¬ 
ple  image  sequences  of  128x128  pixels.  The  normal  flow 
field  was  computed  between  each  consecutive  pair  of  im¬ 
age  frames  using  a  multi-resolution  flow  computation, 
with  the  direction  of  normal  flow  quantized  to  one  of 
eight directions. The end result of the processing was a
sample  of  8  normal  flow  sequences  of  fifteen  frames  each 
for  each  texture. 

Classification experiments were run using a nearest
centroid classifier. More elaborate classifiers could be
used, but the nearest centroid method gives a fairly direct
indication of the utility of the features. The features
used were those described in the previous section,
namely the four feature groups labeled a through d in the list accompanying Table 2.


The  first  four  samples  of  each  texture  are  used  as  a 
training  set  to  compute  the  centroid  of  the  cluster  cor¬ 
responding  to  that  texture  in  the  feature  space.  The 
different  feature  values  are  converted  into  common  units 
by  mapping  the  average  of  the  resulting  centroids  to  a 
unit  vector.  Table  1  contains  the  values  of  these  features 
for  each  flow  sample.  It  can  be  seen  that,  as  desired,  the 
within  sample  variation  is  small  and  the  between  sample 
variation  is  high.  No  single  feature  is  sufficient  to  distin¬ 
guish  all  the  textures,  but  for  each  texture,  there  is  at 
least  one  feature  that  clearly  separates  it  from  the  oth¬ 
ers.  For  example,  as  would  be  expected,  texture  A  con¬ 
taining  an  approaching  object  is  distinguished  by  high 
divergence.  For  texture  B,  containing  moving  vertical 
bands,  the  second  order  difference  feature  in  the  vertical 
direction  clearly  separates  it  from  the  rest. 

The  remaining  four  samples  are  tested  using  a  nearest 
centroid  classification  scheme.  The  results  of  classifica¬ 
tion  are  summarized  in  Table  2.  Note  that  none  of  the 
features  alone  is  sufficient  to  separate  all  the  textures, 
but  the  combination  gives  one  hundred  percent  success 
in  the  classification  of  the  test  cases.  In  fact,  the  sec¬ 
ond  order  features  alone  are  sufficient  in  this  case  for 
successful  classification  in  all  cases. 
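
The classification procedure can be sketched as follows. This is a minimal nearest centroid classifier under stated assumptions: the function names are hypothetical, and the "common units" step is read here as dividing each feature by the mean absolute value of that feature over the class centroids, so that the average centroid maps to a vector of ones. The text does not give the exact normalization, so other readings are possible.

```python
import numpy as np

def fit_centroids(train_features, train_labels):
    """Compute one scaled centroid per class from the training samples."""
    X = np.asarray(train_features, dtype=float)
    y = np.asarray(train_labels)
    classes = np.unique(y)
    centroids = np.stack([X[y == c].mean(axis=0) for c in classes])
    # Assumed normalization: per-feature scale from the average centroid.
    scale = np.abs(centroids).mean(axis=0)
    scale[scale == 0] = 1.0
    return classes, centroids / scale, scale

def classify(features, classes, centroids, scale):
    """Assign each sample to the class with the nearest (Euclidean) centroid."""
    X = np.atleast_2d(np.asarray(features, dtype=float)) / scale
    dists = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
    return classes[np.argmin(dists, axis=1)]
```

In the experiment described here, `train_features` would hold the first four samples of each texture and `classify` would be applied to the remaining four.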


Feature        Correct          Percent
Combination    Classification   Success

all            28               100
c,d            28               100
b,d            24               85
a,c            21               75
d              21               75
c              20               71

Table 2: nearest centroid classification


a: nonuniformity of normal flow direction

b: mean flow magnitude divided by its standard deviation

c: directional difference statistics in four directions

d: positive and negative curl and divergence estimates

Table 1: sample features


Table  3:  first  three  principal  components 


We  also  performed  a  principal  component  analysis  of 
these  features  to  gauge  the  relative  importance  of  dif¬ 
ferent  features  in  producing  the  variation  in  the  sample 
values.  The  first  three  principal  components  of  the  entire 
data set are shown in Table 3. Note that the first principal
component has a high eigenvalue, and relatively
high proportions of the second order features, particularly
positive and negative divergence. This is consistent
with  the  finding  that  the  second  order  features  alone  are 
sufficient  for  classification  in  this  case.  The  principal 
components  within  each  sample  contain  small  absolute 
coefficients  for  the  same  second  order  features,  showing 
that  these  features  are  most  useful  in  classification. 
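
A principal component analysis of a feature table can be sketched with an eigendecomposition of the sample covariance matrix. This is a generic illustration with a hypothetical function name; the text does not say whether the features were rescaled before the analysis, so no rescaling is applied here.

```python
import numpy as np

def principal_components(samples, n_components=3):
    """Return the leading eigenvalues and unit-length component vectors.

    samples: array of shape (n_samples, n_features).
    Components are returned as rows, ordered by decreasing eigenvalue.
    The sign of each component is arbitrary.
    """
    X = np.asarray(samples, dtype=float)
    Xc = X - X.mean(axis=0)                 # center each feature
    cov = np.cov(Xc, rowvar=False)          # feature-by-feature covariance
    eigvals, eigvecs = np.linalg.eigh(cov)  # ascending eigenvalues
    order = np.argsort(eigvals)[::-1][:n_components]
    return eigvals[order], eigvecs[:, order].T
```

The coefficients of each returned row show how strongly each feature contributes to that component, which is how the dominance of the divergence features in the first component would be read off.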

4  Conclusion 

We  have  described  a  method  of  motion  recognition  using 
temporal  textures.  This  technique  uses  statistical  mea¬ 
sures  of  local  motion  features  as  components  of  a  feature 
vector that can be used in standard classification methods.
We identified several motion features that appear
to  have  desirable  properties  for  recognition,  and  illus¬ 
trated  their  utility  in  classifying  a  sample  of  real-world 
temporal  textures.  Future  work  includes  the  analysis  of 
other  feature  classes,  including  purely  temporal  features 
of  the  flow  as  well  as  Fourier  techniques.  We  also  plan 
to  pursue  the  connection  to  qualitative  analysis  of  mo¬ 
tion,  and  to  see  whether  such  qualitative  interpretation 
can  allow  the  technique  to  be  extended  to  the  analysis  of 
more  structured  movements  such  as  those  exhibited  by 
individual people or animals.

Abstract

A  new  approach  to  shape  from  shading  is  described, 
based  on  a  connection  with  a  calculus  of  varia¬ 
tions/optimal  control  problem.  The  approach  leads  nat¬ 
urally  to  an  algorithm  for  shape  reconstruction  that  is 
simple, fast, provably convergent, and, in many cases,
provably  convergent  to  the  correct  solution.  In  contrast 
with  standard  variational  algorithms,  it  does  not  require 
regularization. An explicit representation is given for the
surface;  uniqueness  of  the  reconstruction  (under  suitable 
conditions)  is  an  immediate  consequence.  Given  a  con¬ 
tinuous  image,  the  algorithm  can  be  proven  to  converge 
to  the  continuous  surface  solution  as  the  image  sampling 
frequency  is  taken  to  infinity.  Experimental  results  are 
presented  for  synthetic  and  real  images,  for  general  light¬ 
ing  direction. 

1  Introduction 

Shape from shading has traditionally been considered an
ill-posed problem, with potentially infinitely many different
surfaces corresponding to a shaded image. Therefore,
most algorithms for reconstructing shape have incorporated
regularization techniques to guarantee recovery
of a unique, ‘physically reasonable’ surface solution.

More  recently,  it  was  suggested  that  shape  from  shad¬ 
ing  need  not  be  ill-posed  when  the  image  contains 
singular  points,  i.e.,  maximally  bright  image  points 
[1,4,16,17,14,13].  This  was  shown  for  the  case  of  illu¬ 
mination  from — or  symmetric  around — the  camera  di¬ 
rection  in  [14].  In  addition,  a  general  shaded  image  was 
shown  to  uniquely  determine  shape  under  the  assumed 
lighting  conditions  [14].  Singular  points  provided  the  es¬ 
sential  constraints. 

Singular  points  continue  to  give  strong  constraints  on 
the surface solutions for illumination from a general direction
[13]. Thus, shape from shading should not be
assumed ill-posed in general, and regularization should
be  used  with  caution.  Also,  the  image  of  the  occluding 
boundary  gives  no  useful  constraint  on  surface  recon¬ 
struction  [13].  Singular  points,  therefore,  provide  the 
primary  constraints. 

*This work was supported by the National Science Foundation
under grants IRI-9113690, CDA-8922572, and NSF-DMS-8902333,
and by a grant from DARPA, via TACOM,
contract number DAAE07-91-C-R035.


Nevertheless, shape-from-shading algorithms in the
past have not taken full advantage of the strong constraints
due to singular points. Algorithms based on the
method  of  characteristic  strips  [3]  have  used  these  con¬ 
straints  explicitly,  but  in  an  approximate  way.  These 
algorithms  have  usually  been  applied  to  rather  simple 
images,  and  are  nonrobust  in  the  presence  of  noise. 

Most  recent  algorithms  for  recovering  shape  from 
shading  have  been  based  on  the  variational  approach 
(e.g.,  [6,5,4]).  These  algorithms  have  had  significant  suc¬ 
cesses  on  complex  images,  but  do  not  explicitly  use  the 
singular  point  constraints.  This  is  seen  experimentally 
in  the  fact  that  these  algorithms  do  better  on  images 
with  many  singular  points  than  on  images  with  just  one 
(see  below  and  also  [9]);  yet  for  such  simple  images,  the 
sole  singular  point  is  known  to  directly  and  uniquely  con¬ 
strain  the  surface  reconstruction  [1,17]. 

In  this  paper,  an  algorithm  is  presented  that  takes  full 
advantage  of  the  singular  point  constraints.  It  is  simple, 
fast,  provably  convergent,  and,  in  many  cases,  provably 
convergent  to  the  correct  solution.  In  particular,  if  the 
surface  is  known  to  be  unimodal  at  a  singular  point  in  the 
image  (i.e.,  locally  concave  or  convex  at  this  point),  then 
the  algorithm  provably  reconstructs  the  correct  surface 
in  a  region  around  the  singular  point.  The  algorithm 
is  robust  against  noise  and,  unlike  previous  algorithms, 
does not employ regularization. There is no problem
with  fake  minima,  in  contrast  to  the  standard  variational 
approach.  Finally,  this  approach  is  capable  of  dealing 
with  some  orientation  discontinuities — images  for  which 
the  intensity  function  is  only  piecewise  continuous. 

The algorithm is based on establishing the equivalence
of shape from shading to a calculus of variations/optimal
control  problem.  This  can  be  done  for  illumination  from 
the  camera  direction,  or  for  regions  near  unimodal  sin¬ 
gular  points.  For  the  general  case  with  illumination  from 
an  arbitrary  direction,  the  optimal  control  problem  must 
be  extended  to  a  differential  game.  This  equivalence  fa¬ 
cilitates  the  theoretical  analysis  of  shape  from  shading, 
and  makes  the  algorithm  highly  adaptable.  It  also  gives 
intuition  about  the  convergence  performance  of  the  al¬ 
gorithm.  Below,  we  present  a  simple  uniqueness  proof 
for  shape  from  shading  which  generalises  from  the  local 
uniqueness results of Bruss [1] and Saxberg [17]. This is
possible because, in the optimal control representation,
an expression for the surface corresponding to a shaded
image can be exhibited explicitly. Some of the results
presented  in  this  paper  have  also  been  derived  in  [18]. 

2  Shape  from  Shading  as  a  Problem  of 
Optimal  Control 

The  imaged  surface  is  assumed  to  be  Lambertian,  and 
viewed from above along the $-z$ direction. It is represented
in the explicit form $z(x,y)$, where $z : \mathbb{R}^2 \to \mathbb{R}$
is the height function to be reconstructed. We consider
first the simpler case of illumination along the viewing
direction $-z$ (vertical light). The case of illumination
from  a  general  direction  is  discussed  later. 

Under  these  conditions,  the  image  irradiance  equation 
is: 

$$I(x,y) = \left(1 + |\nabla z(x,y)|^2\right)^{-1/2}. \tag{2.1}$$

It is convenient to rewrite this in the eikonal form:

$$|\nabla z(x,y)|^2 = \frac{1}{I(x,y)^2} - 1 \equiv V(x,y), \tag{2.2}$$

where $I(x,y) \in (0,1]$, $V(x,y) \in [0,\infty)$. This type
of equation arises frequently in the dynamical programming
approach to problems of optimal control. In this
section, we give a heuristic and indirect demonstration
that it is equivalent to a calculus of variations/optimal
control problem. Later, a direct and rigorous derivation
will be given. Other possibilities (e.g. inclusion of a terminal
cost) are also of interest for shape from shading,
and will be discussed below.
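
For a Lambertian surface lit along the viewing direction, the irradiance relation $I = (1 + |\nabla z|^2)^{-1/2}$ can be inverted to give the squared gradient magnitude $V = 1/I^2 - 1$. The sketch below performs that conversion on an intensity array; the function name is hypothetical and intensities are assumed to be normalized so that the brightest possible value is 1.

```python
import numpy as np

def eikonal_potential(intensity):
    """Convert normalized image intensity I in (0, 1] to V = 1/I**2 - 1.

    V equals the squared gradient magnitude of the surface under the
    vertical-light Lambertian model; V = 0 exactly at singular points (I = 1).
    """
    I = np.asarray(intensity, dtype=float)
    if np.any(I <= 0) or np.any(I > 1):
        raise ValueError("intensity must lie in (0, 1]")
    return 1.0 / I**2 - 1.0
```

As a check, the paraboloid $z = (x^2 + y^2)/2$ has $|\nabla z|^2 = x^2 + y^2$, and shading it with the relation above and converting back recovers that value.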

Consider  the  following  control  problem:  a  ‘particle’ 
initially located at $(x_0, y_0)$ moves in the image plane in
response to control parameters $u$, $v$, according to:

$$\dot{x} = u, \quad \dot{y} = v, \quad x(0) = x_0, \quad y(0) = y_0. \tag{2.3}$$

The control parameters are to be chosen to minimize a
cost function for the particle’s trajectory $(x(s), y(s))$:

$$U(x_0, y_0, T) = \inf \left\{ \frac{1}{2} \int_0^T \left( u(s)^2 + v(s)^2 + V(x(s), y(s)) \right) ds \right\}. \tag{2.4}$$

Other control problems could also be considered [18].
In this equation, the minimal cost has been defined as a
function of the trajectory starting point $(x_0, y_0)$. The infimisation
is over all piecewise continuous functions $u(\cdot)$,
$v(\cdot)$ on $[0, T]$. Let

$$U(x,y) = \lim_{T \to \infty} U(x,y,T).$$

$U(\cdot)$ will turn out (in the unimodal case) to be the surface
$z(x, y)$ up to a translation. To show this formally,
we assume that $U(\cdot)$ is a differentiable function of the
starting point, and formally demonstrate using dynamical
programming that it satisfies eq. 2.2.

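Let $\delta T$ be a small time increment. Then:

$$U(x_0, y_0, T) = \inf_{(u,v)} \left\{ U(x(\delta T), y(\delta T), T - \delta T) + \frac{1}{2} \int_0^{\delta T} \left( u(s)^2 + v(s)^2 + V(x(s), y(s)) \right) ds \right\}. \tag{2.5}$$

The explicit infimisation is now over the part of the
trajectory with times in $[0, \delta T]$; the infimisation over the
rest of the trajectory is included in the cost function
$U(\cdot, T - \delta T)$. Since $\delta T$ is small, $U$ can be expanded to
first order in this quantity, which gives (for $\delta T \to 0$):

$$\frac{\partial U}{\partial T}(x,y,T) = \inf_{(u(0),v(0))} \left\{ \frac{1}{2}\left( u^2(0) + v^2(0) + V(x,y) \right) + \frac{\partial U}{\partial x}(x,y,T)\,u(0) + \frac{\partial U}{\partial y}(x,y,T)\,v(0) \right\}. \tag{2.6}$$

Performing the minimization over $u(0)$ and $v(0)$ yields:

$$u(0) = -\frac{\partial U}{\partial x}, \quad v(0) = -\frac{\partial U}{\partial y} \tag{2.7}$$

and

$$\frac{\partial U}{\partial T}(x,y,T) = \frac{1}{2}\left[ V(x,y) - \left(\frac{\partial U}{\partial x}(x,y,T)\right)^2 - \left(\frac{\partial U}{\partial y}(x,y,T)\right)^2 \right]. \tag{2.8}$$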

Suppose  that  the  image  region  under  consideration  is 
a small neighborhood of a singular point, at which $I = 1$
and $V = 0$. A minimal cost trajectory clearly moves toward
regions of smaller $V$, and will converge to the singular
point at which the incremental cost is zero. As the
trajectory converges to this point, the total cost along
the trajectory converges to a finite value. Therefore, the
integration limit $T$ in eq. 2.4 can be taken to infinity,
and $U(x, y)$ is well defined. Since the time derivative
vanishes, $U(x,y)$ satisfies:

$$\left(\frac{\partial U}{\partial x}\right)^2 + \left(\frac{\partial U}{\partial y}\right)^2 = V(x,y). \tag{2.9}$$

Since this is just the image irradiance equation, eq. 2.2,
$U(\cdot)$ can be identified with $z(x, y)$. Also, $u$ and $v$ can be
identified with $-p$ and $-q$, respectively, from eq. 2.7,
where $p = \partial z/\partial x$ and $q = \partial z/\partial y$ are the surface gradient components.
Thus the minimal cost trajectories are curves of steepest
descent, and are just the characteristic strips [3,13].

Note also that $U \ge 0$, and that $U = 0$ only at the singular
point. Thus, the solution that is locally concave
at the singular point has been automatically selected by
this formulation; the solution locally convex at the singular
point is just its negative. The solution $U$ is unique,
since it is equal to the infimum of the cost, which must
be unique. Since an infimum of the cost always exists,
the solution $U$ always exists. It must be continuous, but
need not be differentiable.

This formulation also gives a way of computing $U$.
Clearly, $U(x,y,0) = 0$ for all $(x, y)$, while $U(x, y, T)$ is
monotonically increasing in time: extending a trajectory
cannot result in a reduced cost. Therefore, by solving eq.
2.8 iteratively in time, with initial condition $U(\cdot,T) = 0$
at $T = 0$, a sequence of functions $U(x,y,T)$ is obtained
which at every point converges monotonically upward to
$z(x, y)$ as $T \to \infty$. Because this convergence is pointwise
monotonic, it is clearly stable.

For the actual implementation, an iterative procedure
is used that is justified by its exact relation to a discretized
control problem [7]. The continuous image plane
is replaced by an image discretized into pixels, and the
trajectory described by eq. 2.3 is approximated by a
Markov process. This is described in detail in the next
section. It can be shown that this gives a discrete approximation
$U^h$ to $U$, which converges to the continuous
$U$ as the spatial grid size $h$ approaches zero [11]. However,
a naive discretization of eq. 2.8 does not necessarily
give a stably convergent algorithm.

3  Algorithm  Description 

A  more  detailed  description  of  the  algorithm  and  its 
derivation  is  now  presented.  We  consider  a  control  prob¬ 
lem  defined  on  the  discrete  grid  of  pixels  and  chosen  to 
approximate the continuous calculus of variations problem
described above; $h$ is the pixel spacing. For the
discrete  case,  a  ‘particle’  trajectory  is  a  sequence  of  dis¬ 
crete  jumps  between  grid  sites — a  poor  approximation  to 
a  continuous  trajectory.  In  order  to  better  approximate 
a  continuous  trajectory  on  a  discrete  grid,  an  element  of 
randomness  is  introduced. 

The control problem is as follows: a ‘particle’, with initial
image plane location $\phi_0 = (i_0, j_0)h$, jumps between
neighboring pixel sites in response to control parameters
$C = (u(k), v(k))$, where $k$ indicates the time step. (A 4-neighborhood
is assumed.) The jumps are probabilistic,
but it is required that on average

$$\langle \phi(k+1) - \phi(k) \rangle = \Delta t \, C(k), \tag{3.10}$$

in analogy with eq. 2.3. Here $\Delta t$ is the time increment
from time step $k$ to step $k+1$, and $\langle \cdot \rangle$ denotes the noise
average. Let $\eta(k)$ be the random vector representing
the jump at time $k$: $\eta = \phi(k+1) - \phi(k)$. The jump
probabilities are assigned as follows. When $u = v = 0$,
$P(\eta = 0) = 1$, with all other probabilities zero. In this
case, $\Delta t$ is arbitrarily chosen to be 1. Otherwise,

$$P(\eta = (\mathrm{sgn}(u), 0)h) = \Delta t\,|u|/h = |u|/(|u| + |v|),$$
$$P(\eta = (0, \mathrm{sgn}(v))h) = \Delta t\,|v|/h = |v|/(|u| + |v|), \tag{3.11}$$

again with all other probabilities zero. Eq. 3.11 implies
that the particle jumps by one lattice site at each iteration
(for a nonzero control), and therefore moves on the
lattice with maximum speed, causing the algorithm to
converge quickly. This has been achieved by taking $\Delta t$
to depend explicitly on the controls as $\Delta t = h/(|u| + |v|)$.
It is clear that eq. 3.11 implies eq. 3.10.
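
The jump rule can be checked numerically. The sketch below (with hypothetical function names) builds the jump distribution for a given control $(u, v)$ using $\Delta t = h/(|u| + |v|)$ and the probabilities $|u|/(|u| + |v|)$ and $|v|/(|u| + |v|)$, then computes the mean jump so that it can be compared with $\Delta t\,(u, v)$ as required by eq. 3.10.

```python
import numpy as np

def jump_distribution(u, v, h=1.0):
    """Return (dt, jumps, probs) for one step of the controlled chain.

    For zero control the particle stays put and dt is set to 1.
    Otherwise it moves one lattice site along x or along y.
    """
    if u == 0 and v == 0:
        return 1.0, [(0.0, 0.0)], [1.0]
    speed = abs(u) + abs(v)
    dt = h / speed
    jumps = [(float(np.sign(u)) * h, 0.0), (0.0, float(np.sign(v)) * h)]
    probs = [abs(u) / speed, abs(v) / speed]
    return dt, jumps, probs

def mean_jump(u, v, h=1.0):
    """Return dt and the expected jump vector for control (u, v)."""
    dt, jumps, probs = jump_distribution(u, v, h)
    mean = sum(p * np.array(j) for p, j in zip(probs, jumps))
    return dt, mean
```

For example, with $u = 3$, $v = -1$ and $h = 0.5$ the time step is 0.125 and the mean jump is $(0.375, -0.125)$, which equals $\Delta t\,(u, v)$.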

The analog to the calculus of variations problem is as
follows: choose the control parameters to minimise the
expected cost for the discrete trajectories:

$$U^h(\phi, K) = \inf \left\{ E\left[ \sum_{k=0}^{K} \frac{1}{2} \Delta t(u(k), v(k)) \left( u(k)^2 + v(k)^2 + V(\phi(k)) \right) \right] \right\}, \tag{3.12}$$

where the infimisation is over all nonanticipative control
sequences $\{(u(k), v(k)), k = 0, \ldots, K\}$ (i.e., controls which
do not depend on the future history of the particle) [7].
It can be shown that the value function for this discrete
control problem converges to the continuous value function
as the grid spacing is taken to zero [11].


A dynamical programming equation can be derived for
this control problem as in the previous section:

$$U^h(\phi_0, K) = \inf_{(u(0), v(0))} \left\{ \frac{1}{2} \Delta t(u(0), v(0)) \left( u(0)^2 + v(0)^2 + V(\phi_0) \right) + E\left[ U^h(\phi(1), K-1) \right] \right\}. \tag{3.13}$$

The expectation in this equation is easily calculated from
eq. 3.11; for nonzero controls it is

$$\frac{1}{|u| + |v|} \left( |u|\, U^h(\phi_0 + (\mathrm{sgn}(u), 0)h, K-1) + |v|\, U^h(\phi_0 + (0, \mathrm{sgn}(v))h, K-1) \right).$$

Performing the minimisation in eq. 3.13 is slightly
complicated since the cases with $C$ in different quadrants
must be treated separately. Eventually, the following
algorithm is obtained. Define

$$U_1 = \min\left( U^h(\phi \pm (1, 0)h, K) \right),$$
$$U_2 = \min\left( U^h(\phi \pm (0, 1)h, K) \right),$$

and let $D = U_2 - U_1$. Further, let

$$T = \begin{cases} \frac{1}{2}\left( \sqrt{2h^2 V(\phi) - D^2} + U_1 + U_2 \right) & \text{if } h^2 V(\phi) > D^2, \\ h V^{1/2}(\phi) + \min_i (U_i) & \text{otherwise.} \end{cases} \tag{3.14}$$

The lower case corresponds to the minimum in eq. 3.13
being realised on one of the axes in the $u$-$v$ plane (with
the origin excluded); the upper corresponds to an off-axis
minimum. The final update equation is

$$U^h(\phi, K+1) = \min\left( T, V(\phi)/2 + U^h(\phi, K) \right), \tag{3.15}$$

where the second term accounts for the case of zero controls.
As in the previous section, the initial value for
$U^h(\cdot, 0)$ should be taken as 0. Since the expected cost for
an optimal trajectory cannot decrease with time, an iterative
solution $U^h(\phi, K)$ to the above equation increases
monotonically at every point. This algorithm is more
efficient than the one previously reported in [12] because
the time increment $\Delta t$ is adjusted optimally as a function
of the controls.

To avoid indeterminacy, it is necessary to impose the
boundary condition that no trajectory exits the image,
as is easily done [12]; the significance of this is discussed
below. Then, assuming that there are singular points in
the image where $V = 0$, as $K \to \infty$ all optimal trajectories
must converge to the singular points. Thus
$U^h(\phi, K)$ converges monotonically as $K \to \infty$ to a solution
$z(\phi) = \lim_{K \to \infty} U^h(\phi, K)$; in fact, convergence occurs in a finite
number of iterations [11]. This solution satisfies a
discretised version of the shape-from-shading equation
[12], and is always nonnegative, since the summand in
eq. 3.12 is. Also, $z = 0$ at a singular point: a trajectory
beginning at a singular point achieves minimal cost by
remaining there, since $V = 0$ at the point. Thus, $z$ attains
a local minimum at a singular point. Also, as in
eq. 2.7, the expected optimal trajectories are approximate
curves of steepest descent [12].

The algorithm described in this section is appropriate
for unimodal images, that is, images containing just one singular
point where the height has either a local minimum
or maximum. For these images, the iterative solution of
eq. 3.15 will correctly reconstruct the original surface
at all points where this surface is theoretically determined
[13], that is, at all points connected by a steepest
descent curve on the original surface to the singular
point.  Such  points  “learn”  their  height  from  the  singu¬ 
lar  point.  In  contrast,  at  other  image  points  the  surface 
reconstruction  can  be  ambiguous  [13].  These  ambiguous 
points  lie  on  steepest  descent  curves  that  exit  the  image 
rather  than  terminating  at  the  singular  point.  Imposing 
the  boundary  condition  as  above  that  no  trajectory  exits 
the  image  only  affects  the  surface  reconstruction  at  these 
ambiguous  points.  Our  algorithm  does  not  necessarily 
reproduce  the  original  surface  at  ambiguous  points.  A 
modified  algorithm  appropriate  for  the  multimodal  case 
(many  singular  points)  is  described  in  the  next  section. 
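
The iteration of this section can be sketched as a Jacobi sweep over the pixel grid. The sketch below is an illustration under stated assumptions, not the authors' code: the function names are hypothetical, the boundary condition that no trajectory exits the image is imposed by treating pixels outside the image as having infinite cost, and the off-axis value of $T$ is taken as $\frac{1}{2}(\sqrt{2h^2V - D^2} + U_1 + U_2)$ when $h^2 V > D^2$, which is what minimizing eq. 3.13 over both controls gives. Otherwise $T = h\sqrt{V} + \min(U_1, U_2)$, and the update is $\min(T, V/2 + U)$ as in eq. 3.15.

```python
import numpy as np

def sfs_update(U, V, h):
    """One Jacobi sweep: U(., K) -> U(., K+1) for every pixel in parallel."""
    P = np.pad(U, 1, constant_values=np.inf)     # no trajectory exits the image
    U1 = np.minimum(P[1:-1, :-2], P[1:-1, 2:])   # cheaper horizontal neighbor
    U2 = np.minimum(P[:-2, 1:-1], P[2:, 1:-1])   # cheaper vertical neighbor
    D = U2 - U1
    off_axis = h * h * V > D * D
    root = np.sqrt(np.maximum(2.0 * h * h * V - D * D, 0.0))
    T = np.where(off_axis,
                 0.5 * (root + U1 + U2),             # off-axis minimum
                 h * np.sqrt(V) + np.minimum(U1, U2))  # minimum on an axis
    return np.minimum(T, 0.5 * V + U)            # second term: zero control

def reconstruct(V, h, max_iters=10000):
    """Iterate from U = 0 until the surface estimate stops changing.

    V must be a 2-D array with at least two rows and two columns.
    """
    V = np.asarray(V, dtype=float)
    U = np.zeros_like(V)
    for _ in range(max_iters):
        U_next = sfs_update(U, V, h)
        if np.array_equal(U_next, U):
            break
        U = U_next
    return U
```

On a synthetic paraboloid $z = (x^2 + y^2)/2$, for which $V = x^2 + y^2$ with a single singular point at the origin, the iteration should settle to a surface that is zero at the origin and agrees with $z$ up to a discretization error of order $h$.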

4  Modifications  of  the  Algorithm 

The algorithm of eq. 3.15 is of the Jacobi type, with
the surface updated everywhere in parallel at each iteration.
It can be shown that the algorithm also converges
if implemented via Gauss-Seidel, with updated surface
estimates used as soon as they are available [11]. Our experiments
show that this produces a significant speedup.

A more important modification introduces a terminal
cost term into the cost function. This gives an algorithm
capable of dealing with multimodal images. Including
this term, the minimal cost is (compare eq. 3.12):

$$U^h(\phi, K) = \inf \left\{ E\left[ \sum_{k=0}^{K} \frac{1}{2} \Delta t(u(k), v(k)) \left( u(k)^2 + v(k)^2 + V(\phi(k)) \right) + g(\phi(K)) \right] \right\}. \tag{4.16}$$

The terminal cost, $g(\phi(K))$, introduces a penalty term
for a trajectory stopping at the position $\phi(K)$. It causes
an  optimal  trajectory  to  not  remain  in  regions  of  high 
terminal  cost,  and  converge  instead  to  points  of  low  ter¬ 
minal  cost.  This  can  dramatically  improve  the  conver¬ 
gence  speed  of  the  algorithm.  Also,  the  terminal  cost 
can  be  used  to  distinguish  between  singular  points  of 
different  type  (concave,  convex,  saddle).  In  the  final  sur¬ 
face  solution,  only  a  concave-type  singular  point  should 
be  the  terminus  for  optimal  trajectories,  since  these  are 
descending  curves.  By  placing  a  high  terminal  cost  at 
other  singular  points,  trajectories  can  be  prohibited  from 
terminating  at  these  points.  Then  the  surface  solution 
will  only  be  “learned”  from  the  concave  singular  points. 
Also,  if  the  heights  of  the  concave  singular  points  are 
known,  e.g.,  using  stereo,  then  this  can  be  specified  in  the 
algorithm  by  setting  the  terminal  costs  at  these  points 
equal  to  their  heights.  Since  the  singular  points  are  dis¬ 
tinctive,  it  is  likely  that  their  heights,  and  the  local  na¬ 
ture  of  the  surface,  can  be  determined  easily  from  stereo. 

The dynamic programming equation corresponding
to the cost in eq. 4.16 is exactly the same as eq. 3.15,
as is easily seen. The algorithm differs only in the initial
condition for $U^n$: clearly, $U^0$ should be set initially
to $g(\phi)$, not 0 as before. Thus, from the optimal control
viewpoint, the choice of initial values for $U^n$ in the
algorithm has a concrete and intuitive interpretation.
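
The role of the terminal cost as an initial condition can be sketched on a one-dimensional toy profile. This is an illustration under stated assumptions, not the algorithm of eq. 3.15: the input is taken to be only the magnitude of the height difference between neighbouring sites (a stand-in for what the intensity encodes), the terminal cost equals the known height at the local minima and is infinite elsewhere, and the function name `reconstruct` is hypothetical.

```python
import math

def reconstruct(slope_mag, terminal_cost, max_sweeps=1000):
    """Jacobi-style minimum-cost iteration started from U^0 = g.

    slope_mag[i] is the assumed local cost of moving between sites i and i+1.
    terminal_cost[i] is g at site i (known height at a minimum, inf elsewhere).
    """
    n = len(terminal_cost)
    U = list(terminal_cost)                  # initial condition: U^0 = g
    for sweep in range(1, max_sweeps + 1):
        old = U[:]
        for i in range(n):
            if i > 0:
                U[i] = min(U[i], old[i - 1] + slope_mag[i - 1])
            if i < n - 1:
                U[i] = min(U[i], old[i + 1] + slope_mag[i])
        if U == old:                         # fixed point reached
            return U, sweep
    raise RuntimeError("no convergence within max_sweeps")

# Toy height profile with two local minima (indices 1 and 5).
z = [3.0, 1.0, 2.0, 4.0, 2.0, 0.0, 1.0, 5.0]
slope_mag = [abs(b - a) for a, b in zip(z, z[1:])]   # sign of the slope is discarded
minima = [1, 5]
g = [z[i] if i in minima else math.inf for i in range(len(z))]

U, sweeps = reconstruct(slope_mag, g)
print(U, sweeps)
```

Every site "learns" its height from whichever minimum gives the cheaper path, so the iteration returns the original profile even though only unsigned slopes and the heights at the minima were supplied.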

5  Proof  of  Equivalence 

In this section we will assume the situation of vertical
light,  as  described  in  Section  2.  For  this  case  (and  under 
suitable  assumptions)  the  height  function  has  a  repre¬ 
sentation  in  terms  of  an  associated  calculus  of  variations 
problem.  For  the  general  case  of  oblique  light  there  is  a 
representation  in  terms  of  a  differential  game  [11]. 

The data available for the determination of the function
$z(\cdot)$ is encoded in the intensity function $I(x, y)$
determined by eq. 2.1. $I$ is well defined at all points $(x, y)$
where $z(\cdot)$ is differentiable. We will always assume that
the function $I(\cdot)$ is defined on a bounded open set of
the form $G = \bigcap_{i=1}^{N} G_i$, $N < \infty$, where each $G_i$ has a
boundary $\partial G_i$. Let $n_i(x, y)$ denote the inward normal
to $G_i$ at $(x, y) \in \partial G_i$. First consider the following
situation.

Assumption 5.1 1. $z(\cdot)$ is $C^1$ on $G$.

2. There is exactly one point $(\bar x, \bar y)$ such that
$\nabla z(\bar x, \bar y) = 0$.

3. $(\bar x, \bar y)$ is a local minimum.

4. $\nabla z(x, y) \cdot n_i(x, y) < 0$ whenever $(x, y) \in \partial G \cap \partial G_i$.

(4) implies that the steepest descent direction is always
inward on the boundary. We next define a calculus of
variations problem. Fix $(x, y) \in G$, and set

$$U(x, y) = \inf \int_0^{\tau} L(\phi(s), \dot\phi(s))\, ds. \qquad (5.17)$$

Here $\tau = \inf\{t : \phi(t) = (\bar x, \bar y)\}$, and the infimum is
over all piecewise continuously differentiable paths $\phi :
[0, \infty) \to G$ that satisfy $\phi(0) = (x, y)$. The variational
integrand $L(\cdot)$ is given by

$$L((x, y), (u, v)) = \tfrac{1}{2}(u^2 + v^2) + \tfrac{1}{2}\,|\nabla z(x, y)|^2,$$

where for vertical light $|\nabla z|^2 = 1/I^2 - 1$ is determined by the intensity.
We follow the usual convention of defining $\inf \emptyset = +\infty$.
Thus if $\phi(t) \neq (\bar x, \bar y)$ for all $t$, then $\tau = +\infty$.

Theorem 5.2 Under the conditions of Assumption 5.1
we have

$$z(x, y) - z(\bar x, \bar y) = U(x, y).$$

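Proof. Let $\phi(\cdot)$ be any piecewise continuously differentiable
path that starts at $(x, y)$. For all $\epsilon > 0$ define

$$\tau^{\epsilon} = \inf\{t : |\phi(t) - (\bar x, \bar y)| \le \epsilon\}.$$

Fix $\delta > 0$ and choose $\epsilon > 0$ such that

$$z(x, y) \le z(\bar x, \bar y) + \delta$$

for $|(x, y) - (\bar x, \bar y)| \le \epsilon$.

To prove $z(x, y) - z(\bar x, \bar y) \le U(x, y)$, we consider two
cases. First assume $\tau^{\epsilon} = +\infty$. By Assumption 5.1 there
exists $c > 0$ such that

$$L((x, y), (u, v)) \ge c$$

for all $(x, y)$ satisfying $|(x, y) - (\bar x, \bar y)| \ge \epsilon$ and all $(u, v) \in
\mathbb{R}^2$. Thus, in such a case

$$\int_0^{\tau} L(\phi(s), \dot\phi(s))\, ds = \int_0^{\infty} L(\phi(s), \dot\phi(s))\, ds = +\infty.$$

Next assume $\tau^{\epsilon} < \infty$. By the chain rule,

$$\frac{d}{dt} z(\phi(t)) = \nabla z(\phi(t)) \cdot \dot\phi(t)$$

almost everywhere in $t$. Therefore

$$\begin{aligned}
z(x, y) - z(\bar x, \bar y) &\le z(x, y) - z(\phi(\tau^{\epsilon})) + \delta \\
&= \int_0^{\tau^{\epsilon}} -\nabla z(\phi(t)) \cdot \dot\phi(t)\, dt + \delta \\
&\le \int_0^{\tau^{\epsilon}} L(\phi(t), \dot\phi(t))\, dt + \delta \\
&\le \int_0^{\tau} L(\phi(t), \dot\phi(t))\, dt + \delta.
\end{aligned}$$

Sending $\delta \to 0$ we obtain $z(x, y) - z(\bar x, \bar y) \le U(x, y)$.

To prove $z(x, y) - z(\bar x, \bar y) \ge U(x, y)$, let $\phi(\cdot)$ be a solution
(note that there may not be uniqueness since $z$ is
only assumed $C^1$) to the equation

$$\dot\phi(t) = -\nabla z(\phi(t)), \qquad \phi(0) = (x, y).$$

By Assumption 5.1 $\phi(\cdot)$ never touches $\partial G$ for $t > 0$, and
therefore the solution is well defined for all $t \ge 0$. Let
$\tau = \inf\{t : \phi(t) = (\bar x, \bar y)\}$ and let $a \wedge b$ denote the smaller
of $a$ and $b$. For any $t < \infty$, we have

$$\begin{aligned}
z(x, y) - z(\bar x, \bar y) &\ge z(\phi(0)) - z(\phi(t \wedge \tau)) \\
&= -\left[ z(\phi(t \wedge \tau)) - z(\phi(0)) \right] \\
&= \int_0^{t \wedge \tau} -\nabla z(\phi(s)) \cdot \dot\phi(s)\, ds \\
&= \int_0^{t \wedge \tau} |\nabla z(\phi(s))|^2\, ds \\
&= \int_0^{t \wedge \tau} L(\phi(s), \dot\phi(s))\, ds.
\end{aligned}$$

Sending $t \to \infty$ we conclude that

$$z(x, y) - z(\bar x, \bar y) \ge \int_0^{\tau} L(\phi(s), \dot\phi(s))\, ds \ge U(x, y).$$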


The  solution  to  the  calculus  of  variations  problem 
uniquely  identifies  the  height  function  up  to  an  over¬ 
all  translation  in  z.  This  ambiguity  can  be  removed  by 
specifying $z(\bar x, \bar y)$.
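
A small numerical check can make the equality concrete. The sketch below assumes a particular test surface, the paraboloid $z = x^2 + y^2$ with its single minimum at the origin, and assumes the integrand has the form $\tfrac12(u^2+v^2) + \tfrac12|\nabla z|^2$; the function names are hypothetical. It integrates the cost along the steepest descent path, where it should approach $z(x,y) - z(0,0)$, and along a straight constant-speed path, where it should be larger.

```python
def z(x, y):
    """Assumed test surface: paraboloid with its minimum at the origin."""
    return x * x + y * y

def grad_z(x, y):
    return 2.0 * x, 2.0 * y

def L(x, y, u, v):
    """Assumed integrand: kinetic term plus half the squared gradient."""
    gx, gy = grad_z(x, y)
    return 0.5 * (u * u + v * v) + 0.5 * (gx * gx + gy * gy)

def descent_cost(x, y, dt=1e-3, t_end=10.0):
    """Cost along the steepest descent path, by forward Euler integration."""
    cost = 0.0
    for _ in range(int(t_end / dt)):
        gx, gy = grad_z(x, y)
        u, v = -gx, -gy                    # velocity = minus the gradient
        cost += L(x, y, u, v) * dt
        x, y = x + u * dt, y + v * dt
    return cost

def line_cost(x0, y0, T, n=10000):
    """Cost of a straight path reaching the origin at constant speed in time T."""
    dt = T / n
    u, v = -x0 / T, -y0 / T
    cost = 0.0
    for k in range(n):
        s = (k + 0.5) * dt                 # midpoint rule
        x, y = x0 * (1.0 - s / T), y0 * (1.0 - s / T)
        cost += L(x, y, u, v) * dt
    return cost

if __name__ == "__main__":
    x0, y0 = 1.0, 0.5
    print("z(x0, y0) - z(0, 0):", z(x0, y0))
    print("steepest descent cost:", descent_cost(x0, y0))
    print("straight-line cost (T = 1):", line_cost(x0, y0, 1.0))
```

The descent cost agrees with the height difference up to the discretisation error of the Euler scheme, which shrinks with `dt`.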

We  next  consider  a  more  general  situation  involving 
more  than  one  stationary  point.  Let  M  be  the  set  of 
local minima of $z(\cdot)$.

Assumption 5.3 1. $z(\cdot)$ is $C^1$ on $G$.

2. The value of $z(\cdot)$ is known on $M$.

3. $\nabla z(x, y) \cdot n_i(x, y) < 0$ whenever $(x, y) \in \partial G \cap \partial G_i$.


Define the terminal cost function

$$g(x, y) = \begin{cases} z(x, y) & \text{if } (x, y) \in M, \\ +\infty & \text{otherwise.} \end{cases}$$

Consider the calculus of variations problem

$$U(x, y) = \inf \left[ \int_0^{\tau} L(\phi(s), \dot\phi(s))\, ds + g(\phi(\tau)) \right]. \qquad (5.18)$$

Here the infimum is over all $\tau < \infty$ and absolutely continuous
paths $\phi : [0, \tau] \to G$ that satisfy $\phi(0) = (x, y)$.
Unlike the case of a single stationary point, it is necessary
that a terminal cost be included in order to guarantee
that trajectories do not get “stuck” at stationary
points that are not local minima.

We  have  the  following  result  for  this  case. 

Theorem 5.4 Under the conditions of Assumption 5.3
we have

$$z(x, y) = U(x, y).$$

Remarks on the proof. The proof is very similar to
that of Theorem 5.2 and will only be sketched. Consider
any path $\phi(\cdot)$ which starts at $(x, y)$ and for which the
cost

$$\int_0^{\tau} L(\phi(s), \dot\phi(s))\, ds + g(\phi(\tau)) \qquad (5.19)$$

is finite. Boundedness of the cost implies $\phi(\tau) \in M$.
Suppose $\phi(\tau) = (\bar x, \bar y)$. The proof of Theorem 5.2 then
shows that $z(x, y) - z(\bar x, \bar y) \le \int_0^{\tau} L(\phi(s), \dot\phi(s))\, ds$.
Together with the definition of $g(\cdot)$ this implies $z(x, y) \le
U(x, y)$.

Next consider the reverse inequality. As in the proof
of Theorem 5.2, we would like to construct a particular
path $\phi(\cdot)$ that starts at $(x, y)$ so that the cost (5.19) is
arbitrarily close to $z(x, y)$. We first note that by a perturbation
argument [13,10,15] we can assume that there
are at most finitely many points such that $\nabla z(x, y) = 0$.
It can be shown that there exists a dense subset $D$ of $G$
with the property that whenever the path $\phi(\cdot)$ satisfies

$$\dot\phi(t) = -\nabla z(\phi(t)), \qquad \phi(0) \in D,$$

then $\phi(t)$ converges to a local minimum $(\bar x, \bar y)$ of $z(\cdot)$ as
$t \to \infty$. Using the argument of Theorem 5.2 and the
fact that $z(\phi(t))$ is nonincreasing we conclude $z(x, y) \ge
U(x, y)$ for $(x, y) \in D$. By continuity of both $z(\cdot)$ and
$U(\cdot)$ (which is easy to prove) we have $z(x, y) \ge U(x, y)$
for $(x, y) \in G$. ■

Previous uniqueness proofs [1,17,14] assumed that
$z(x, y)$ was at least $C^2$; here $z$ is only assumed $C^1$. A fortiori,
no conditions are placed on the second derivatives
of the intensity; in particular, the singular points are not
required to be “good” or “nondegenerate” [17,14].

6  Illumination  from  a  General 
Direction 


For a Lambertian surface, the image irradiance equation
for the intensity is:

$$I(x, y) = \mathbf{L} \cdot \frac{(-z_x, -z_y, 1)}{(1 + z_x^2 + z_y^2)^{1/2}},$$

where $\mathbf{L}$ is a unit vector giving the light source direction,
and $z_x, z_y$ are partial derivatives of the height. For
simplicity and w.l.o.g., we take the x-component of $\mathbf{L}$
to be zero. After some algebra, this equation may be
rewritten as:

$$I^2 z_x^2 + J z_y^2 + 2 L_y L_z z_y + (I^2 - L_z^2) = 0,$$

with $J(x, y) = I^2(x, y) - L_y^2$.

Define a new variable $\tilde z = (x, y, z) \cdot \mathbf{L}$ measuring the
“height” along the light direction rather than the viewer
direction $z$. This is done so that the local cost at singular
points will be zero, like $V$ previously, causing optimal
trajectories to terminate at these points. Then

$$\tilde z_x = L_z z_x, \qquad \tilde z_y = L_y + L_z z_y.$$

Substituting in the previous equation yields

$$I^2 \tilde z_x^2 + J \tilde z_y^2 + 2(1 - I^2) L_y \tilde z_y - (1 - I^2) = 0.$$

$J$, the coefficient of $\tilde z_y^2$, is positive in an image region
$B$ that includes the singular points. When $I^2 = L_y^2$,
the angle between the surface normal and $\mathbf{L}$ is large
enough that it may correspond to a point on the occluding
boundary.
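
The algebra behind the change of variable can be checked numerically. The sketch below assumes the surface normal is proportional to $(-z_x, -z_y, 1)$ and that the light direction is $(0, L_y, L_z)$ with unit length; the function name `residuals` is hypothetical. For arbitrary slopes it evaluates the left-hand side of the equation in the original height variable and of the equation in the height measured along the light direction; both should vanish up to rounding.

```python
import math

def residuals(zx, zy, Ly):
    """Left-hand sides of the two quadratic equations for given slopes.

    Assumptions: light direction (0, Ly, Lz) of unit length and a
    Lambertian intensity I = L . (-zx, -zy, 1) / sqrt(1 + zx^2 + zy^2).
    """
    Lz = math.sqrt(1.0 - Ly * Ly)
    I = (-Ly * zy + Lz) / math.sqrt(1.0 + zx * zx + zy * zy)
    J = I * I - Ly * Ly

    # Equation in the original height variable z.
    eq_z = I * I * zx * zx + J * zy * zy + 2.0 * Ly * Lz * zy + (I * I - Lz * Lz)

    # Derivatives of the height measured along the light direction.
    tx = Lz * zx
    ty = Ly + Lz * zy

    # Equation after the change of variable.
    eq_t = (I * I * tx * tx + J * ty * ty
            + 2.0 * (1.0 - I * I) * Ly * ty - (1.0 - I * I))
    return eq_z, eq_t

if __name__ == "__main__":
    for zx, zy in [(0.3, -0.7), (0.0, 0.0), (-1.2, 0.4)]:
        print(zx, zy, residuals(zx, zy, Ly=0.488))
```

The value `Ly = 0.488` is borrowed from the light direction quoted for the real image in Section 7; any value with $|L_y| < 1$ would serve.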

In the image region $B$, we consider a control problem
analogous to that of Section 2: a ‘particle’ initially
located at $(x, y)$ is controlled using the parameters $u, v$:

$$\dot x = I^2(x, y)\, u, \qquad \dot y = J(x, y)\, v - (1 - I^2(x, y)) L_y.$$

$u, v$ are chosen to infimise a cost function for the particle’s
trajectory

$$U(x, y, T) = \inf_{(u, v)} \frac{1}{2} \int_0^T ds\, \bigl(I^2 u^2 + J v^2 + (1 - I^2(x, y))\bigr). \qquad (6.20)$$

As before the integrand is nonnegative throughout the
region $B$, and the local cost $1 - I^2$ vanishes at singular
points. Eq. 2.4 for the vertical-light case $L_y = 0$ can be
recovered by dividing the above equation by $I^2$.

This  control  problem  is  essentially  equivalent  to  the 
one  previously  considered,  and  results  similar  to  those  of 
the  previous  sections  are  easily  obtainable.  In  particular, 
by  a  Schwarz  inequality  argument  as  in  Section  5, 


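$$-\nabla \tilde z(\phi(s)) \cdot \dot\phi(s) = -\tilde z_x I^2 u - \tilde z_y \bigl(J v - (1 - I^2) L_y\bigr) \le \frac{1}{2} \bigl(I^2 u^2 + J v^2 + 1 - I^2\bigr),$$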


which  is  just  the  integrand  of  eq.  6.20.  This  gives  the 
necessary  generalisation  for  the  rigorous  proof  of  equiva¬ 
lence.  Similarly,  an  algorithm  can  be  defined  in  the  same 
way  as  before,  and  will  recover  the  correct  solution  near 
concave  (or  convex)  singular  points. 

In the image region where $I^2 - L_y^2 < 0$, the optimal
control representation of the problem no longer suffices.
Instead,  there  is  a  representation  in  terms  of  a  differential 


game  (see  e.g.  [2]).  However,  it  is  a  particularly  simple 
one,  in  which  the  opposing  controllers  effectively  direct 
the  ‘particle’  motion  in  orthogonal  directions,  and  where 
the  cost  also  splits  into  a  sum  of  terms  depending  on  the 
different  control  parameters.  Thus,  the  Isaacs  condition 
and  the  existence  of  a  “value”  follow. 

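The ‘particle’ dynamics for the differential game is:

$$\dot x = I^2 u, \qquad \dot y = J \bigl(\theta(J) v_1 + \theta(-J) v_2\bigr) - (1 - I^2) L_y,$$

where

$$\theta(x) = \begin{cases} 1 & \text{if } x > 0, \\ 0 & \text{if } x < 0. \end{cases}$$

The player associated with $u$ and $v_1$ seeks to minimise
the value function of the game, while the $v_2$ player seeks
to maximise it. The value that opposing players attempt
to control is

$$\frac{1}{2} \int_0^T ds\, \bigl(I^2 u^2 + J (\theta(J) v_1^2 + \theta(-J) v_2^2) + (1 - I^2)\bigr).$$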

A  precise  description  of  the  differential  game  is  some¬ 
what  technical  (see  e.g.  [2]).  Here  we  simply  note  that 
the  properly  defined  value  gives  the  height  function  (un¬ 
der  suitable  conditions),  and  that  an  algorithm  on  a  dis¬ 
crete  grid  for  approximating  this  value  function  can  be 
derived that is similar to the vertical-light algorithm.

7  Experiments 


Figure 1 displays a 32 by 32 parabolic surface
which  is  assumed  to  be  imaged  from  above.  The  image 
has  one  singular  point.  Assuming  vertical  light,  the  im¬ 
age  intensity  was  first  computed  using  the  discretisation 
of  the  derivative  implicit  in  eq.  3.15  [12].  With  this 
choice,  the  original  surface  is  a  fixed  point  of  the  algo¬ 
rithm  and  should  be  reconstructed  exactly.  Using  Jacobi 
updates,  the  algorithm  converged  to  the  correct  solution 
to  within,  on  average,  one  part  in  10^  after  63  iterations. 
In  general,  the  convergence  time  is  expected  to  be  on 
the  order  of  the  maximum  length  of  an  optimal  trajec¬ 
tory.  Since  from  eq.  3.11  an  optimal  trajectory  jumps 
one  lattice  site  per  iteration,  when  the  number  of  it¬ 
erations  becomes  greater  than  the  maximum  trajectory 
length,  then  all  image  points  are  able  to  “learn”  their 
heights from the singular points. For the given surface,
the  maximal  trajectory  length  is  on  the  order  of  32,  since 
trajectories  starting  at  the  image  corners  must  zigzag  to 
the  singular  point  at  the  center  of  the  image. 

Convergence  using  Gauss-Seidel  updating  was  faster: 
it  was  obtained  after  just  17  iterations.  The  states  were 
updated  in  a  spiral  pattern  outward  from  the  singular 
point,  since  this  reflects  the  approximate  information 
flow.  The  points  near  the  singular  point  learn  their 
height  early,  and  this  information  can  then  be  used  in 
determining  the  heights  of  more  distant  points. 

For  an  image  obtained  by  analytically  differentiating 
the  displayed  surface,  convergence  was  obtained  in  the 
same  number  of  steps.  The  average  and  maximal  errors 
were  .8  and  1.6  (the  latter  obtained  at  the  image  bound¬ 
ary),  compared  with  a  range  for  the  surface  height  of  25. 
The algorithm has also been applied to a noisy image of
this  surface;  the  result  is  a  noisy  approximation  of  the 
surface.  The  convergence  time  is  longer  since  the  steep¬ 
est  descent  curves  lengthen  due  to  wiggling.  The  surface 
was  also  reconstructed  assuming  oblique  lighting. 

For  comparison,  Figs.  2,  3  display  the  result  of  apply¬ 
ing  our  implementation  of  Horn’s  algorithm  [4]  to  a  sim¬ 
ilar  surface.  The  intensity  is  computed  differently  than 
before,  using  the  discrete  forward  derivatives  appropri¬ 
ate  for  this  algorithm.  Even  after  3072  iterations,  the 
algorithm  has  not  converged  to  the  correct  solution.  We 
have  also  implemented  the  variational  algorithms  of  [8] 
and [19], and applied them to this surface with similar
results.  As  also  noted  by  [9],  standard  variational  al¬ 
gorithms  often  give  a  wrong,  saddle-shaped  surface  for 
such  simple  images  containing  one  singular  point. 

Figure  4  shows  a  more  complicated  128  by  128  sur¬ 
face.  As  for  Figure  1,  the  intensity  was  first  computed 
assuming  the  discretisation  of  eq.  3.15.  The  algorithm 
this  time  incorporated  a  terminal  cost — an  initial  value 
for  U — which  was  large  everywhere  but  at  the  concave 
singular  points.  At  these  points,  U  was  initialised  to  the 
known  height  values.  For  vertical  light,  the  algorithm 
converged  to  a  perfect  reconstruction  of  the  original  sur¬ 
face  in  100  iterations.  As  expected,  the  convergence 
time  is  on  the  order  of  the  longest  optimal  trajectory. 
When  the  intensity  was  derived  analytically,  the  algo¬ 
rithm  again  converged  in  100  iterations,  with  an  average 
error  of  1.6  compared  to  a  surface  range  of  51  (Figure 
5).  Because  the  surface  does  not  obey  the  boundary  con¬ 
dition  that  it  is  decreasing  in  from  the  boundary  (Sec¬ 
tion  3),  the  reconstruction  is  incorrect  in  places  at  the 
boundary,  though  it  is  good  in  the  interior.  This  is  clear 
in  Figure  6,  which  displays  the  difference  between  the  re¬ 
construction  and  the  original  surface.  This  surface  was 
also  reconstructed  assuming  oblique  light  at  an  angle  of 
17.5°  to  the  vertical.  For  an  intensity  derived  as  for 
eq.  3.15,  convergence  to  within  one  part  in  10~^  was 
obtained  within  120  iterations.  Reconstruction  for  the 
analytically-derived  intensity  function  was  also  obtained 
in  about  120  iterations,  with  an  average  error  of  2.2.  As 
previously,  the  reconstruction  was  good  in  the  interior 
but  incorrect  along  one  boundary  (Figure  7). 

Figure  8  shows  the  result  for  vertical  light  of  applying 
the  algorithm  without  the  terminal  cost.  The  algorithm 
reconstructs  a  surface  that  is  locally  concave  at  all  sin¬ 
gular  points;  it  is  correct  in  the  neighborhood  of  those 
singular  points  where  the  surface  is  in  fact  locally  con¬ 
cave.  Note  the  sharp  orientation  discontinuities  at  the 
boundaries  between  the  regions  associated  with  different 
singular  points. 

Finally,  our  algorithm  has  been  applied  to  the  real 
200  X  200  image  shown  in  Figure  9,  which  was  provided 
to  us  by  Yvan  Leclerc  of  SRI.  The  light  is  from  above  at 
(0,  .488,  .873).  For  the  reconstruction,  just  one  singular 
point  was  used,  located  on  the  tip  of  the  nose,  although 
the  image  actually  contains  several.  This  has  the  effect 
of  planing  down  the  surface  bumps  associated  with  the 
other  singular  points.  Figure  10  shows  the  reconstruc¬ 
tion  obtained  using  Gauss-Seidel  after  80  iterations,  il¬ 


luminated  from  the  same  direction  as  the  original.  Fig¬ 
ure  11  shows  the  reconstruction  illuminated  from  below. 
Convergence  has  nearly  been  achieved,  apart  from  small 
patches  near  the  occluding  boundary.  This  reconstruc¬ 
tion  took  about  2  minutes  of  CPU  time  on  a  DEC  5000 
work  station.  Standard  variational  algorithms  typically 
require  thousands  of  iterations  [4].  Convergence  was 
complete  after  about  200  iterations;  the  result,  shown 
in  Figure  12,  differs  little  from  Figure  11.  Finally,  Fig¬ 
ure  13  shows  the  surface  reconstruction. 

For comparison, Figures 14 and 15 display the reconstruction
obtained by the authors of [8] using a more standard
variational  method  [8],  developed  for  the  purpose  of  in¬ 
cluding  stereo  information.  Stereo  information  was  used 
as  an  initial  condition  for  this  reconstruction. 
Fig. 1 Parabolic surface.
Fig. 2 Horn's algorithm: 128 iterations.
Fig. 3 Horn's algorithm: 3072 iterations.
Fig. 10 Reconstruction lighted from above.
Fig. 11 Reconstruction lighted from below.
Fig. 12 Final reconstruction.
Fig. 13 Surface reconstruction.
Fig. 14 Reconstruction [8] lighted from above.
Fig. 15 Reconstruction lighted from below.
Abstract

This  paper  details  the  use  of  mathemati¬ 
cal  techniques  from  computational  geom¬ 
etry  and  fuzzy  set  theory,  for  the  geomet¬ 
ric  classification  of  a  set  of  points.  This 
investigation  was  encouraged  by  a  study 
of  topological  visual  navigation  in  two- 
dimensional  spaces.  The  navigational  sys¬ 
tem  employed  extracts  information  from 
the  environment  in  the  form  of  landmarks 
and  represents  these  landmarks  as  points 
in  an  abstracted  map  of  the  original  en¬ 
vironment.  Left  with  only  point-like  ob¬ 
jects  of  the  environment  it  is  necessary 
to  describe  the  point-like  objects  in  terms 
of  their  interrelationship.  This  paper  de¬ 
scribes  why  point-like  objects  were  cho¬ 
sen  to  represent  the  environment,  why  the 
point-like  objects  are  described  by  the  ge¬ 
ometrical  shapes  they  form,  and  just  how 
these  tasks  are  performed.  The  use  of 
the  convex  hull  of  a  set  of  points  and  the 
use  of  membership  functions  from  fuzzy 
set  theory  are  employed  to  classify  the 
point-like  objects  into  such  basic  geomet¬ 
ric  shapes  as  triangles,  squares,  rectangles, 
and  other  n-sided  polygons,  such  as  pen¬ 


tagons,  hexagons  and  octagons.  Pattern 
recognition,  spatial  analysis,  and  topolog¬ 
ical  navigation,  are  areas  that  can  benefit 
from  this  study. 

1  Introduction 

The  problem  of  extracting  the  geometric 
shape  from  a  set  of  points  arose  from  work 
in  the  area  of  topological  visual  naviga¬ 
tion  in  two-dimensional  spaces  [5].  Tra¬ 
ditional  visual  navigation  has  depended 
primarily  on  quantitative  measures  that 
are  captured  in  the  form  of  metric  maps. 
Direction-giving in such a system relies on
the  metric  map  for  travel  distance  and  di¬ 
rectional  information.  Direction-giving  in 
the  topological  visual  navigation  system 
employed  in  this  paper,  relies  on  land¬ 
marks  instead  of  metric  maps.  A  land¬ 
mark  is  [5] 

an  object  that  can  be  recognized 
by  a  navigator  along  a  path  ...  re¬ 
gardless  of  the  intervening  object 
nodes.  Thus  a  tree  in  a  desert  is 
a  landmark,  but  a  tree  in  a  forest 
is  not. 
These landmarks are viewed aerially, as
from  a  helicopter,  therefore  there  is  no  oc¬ 
clusion  of  landmarks  and  an  abundance 
of  information  regarding  the  environment 
can  be  extracted.  However,  much  of 
this  information  can  be  disregarded  and 
considered  extraneous  for  the  navigation 
problem.  Much  of  the  aerial  view  of  an 
environment  consists  of  empty  space  be¬ 
tween  objects.  This  fact  can  be  used  to 
direct  navigation  according  to  the  arrange¬ 
ment  of  the  objects,  rather  than  by  their 
distance.  In  this  case  an  object  -  or  inter¬ 
changeably  a  landmark  -  can  be  considered 
a  point  on  our  map  of  the  environment. 

It  would  seem,  off  hand,  that  reducing 
our  environment  to  point-like  objects  sim¬ 
plifies  our  problem,  but  it  in  fact  creates 
some  new  ones.  Attributes  such  as  size, 
color and shape are no longer available.
We no longer can describe an object as
“the big red block”. Instead we have a set
of  point-like  objects  with  no  size,  color  or 
shape  distinction.  Nevertheless,  the  choice 
of  reducing  our  problem  to  point  sets  is 
made  because  any  object  can  be  repre¬ 
sented  as  a  point  and  it  leads  to  an  in¬ 
tuitively  appealing  way  of  characterizing 
geometric form. It further allows a qualitative
and quantitative way of describing
shape  form.  The  approach  in  this  paper 
has  been  a  qualitatively  descriptive  one, 
as  opposed  to  a  very  rigorous  quantitative 
approach. 

There  are  several  reasons  for  choosing  a 
qualitative  description  of  form:  it  is  less 
prone  to  error,  and  more  intuitive  than  a 
quantitative  description.  It  is  less  prone  to 
error  because  the  shape  form  does  not  rely 


solely  on  metrics  for  its  description.  It  is 
more  intuitive  because  people  in  essence 
do  not  speak  in  terms  of  inches  and  de¬ 
grees  but  rather  in  terms  of  shapes  and 
their  spatial  orientation.  As  an  exam¬ 
ple,  consider  the  problem  of  asking  a  per¬ 
son  to  single  out  a  point  among  a  set  of 
points.  It  is  more  likely  that  the  per¬ 
son  will  describe  the  point  with  regard  to 
those points around it than to describe
its  exact  coordinate  location.  The  descrip¬ 
tion  given  might  be  any  of  the  following: 
most  isolated  point,  directly  north  of  the 
bottom-most  point,  second  point  from  the 
top,  in  the  middle  of  the  points  forming 
the  square,  etc. 

After  simplifying  our  environment  to 
include  only  point-like  objects,  the  next 
step  is  the  actual  characterization  of  these 
point  sets  into  some  geometric  shape.  It  is 
beneficial  to  characterize  points  into  geo¬ 
metric  shapes  because  they  are  the  build¬ 
ing  blocks  of  more  complex  shapes,  and 
such  a  classification  is  invariant  to  rota¬ 
tion,  scale,  and  translation.  As  mentioned 
earlier,  to  qualitatively  describe  points  it 
is  necessary  to  study  their  relationship 
with those points around them. The approach
in this paper is to describe points
as they interrelate geometrically. Other
approaches  to  point  set  analysis  include 
[6]  which  describes  a  methodology  for  de¬ 
scribing  the  internal  shape  of  a  set  of 
points  by  using  a  measure  of  neighbor¬ 
liness  of  points.  The  paper  [1]  devel¬ 
oped  a  generalization  of  the  convex  hull 
to  describe  the  external  shape  of  a  set 
of  points.  The  paper  [2]  developed  tech¬ 
niques  to  analyze  sparse  images,  which 
are represented as sets of points in two-dimensional
space. The paper [9] investigates
the problem of connecting dots into
orthogonal  polygons.  The  following  sec¬ 
tions  go  on  to  describe  how  to  classify  a 
finite  set  of  points  into  geometric  shapes, 
such  as  triangles,  squares,  rectangles,  and 
other  n-sided  polygons,  such  as  pentagons, 
hexagons,  and  octagons,  using  techniques 
from  computational  geometry  and  fuzzy 
set  theory. 

2  The  Basic  Theoretical 
Foundation 

As  explained  in  the  introduction,  this  pa¬ 
per  aims  to  classify  the  shape  of  a  set  of 
points  using  the  convex  hull  of  a  point  set 
and  membership  functions  from  fuzzy  set 
theory.  For  completeness  sake,  this  section 
will  begin  with  a  brief  review  of  the  convex 
hull  and  fuzzy  set  theory.  The  following 
sections  will  contain  a  more  detailed  de¬ 
scription  of  the  actual  membership  func¬ 
tions  used  in  ultimately  determining  the 
geometrical  shape  of  a  set  of  points. 

2.1  The  Convex  Hull  Of  A  Set  Of 
Points 

The  first  step  in  characterizing  the  geo¬ 
metrical  shape  of  a  set  of  points  is  to  de¬ 
termine  the  convex  hull  of  the  point  set. 
The convex hull of a set of points is the
smallest convex polygon that contains the
set; its vertices are the extreme points of
the set, and as such it provides a suitable
description of the shape of the point set, as
shown in figure 1. The weakness of the convex


hull  is  in  choosing  the  extreme  points  as 
representative  of  essential  points.  In  other 
words,  the  extreme  points  may  not  neces¬ 
sarily  be  the  best  descriptors  of  the  shape 
of  the  point  set;  some  interior  points  may 
be  better  suited  to  be  the  descriptors  of 
shape. The convex hull, however, has been
well studied: there are many efficient
algorithms for computing it, it provides
the necessary properties for classifying the
shape of a point set, and it is invariant to
rotation, translation, and scale. It is for
these properties that the convex hull is
employed to tackle the task of classifying the
geometrical shape of a set of points.
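
As an illustration of this first step, the sketch below computes the hull vertices of a small point set. It uses Andrew's monotone chain method, which is one of the standard efficient algorithms and is chosen here only for brevity; it is not necessarily the algorithm of the papers cited in this text, and the function names are hypothetical.

```python
def cross(o, a, b):
    """Z-component of (a - o) x (b - o); positive for a counter-clockwise turn."""
    return (a[0] - o[0]) * (b[1] - o[1]) - (a[1] - o[1]) * (b[0] - o[0])

def convex_hull(points):
    """Return the hull vertices in counter-clockwise order (monotone chain).

    Collinear points on the hull boundary are dropped, so only the
    extreme points remain.
    """
    pts = sorted(set(points))
    if len(pts) <= 2:
        return pts
    lower = []
    for p in pts:
        while len(lower) >= 2 and cross(lower[-2], lower[-1], p) <= 0:
            lower.pop()
        lower.append(p)
    upper = []
    for p in reversed(pts):
        while len(upper) >= 2 and cross(upper[-2], upper[-1], p) <= 0:
            upper.pop()
        upper.append(p)
    # The last point of each chain is the first point of the other.
    return lower[:-1] + upper[:-1]

if __name__ == "__main__":
    landmarks = [(0, 0), (2, 0), (2, 2), (0, 2), (1, 1), (1, 0)]
    print(convex_hull(landmarks))
```

For the six sample points, the interior point and the point on the bottom edge are discarded, leaving the four corners; the number of hull vertices is what the later classification step branches on.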

2.2  A  Brief  Overview  Of  Fuzzy 
Set  Theory 

The theory of fuzzy sets is used to represent
uncertainty, information, and complexity [7].
The theory of classical sets (also known as
crisp sets), on the other hand, represents
certainty. A classical set divides the world
into two groups: those that certainly belong
to a set and those that certainly do not belong
to a set. A fuzzy set, on the other hand,
divides the world much more loosely, by
introducing vagueness into the grouping
process. This means that members of a
set belong to that set to a greater or lesser
degree than other members of the set.
Mathematically, members of the set are
assigned a membership grade value that
indicates to what degree they belong to
the set. This membership grade is usually
a real number in the closed interval between
0 and 1. Therefore a member that
has a membership grade closer to 1 belongs
to the set to a greater degree than a
member with a lower membership grade.
Because of its properties, fuzzy set theory
can find application in fields that study
how we assimilate information, recognize
patterns, and simplify complex tasks.

The  theory  of  fuzzy  sets  applies  quite 
naturally  to  the  task  set  forth  in  this  paper 
because  we  seek  “approximate”  measure¬ 
ments  and  not  “exact”  ones  for  determin¬ 
ing  the  geometrical  shape  of  our  point  set. 
In  other  words  a  rectangular-like  figure 
with angles 89°, 88°, 90°, and 93° should
be  classified  as  a  rectangle,  even  though 
the  angles  aren’t  all  exactly  90°.  The  fol¬ 
lowing  section  will  describe  just  how  to 
make  such  a  classification. 

3  The  Geometric  Shape 
Of  A  Set  Of  Points 

The  first  step  in  extracting  the  geomet¬ 
ric  shape  from  a  set  of  points  is  to  com¬ 
pute  the  convex  hull  of  the  points.  Pa¬ 
pers  [4,  3]  each  detail  an  algorithm  for 
computing the convex hull. The number
of points of the convex hull then dictates
the course of action to follow. Two
points obviously describe a line. Three
points can describe a right triangle, isosceles
triangle, equilateral triangle, or ordinary
triangle. Four points can describe a
square, rectangle, rhombus, or trapezoid.
There is no specific classification for
n-sided  polygons  where  n  is  greater  than 
four,  but  such  polygons  can  be  classified  as 


pentagons,  hexagons,  heptagons,  and  oc¬ 
tagons.  As  mentioned  in  section  2.2,  mem¬ 
bership  functions  allow  for  “approximate” 
measurements,  and  as  such  they  allow  for 
the classification of “approximate” polygons.
The paper [8] provides the various
membership functions that will allow the
classification of the aforementioned geometric
shapes. The following subsections will
present  these  membership  functions  and 
describe  how  they  are  used  to  classify  the 
“approximate”  geometric  shape  of  a  set  of 
points. 

3.1  Triangle  Classification 

As  mentioned  in  the  introduction  to  this 
section, a triangle can have various classifications:
right, isosceles, equilateral, and
ordinary.  To  determine  which  of  the  var¬ 
ious  triangle  classifications  a  set  of  three 
points  can  form,  the  following  membership 
functions  are  used,  see  also  figure  2: 

$$\mu_{right} = 1 - p_{right} \min\{|A - 90^\circ|, |B - 90^\circ|, |C - 90^\circ|\} / 180^\circ$$

$$\mu_{isosceles} = 1 - p_{isosceles} \min\{|A - B|, |B - C|, |C - A|\} / 180^\circ$$

$$\mu_{equilateral} = 1 - p_{equilateral} \max\{|A - B|, |B - C|, |C - A|\} / 180^\circ$$

A, B and C are the angles formed between
the points of the convex hull, the
various $\mu$'s are the membership grades and
the $p$'s are constants.
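
The three membership grades can be sketched in a few lines. The code below is an illustration, not the authors' implementation: it assumes the grades take the min/max form over the angle differences, with the constants passed as parameters whose default values (2, 3 and 1) are the ones this section obtains from the degenerate worst cases. The function names are hypothetical.

```python
import math

def interior_angles(p, q, r):
    """Interior angles in degrees at the vertices p, q and r of a triangle."""
    def angle_at(a, b, c):
        v1 = (b[0] - a[0], b[1] - a[1])
        v2 = (c[0] - a[0], c[1] - a[1])
        dot = v1[0] * v2[0] + v1[1] * v2[1]
        crs = v1[0] * v2[1] - v1[1] * v2[0]
        return math.degrees(math.atan2(abs(crs), dot))
    return angle_at(p, q, r), angle_at(q, r, p), angle_at(r, p, q)

def membership_grades(A, B, C, p_right=2.0, p_isosceles=3.0, p_equilateral=1.0):
    """Membership grades of a triangle with angles A, B, C (degrees)."""
    diffs = [abs(A - B), abs(B - C), abs(C - A)]
    return {
        "right": 1.0 - p_right * min(abs(A - 90.0), abs(B - 90.0), abs(C - 90.0)) / 180.0,
        "isosceles": 1.0 - p_isosceles * min(diffs) / 180.0,
        "equilateral": 1.0 - p_equilateral * max(diffs) / 180.0,
    }

if __name__ == "__main__":
    A, B, C = interior_angles((0.0, 0.0), (1.0, 0.0), (0.0, 1.0))
    print(A, B, C)
    print(membership_grades(A, B, C))
```

With the default constants each grade lies between 0 and 1 for any valid set of angles, reaching 0 only for the degenerate "triangles" that are least like the shape in question.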

To determine the constants, each membership function is evaluated at the
angles A, B, and C that are least like the triangle in question, and the
constant is chosen so that the membership grade there is zero. For
example, a "triangle" with angles A = 180°, B = 0°, and C = 0° can be
assumed to be most unlike a right triangle, therefore

$$\mu_{right} = 0 = 1 - p_{right}\cdot 90^\circ/180^\circ \quad\Rightarrow\quad p_{right} = 2$$
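The three membership functions and the worst-case normalization can be sketched in Python as follows. This is only an illustration: the function names are hypothetical, and only the worst case for the right triangle (180°, 0°, 0°) is given in the text; the worst cases used for the other two constants are assumptions chosen because they reproduce the constants the paper reports.

```python
def triangle_grades(A, B, C, p_right=2.0, p_isosceles=3.0, p_equilateral=1.0):
    """Membership grades for interior angles A, B, C given in degrees."""
    pair_diffs = (abs(A - B), abs(B - C), abs(C - A))
    return {
        # A right triangle needs only one angle close to 90 degrees: use min.
        "right": 1 - p_right * min(abs(A - 90), abs(B - 90), abs(C - 90)) / 180,
        # An isosceles triangle needs only one pair of equal angles: use min.
        "isosceles": 1 - p_isosceles * min(pair_diffs) / 180,
        # An equilateral triangle needs every pair equal: use max.
        "equilateral": 1 - p_equilateral * max(pair_diffs) / 180,
    }


def constant_from_worst_case(term_in_degrees):
    """Choose p so that 1 - p * term / 180 is zero at the least-similar shape."""
    return 180 / term_in_degrees


# Worst case stated in the text: angles (180, 0, 0) for the right triangle.
P_RIGHT = constant_from_worst_case(min(abs(180 - 90), abs(0 - 90), abs(0 - 90)))
# Assumed worst cases (not stated in the text): (0, 60, 120) and (180, 0, 0).
P_ISOSCELES = constant_from_worst_case(min(abs(0 - 60), abs(60 - 120), abs(120 - 0)))
P_EQUILATERAL = constant_from_worst_case(max(abs(180 - 0), abs(0 - 0), abs(0 - 180)))
```

With these constants every grade lies between 0 and 1, so a single threshold can be compared against all three.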


Adopting the same approach for the other constants yields
$p_{isosceles} = 3$ and $p_{equilateral} = 1$. To determine whether three
points form any of the aforementioned triangles, a threshold value
$\delta$ must be chosen. The threshold value is chosen according to the
degree of exactness required for the point classification: a greater
threshold value demands a more exact classification and a lower threshold
value permits a looser classification.

After applying the angles to each of the membership functions, the one
with a membership grade greater than the threshold value is chosen as the
triangle that most closely characterizes the set of points. If more than
one membership grade is greater than the threshold, then there is a
possibility that the triangle should be classified as belonging to both
sets. Various functions can be created to test for this property. One
such function calculates the product of both membership grades. If the
product is greater than the threshold, then the triangle is classified as
belonging to both sets. For example, if $\mu_{isosceles} = 0.9$ and
$\mu_{right} = 0.9$, their product is 0.81, and for any threshold below
0.81 the triangle is classified as a right isosceles triangle. If,
however, the product is not greater than the threshold, then the maximum
of the membership grades is chosen. If none of the membership grades is
greater than the threshold value, then the set of points is considered to
form an ordinary triangle.

3.2  Quadrangle  Classification


As  mentioned  in  the  introduction  to  this 
section,  four  points  can  have  various  clas¬ 
sifications:  trapezoid,  rectangle,  rhombus, 
and  square.  To  determine  which,  if  any,  of 
these  various  quadrangle  classifications  a 
set  of  four  points  can  form,  the  following 
membership  functions  are  used,  see  also 
figure  3: 

$$\mu_{trapezoid} = 1 - p_{trapezoid}\,\min\{|A + B - 180^\circ|,\ |B + C - 180^\circ|\}/180^\circ$$

$$\mu_{rectangle} = 1 - p_{rectangle}\,\{|A - 90^\circ| + |B - 90^\circ| + |C - 90^\circ| + |D - 90^\circ|\}/90^\circ$$




$$\mu_{rhombus} = 1 - p_{rhombus}\,\frac{\max\{|a - b|,\ |b - c|,\ |c - d|\}}{a + b + c + d}$$

$$\mu_{square} = \mu_{rectangle} \times \mu_{rhombus}$$

A, B, C, and D are the angles formed between the points of the convex
hull, the various $\mu$'s are the membership grades, the $p$'s are
constants, and a, b, c, and d are the lengths between the points. The
$p$'s are calculated the same way they were for the triangle cases. For
example, a quadrangle with angles A = 180°, B = 180°, C = 0°, and D = 0°
can be assumed to be most unlike a rectangle, therefore:

$$\mu_{rectangle} = 0 = 1 - p_{rectangle}\cdot 360^\circ/90^\circ \quad\Rightarrow\quad p_{rectangle} = 1/4$$


Using the same technique as for $p_{rectangle}$ yields
$p_{trapezoid}$ = [illegible]. The membership function for a square is
the product of the membership grades of a rhombus and a rectangle because
neither angles alone nor lengths alone are sufficient to classify the
square; information regarding both is needed.
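A short Python sketch of the quadrangle membership functions may help make the product rule for the square concrete. The names are hypothetical. Two assumptions are made: the value of p_trapezoid is not legible in this copy, so the caller must supply it, and the rhombus grade compares the three consecutive side differences |a - b|, |b - c|, |c - d|, which is one reading of a formula that is hard to make out in the source.

```python
def quadrangle_grades(angles, sides, p_rectangle=0.25, p_rhombus=1.0, p_trapezoid=None):
    """Membership grades for angles (A, B, C, D) in degrees and sides (a, b, c, d)."""
    A, B, C, D = angles
    a, b, c, d = sides
    # Rectangle: all four angles must be near 90 degrees, so deviations are summed.
    rectangle = 1 - p_rectangle * sum(abs(x - 90) for x in angles) / 90
    # Rhombus: side lengths only; the set of differences used is an assumption.
    rhombus = 1 - p_rhombus * max(abs(a - b), abs(b - c), abs(c - d)) / (a + b + c + d)
    grades = {
        "rectangle": rectangle,
        "rhombus": rhombus,
        # A square needs both angle and length information: product of the two.
        "square": rectangle * rhombus,
    }
    if p_trapezoid is not None:
        # Trapezoid: one pair of adjacent angles must be supplementary.
        grades["trapezoid"] = (
            1 - p_trapezoid * min(abs(A + B - 180), abs(B + C - 180)) / 180
        )
    return grades
```

For a 2 x 1 rectangle the rectangle grade is 1 while the rhombus grade, and therefore the square grade, drops to 5/6, which is how the product keeps a non-square rectangle from scoring as a perfect square.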

A rhombus requires information on its sides for classification.
Therefore, to compute $p_{rhombus}$ it is assumed that the shape least
like a rhombus has length a equal to some real value, and lengths b, c,
and d equal to zero, therefore:

$$\mu_{rhombus} = 0 = 1 - p_{rhombus}\,\frac{a}{a} \quad\Rightarrow\quad p_{rhombus} = 1$$

To classify a quadrangle as either a square, rectangle, rhombus, or
trapezoid, the same method as the triangle classification is employed:
choose a threshold value and find the maximum among the membership
grades. If none is greater than the threshold value, then the quadrangle
is classified as an ordinary quadrangle.

3.3  N-Sided  Polygon  Classification

The previous two subsections described the membership functions that
allow for the classification of various triangles and quadrangles. This
subsection illustrates the generalized formula for classifying n-sided
polygons, and the generalized constant, $p$. The membership function is

$$\mu_{n\text{-}sided} = 1 - p_{n\text{-}sided}\,\max\left\{\left|A_1 - \frac{(n-2)\times 180^\circ}{n}\right|,\ \left|A_2 - \frac{(n-2)\times 180^\circ}{n}\right|,\ \ldots,\ \left|A_n - \frac{(n-2)\times 180^\circ}{n}\right|\right\}/180^\circ$$


The same approach used to calculate the $p$'s for the triangle and
quadrangle membership functions is used to calculate the $p$ for this
function: the shape most unlike the shape we wish to classify is assumed.
In the case of this function, the result of applying the max portion of
the function will always produce $(n-2)\times 180^\circ/n$, so that

$$\mu_{n\text{-}sided} = 0 = 1 - p_{n\text{-}sided}\,\frac{(n-2)\times 180^\circ/n}{180^\circ}$$

$p_{n\text{-}sided}$ can be generalized to be

$$p_{n\text{-}sided} = \begin{cases} 1 & \text{for } n = 3 \\ \dfrac{n}{n-2} & \text{for } n > 3 \end{cases}$$

Using $p_{n\text{-}sided}$ as just illustrated yields
$p_{pentagon} = 5/3$, $p_{hexagon} = 3/2$, $p_{heptagon} = 7/5$,
and $p_{octagon} = 4/3$.
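The generalized constant and the n-sided membership grade can be checked with a few lines of Python. The function names are hypothetical, and the sketch deliberately covers only n > 3, leaving three points to the triangle functions of section 3.1; exact fractions are used so that the constants quoted above can be compared directly.

```python
from fractions import Fraction


def p_n_sided(n):
    """Generalized constant n / (n - 2) for polygons with more than three sides."""
    if n <= 3:
        raise ValueError("this sketch only covers n > 3")
    return Fraction(n, n - 2)


def n_sided_grade(angles):
    """Grade of a polygon against the regular n-gon; angles are in degrees."""
    n = len(angles)
    interior = (n - 2) * 180 / n  # interior angle of the regular n-gon
    largest_deviation = max(abs(A - interior) for A in angles)
    return 1 - float(p_n_sided(n)) * largest_deviation / 180
```

A regular pentagon (all angles 108°) receives grade 1, while a five-point set with one angle of 0° receives a grade of approximately 0, matching the worst case used to derive the constant.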

As was the case with the triangle and quadrangle classification, a
threshold value is chosen and compared to the membership grade for the
particular n points we wish to classify. If the membership grade is
greater than the threshold value, then the n points are classified
appropriately, e.g. if the number of points is five then the polygon is
classified as being a pentagon. If the membership grade is not greater
than the threshold value, then the n points are classified as forming an
ordinary n-sided polygon.

A line with n points can also be classified using the membership
function approach. The following membership function is employed:

$$\mu_{line} = \max\{|A_1 - A_2|,\ |A_2 - A_3|,\ \ldots,\ |A_n - A_1|\}/180^\circ$$

Before any of the aforementioned membership functions are computed, the
line membership function should be computed, because $\mu_{isosceles}$
and $\mu_{trapezoid}$ will yield a value of one when the points form a
line or an "approximate" line. For example, a perfectly straight line of
three points consists of angles 0°, 180°, and 0°; when these angles are
evaluated using the isosceles membership function the result is one, and
the straight line will be classified as an isosceles triangle. An
appropriate threshold value should be selected, and once the line
membership grade no longer exceeds the threshold value, the other
membership functions should be employed.
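The ordering described above (line test first, then the triangle rule of section 3.1 with its product test) can be sketched for three points as follows. The function names, the default threshold of 0.8, and the choice to apply the product test to the two highest grades are assumptions made for illustration; the paper leaves the exact joint-membership function open.

```python
def line_grade(angles):
    """mu_line: largest difference between consecutive angles, over 180 degrees."""
    n = len(angles)
    return max(abs(angles[i] - angles[(i + 1) % n]) for i in range(n)) / 180


def classify_three_points(A, B, C, threshold=0.8):
    """Label three hull points from their angles in degrees."""
    # The line test comes first, otherwise a straight line scores 1 as isosceles.
    if line_grade((A, B, C)) > threshold:
        return "line"
    diffs = (abs(A - B), abs(B - C), abs(C - A))
    grades = {
        "right": 1 - 2 * min(abs(A - 90), abs(B - 90), abs(C - 90)) / 180,
        "isosceles": 1 - 3 * min(diffs) / 180,
        "equilateral": 1 - 1 * max(diffs) / 180,
    }
    # Grades above the threshold, highest first.
    passing = sorted(((g, name) for name, g in grades.items() if g > threshold),
                     reverse=True)
    if not passing:
        return "ordinary triangle"
    if len(passing) > 1 and passing[0][0] * passing[1][0] > threshold:
        # Product rule: the two best classes hold jointly.
        return " ".join(sorted([passing[0][1], passing[1][1]])) + " triangle"
    # Otherwise fall back to the single maximum grade.
    return passing[0][1] + " triangle"
```

For angles (70°, 60°, 50°) both the isosceles and equilateral grades exceed 0.8, but their product (about 0.74) does not, so the sketch falls back to the larger grade.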

3.4  Interior  Points 

The discussion up to this point has centered on the shapes formed by the
exterior points of the point set, but the interior points should also be
considered. Because the focus of this paper is to describe the
geometrical shapes of a set of points, the interior points of the convex
hull are also described geometrically. The method is to compute the
convex hull of the remaining points and apply the membership functions to
determine its geometrical shape. The method is repeated as long as more
than two points remain whose convex hull has not been determined. In this
way a nesting of geometrical shapes is produced, as shown in figure 4.

A problem arises when the interior points are not as neatly arranged as
shown in figure 4, but rather look like those of figure 5. The nesting
algorithm would classify A in figure 5 as a square within a square and
classify B as a seven-sided polygon. It is apparent, however, that such a
classification is not necessarily the most intuitive for the particular
arrangement of points. The problem lies in the fact that the shape
description has been confined to shapes that have specific names
associated with them, e.g. squares, triangles, etc. Descriptions for
shapes like those of figure 5 do not structurally exist. What does exist
is a manner of describing the general arrangement of their points. For
example, figure B can be described as consisting of one right angle and a
bowl-shaped arrangement of the remaining points.

A possible solution to this problem would be to come up with a grammar
for describing those shapes that do not fall under any of the prescribed
shapes mentioned so far, but that nevertheless have their points
distinctly arranged. Some possible descriptions may include:
sharp-angled, curved, S-shaped, L-shaped, bowl-shaped, etc. The list can
easily grow to formidable lengths. Obviously some restrictions must be
made in governing what grammars to choose. Work in this area is currently
in progress.
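The nesting procedure described at the start of this subsection (take the convex hull, remove its points, repeat while more than two points remain) can be sketched as follows. The paper does not say which hull algorithm it uses; Andrew's monotone chain is substituted here purely as a convenient choice, and all names are hypothetical.

```python
def _cross(o, a, b):
    """z-component of (a - o) x (b - o); positive for a counter-clockwise turn."""
    return (a[0] - o[0]) * (b[1] - o[1]) - (a[1] - o[1]) * (b[0] - o[0])


def convex_hull(points):
    """Monotone-chain hull of 2D points; collinear boundary points are dropped."""
    pts = sorted(set(points))
    if len(pts) <= 2:
        return pts
    lower, upper = [], []
    for p in pts:
        while len(lower) >= 2 and _cross(lower[-2], lower[-1], p) <= 0:
            lower.pop()
        lower.append(p)
    for p in reversed(pts):
        while len(upper) >= 2 and _cross(upper[-2], upper[-1], p) <= 0:
            upper.pop()
        upper.append(p)
    return lower[:-1] + upper[:-1]


def nested_hulls(points):
    """Peel hulls from the outside in; returns (list of hulls, leftover points)."""
    remaining = list(dict.fromkeys(points))  # drop duplicates, keep order
    layers = []
    while len(remaining) > 2:
        hull = convex_hull(remaining)
        layers.append(hull)
        on_hull = set(hull)
        remaining = [p for p in remaining if p not in on_hull]
    return layers, remaining
```

Each returned hull can then be passed to the membership functions of sections 3.1 to 3.3; one or two leftover points, if any, are returned separately.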

4  Conclusion 

This paper has explored an approach to the problem of extracting the
geometric shape from a set of points, in relation to the problem of
visual navigation. The natural next step is to devise an approach that
uses the ideas put forth in this paper to further describe the point-like
objects. For example, a point-like object can be described as belonging
to a right triangle, and can be further described as being the point that
forms the right angle. Another step currently under way is devising a
grammar to describe those shapes that cannot be classified by the
geometric shapes presented in this paper. It is hoped that these
approaches to point description will lead to a qualitative way of
direction-giving in visual navigation. This approach, however, is not
limited to problems of visual navigation alone; it is clearly applicable
to such areas as pattern recognition and spatial analysis.

Abstract

We present a model for flexible extruded objects, such as wires, tubes,
or grommets, and demonstrate a novel, self-adjusting seven-dimensional
Hough transform that derives and analyzes their three-space curved axes
from position and surface normal information. The method is purely local
and is very cheap to compute. The model considers such objects as
piecewise toroidal, and decomposes the seven parameters of a torus into
three nested subspaces, the structure of which counteracts the errors
implicit in the analysis of objects of great size and/or small curvature.
We believe it is the first example of a parameter space structure
designed to cluster ill-conditioned hypotheses together so that they can
be easily detected and ignored. This work complements existing
shape-from-contour approaches for analyzing tori: it uses no edge
information, and it does not require the solution of high-degree
non-linear equations by iterative techniques. Most of the results,
including the conditions for the existence of more than one solution
(phantom "anti-tori"), have been verified using a symbolic mathematical
analysis system. We present, in the environment of the IBM ConVEx system,
robust results on both synthetic CAD-CAM range data (the hasp of a lock),
and actual range data (a knotted piece of coaxial cable), and discuss
several system tuning issues.

1.  Introduction 

We consider the problem of analyzing dense depth images to determine the
parameters of flexible extruded objects. Our approach views such objects
as being piecewise toroidal, since the torus is the simplest solid
geometric object whose generating axis exhibits curvature. This is
justified by our observation that many piecewise sections of flexible
extruded objects have co-planar spines, either because they are easy to
manufacture that way, or because they represent a minimum energy
configuration (these reasons are often equivalent). However, a torus is
an object with seven free parameters: three of position, two of
orientation, one of size, and one of relative "thickness". This pushes
the limits of the current state of the art in parsing high-degree
parametric object models, particularly since the parameters of position
and size are potentially unbounded, especially for objects that are
nearly cylindrical.

Following what might be called the principle of least variability, we
first recover the thickness parameter, since among all seven it is the
one parameter most likely to be constant across the toroidal pieces. The
method exploits a novel but efficient algorithmic interpretation of a
result in differential geometry called Meusnier's theorem. Then, using
the knowledge gained from deriving the thickness, we recover the next
most well-behaved parameters, the orientation and size. These are
computed simultaneously and in a manner that compensates for the
ill-conditioning of orientation estimates when size is large. Lastly,
using knowledge of both thickness and size, we recover the position of
the local toroidal section, again automatically compensating for the
ill-conditioning of large and/or nearly straight objects.

The significance of the work rests on its two principal results: the
elegance and speed of the thickness-finding transform (reviewed only
briefly here; for more details see [Kender and Kjeldsen, 1990]), and the
novel way in which the parameter space structure decomposes a difficult
seven-dimensional problem so that ill-conditioned hypotheses cluster
together for easy detection and removal. The work as a whole is
applicable in vision systems wherever depth and surface orientation
(however sparse) are obtainable, particularly for those cases where
object boundaries are occluded, and where contour-based methods therefore
fail.

1.1.  The  Torus  in  Brief 

We adopt the terminology of DoCarmo [1976], and review the results on
tori that we will exploit.

A torus is a solid of revolution formed by a generating circle ("minor
circle") of radius r being swept in a circle of revolution ("major
circle") of radius a. (Note that we have reversed the sense of r and a
from DoCarmo.) Generally speaking, we will assume that r < a, that is,
that the "donut" has a "hole". This constraint will be exploited in the
decomposition of the parameter spaces.

Surface properties are best represented in the framework of the
following parameterization, which identifies the center of the torus (and
hence, the center of the major circle) with the center of a cylindrical
coordinate system, making a the cylindrical radius and v the cylindrical
angle; the axis of the torus becomes the remaining coordinate z. We
parameterize the surface of the minor circle by the angle u; unlike the
angle v, its origin is critical, and is set to equal 0 where the surface
is most distant from the torus center. The locus of points with a given v
lies on a "meridian". The locus of points with a given u lies on a
"parallel".

Fig. 1: Torus terminology.

As with any other regular surface, each point of the torus has two
principal curvatures, with directions of principal curvature that are
perpendicular to each other. Every point has one positive curvature,
$k_1$, which is oriented along the meridian. The other curvature of the
point, $k_2$, is more elaborate, although it is always oriented along the
parallel. $k_2$ is 0 at the "top" and "bottom" parallels of the torus,
where the surface is locally cylindrical. It attains its extreme positive
value at the "outermost" parallel, and its extreme negative value at the
"innermost" parallel. We note and will heavily exploit the observation
that the largest positive curvature at any point is always the value of
$k_1$.

We will adopt the following notation. A point on the torus is denoted as
$P_i$; it is considered to be a vector in three-space. We will abbreviate
the vector $P_i - P_j$ as $P_{i,j}$. Likewise, we will refer to the unit
normal vector of the surface at $P_i$ as $N_i$, and a similar comment
applies to vectors of the form $N_{i,j}$. Note that the values of the
points and the normals are available as input data, measured in the
coordinate system of the imaging apparatus. We will often refer to the
"translated" points $T_i = P_i - rN_i$; if r is the true minor radius,
then $T_i = S_i$, that is, a point on the torus spine.
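The parameterization and notation above can be made concrete with a small Python sketch. It assumes the standard torus formulas in the torus' own coordinate frame, with outward unit normals; the function names are hypothetical and are not taken from the paper.

```python
import math


def torus_point(a, r, u, v):
    """Surface point for major radius a, minor radius r; u = 0 is the outermost parallel."""
    rho = a + r * math.cos(u)  # cylindrical radius of the surface point
    return (rho * math.cos(v), rho * math.sin(v), r * math.sin(u))


def torus_normal(u, v):
    """Outward unit normal at parameters (u, v)."""
    return (math.cos(u) * math.cos(v), math.cos(u) * math.sin(v), math.sin(u))


def principal_curvatures(a, r, u):
    """(k1, k2): k1 along the meridian, k2 along the parallel."""
    k1 = 1.0 / r
    k2 = math.cos(u) / (a + r * math.cos(u))
    return k1, k2


def translated_point(P, N, r):
    """T = P - r N; equals the spine point S when r is the true minor radius."""
    return tuple(p - r * n for p, n in zip(P, N))
```

With a = 5 and r = 1, $k_2$ is 1/6 on the outermost parallel, approximately 0 on the top parallel, and -1/4 on the innermost parallel, and every translated point lands on the spine circle of radius 5 in the plane z = 0.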

2.  Computing  Minor  Radius 


It is not hard to show that when computing the parameters of a
two-dimensional surface in three-space, the information that a point is
on the surface can be used to fix one parameter, and the information that
a given vector is the normal to the surface at a point can be used to fix
two parameters. (Analogous statements can be made about curvature, but we
avoid curvature information because of its intrinsic susceptibility to
noise.) Thus, a sphere is uniquely determined by four points, or, more
simply, by two points and the normal at one of them.

Thus, two points and their normals are insufficient to uniquely determine
a torus, since they fix only six parameters. However, three points and
their normals in general overdetermine the torus in the following way.
There must exist a local space circle that serves as the torus spine, and
it must be equidistant from all three points (this common distance is the
value of r). But, in addition, this space circle must have the property
that at the point of closest approach to each of the $P_i$, the tangent
to the space circle must be perpendicular to the corresponding $N_i$. It
is the need for some means to guarantee this tangent property that leads
to the following observations and constructions.

We select a distinguished point $P_i$; without loss of generality assume
that it is the point $P_1$. We then construct all the space circles
through the translated point $T_1$ whose tangents at $T_1$ are
perpendicular to $N_1$, and that also pass through a second translated
point, call it $T_2$. There is a one degree of freedom family of such
circles, and they all lie on the surface of a sphere. That such a
construction is possible can be seen both geometrically and
algebraically; it is also a special case of the theorem of Meusnier.
(Meusnier's theorem states that if a set of planes are drawn through a
tangent to a surface in a non-zero curvature direction, then the
osculating circles of the intersections with the surface lie upon a
sphere [Struik, 1961].)

Fig. 2: Minor r extraction geometry.

Geometrically, we consider $T_1$ to be the south pole of a sphere whose
north-south axis is collinear with $N_1$. The size of the sphere is
determined by $T_2$. Call this sphere the supporting sphere. Any circle
(great or little) that passes through $T_1$ and $T_2$ now also has its
local tangent at $T_1$ perpendicular to $N_1$. Algebraically, we can look
for the size of the support sphere, s; since we know that the center is
constrained to lie along the direction of $N_1$, the center is given by
$C = T_1 + sN_1$. Both $T_1$ and $T_2$ lie on a common sphere if their
distances from the center are equal. This is captured by equating the
norms of their relative position vectors, giving in vector form the
equation of the plane of their perpendicular bisectors:
$|C - T_1| = |C - T_2|$. Expanding, we find that $s_2(r)$, that is, the
size of the support sphere needed to accommodate $T_2$, is a function of
r:


$$s_2(r) = \frac{2r\,N_2\cdot P_{1,2} - P_{1,2}\cdot P_{1,2}}{2r\,(1 - N_1\cdot N_2) + N_1\cdot P_{1,2}}$$

Note, however, that the local tangent at $T_2$ is not necessarily
perpendicular to $N_2$. Worse, $T_3$ may not even lie on the support
sphere at all.

We remedy this second defect first, by constructing with $T_1$ and $T_3$
the same sort of support sphere as was constructed with $T_1$ and $T_2$.
What results is a symmetric relationship,

$$s_3(r) = \frac{2r\,N_3\cdot P_{1,3} - P_{1,3}\cdot P_{1,3}}{2r\,(1 - N_1\cdot N_3) + N_1\cdot P_{1,3}}$$

which expresses the size of the support sphere which would not only
accommodate $T_3$, but which would also allow us to draw a
one-dimensional family of circles through $T_1$ and $T_3$ whose local
tangents at $T_1$ are perpendicular to $N_1$.

Now note that if $s_2(r) = s_3(r)$, for whatever r, the support spheres
must be coincident, since they are both defined relative to $P_1$, and
all three $T_i$ lie upon the common sphere. Basically, equating
$s_2(r) = s_3(r)$ will find those r that allow a solution to the now
overdetermined problem of finding a sphere that passes through three
points and attains a specified normal at a given point (here, the south
pole). Equating $s_2(r) = s_3(r)$ gives a quadratic in r:
$Ar^2 + Br + C = 0$, where


$$A = 2\,N_2\cdot P_{1,2}\,(1 - N_1\cdot N_3) - 2\,N_3\cdot P_{1,3}\,(1 - N_1\cdot N_2)$$

$$B = 2\,N_2\cdot P_{1,2}\;N_1\cdot P_{1,3} - 2\,N_3\cdot P_{1,3}\;N_1\cdot P_{1,2}$$

$$C = -P_{1,2}\cdot P_{1,2}\;N_1\cdot P_{1,3} + P_{1,3}\cdot P_{1,3}\;N_1\cdot P_{1,2}$$

This equation will have at most two solutions for the value of r, and
they are easily obtainable by the quadratic formula. Since they are based
on $T_1$ being the south pole, call them $r_{1,1}$ and $r_{1,2}$. We
still have to address the need for $T_2$ and $T_3$ to satisfy their
tangent condition, however. We do this in a completely symmetric fashion,
by finding two more support sphere systems, first by considering $T_2$ a
south pole, and then $T_3$. The quadratic equations in r that result are
derived by inspection by permuting the indices of the $P_i$ and $N_i$.

From $T_2$ as south pole we get two more candidates for the value of r,
$r_{2,1}$ and $r_{2,2}$, and from $T_3$ we get $r_{3,1}$ and $r_{3,2}$.
This would appear to call for the calculation of 24 inner products, but
it is easy to show that any two of these three quadratics have four inner
products in common; thus, there is a total of only 12 inner products.

We now have three pairs of candidate r values. Let us define a consensus
r to be any $r_{i,j}$ that satisfies

(2a)  ($r_{1,1}$ or $r_{1,2}$) = ($r_{2,1}$ or $r_{2,2}$) = ($r_{3,1}$ or $r_{3,2}$)

There may be zero, one, or two consensus r. They can be accumulated and
filtered in the usual Hough way using a one-dimensional parameter space.
We note several properties of this consensus algorithm.

The values of r are derived by independently solving three quadratics in
15 image observables (three for each point $P_i$, and two for each unit
normal $N_i$). The virtue of this method is that despite the
non-linearities that would result when variables are eliminated, the
method does not require any iterative root finding. Nor does it suffer
from the attendant problems of choosing starting values, guaranteeing
convergence, or tracking multiple roots.

Enlisting the aid of IBM's proprietary symbolic math system Scratchpad II
[Jenks et al., 1986], we were able to show that the transform always
produced one consensus value for r, corresponding to the true minor
radius of the torus, and that if a second consensus existed, it was
always easily distinguishable by its sign and/or magnitude from the true
consensus r, which is always the smallest positive radius of curvature.

Further, the analysis suggested several intriguing examples of what
should probably be called anti-tori. The most straightforward example is
what happens when two of the data points are on a meridian, and the third
is on the innermost parallel. The transform properly returns two
consensus values: one is r, and the other is $-1/(a - r)$, the value of
$k_2$ at the "innermost" parallel. What the transform "sees" is an
anti-torus whose thickness is equal to the torus' hole, and whose hole is
equal to the torus' thickness. That is, it interprets the data to be
lying on the negative image of an anti-torus, whose axis is perpendicular
to the given torus, with the torus and anti-torus interlocking: much like
confusing the impression in plaster of a face with the face itself.

3.  Computing  Orientation  (and  Major  Radius)

Following  the  principle  of  least  variability,  we  next  re¬ 
cover  orientation  by  examining  the  spine  points  produced  by 
our  candidate  r.  Orientation  is  a  naturally  bounded  quantity, 
and  is  easily  represented  as  a  point  on  the  surface  of  the  upper 
Gaussian  hemisphere  (mathematically,  on  r). 

The approach is straightforward: given a candidate r, we form the spine
points $S_i = P_i - rN_i$ from the pixels supporting r, and determine the
orientation of the plane on which they lie. The direction of the plane's
normal is easy to compute: simply normalize the following cross product
(or any of its variants obtained by permuting the $S_i$):

$$(S_1 - S_2) \times (S_3 - S_2)$$

However, this orientation is ill-conditioned when the spine points are
nearly collinear, which occurs when a is large. Using derivatives, it is
not hard to show that small changes of $S_2$ in a direction perpendicular
to this plane have an effect on the plane's orientation that is roughly
proportional to the major radius, a. This suggests that the computation
for orientation should be accompanied by the computation of a, so that
orientation can be "weighted" inversely by a somehow.

Rather than use a as a pure parameter space voting weight, we note the
following. The value of a eventually has to be recovered anyway, and if
it disappears into a weighted sum it is not recoverable. Instead, we
scale the unit orientation vector by dividing it by a. The parameter
space now becomes the interior of the upper Gaussian hemisphere, which is
three dimensional, and the values of a are recoverable simply by taking
the reciprocal of each vector's length. More importantly, orientations
accompanied by high a cluster near the origin, where they are easy to
detect and remove. Conversely, other orientations receive parameter space
representation proportional to their certainty, since they are
proportionately distant from the hemisphere center.

This parameter space has two other advantages. First, it uses the
interior of what would otherwise be a very inefficient use of
three-space, thus obviating the need to cleverly tessellate the surface
of the Gaussian sphere. Secondly, points that cluster at its origin can
immediately be considered evidence of cylindrical objects in the image
(which have no orientation and infinite a).

Since the orientation and a spaces are now combined, the computation of
the resulting parameter point (the scaled orientation vector) can also be
combined; it is not necessary to actually plot the spine circle or
calculate its center. By means of a three-space geometric construction
related to the planar "Law of Sines" for circles, it can be shown that
the scaled vector is directly given by the following (or by any of its
variants obtained by permuting the $S_i$):

where

$W_1 = [S_3 - S_2]$, $W_2 = [S_1 - S_3]$, and $W_3 = [S_1 - S_2]$.
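The printed expression for the scaled vector is missing from this copy, so the following Python sketch should not be read as the authors' formula. It uses the standard circumradius identity for a triangle with side vectors $W_1$, $W_2$, $W_3$, namely $a = |W_1||W_2||W_3| / (2\,|W_1 \times W_2|)$, from which the unit normal divided by a is $2\,(W_1 \times W_2) / (|W_1||W_2||W_3|)$. The function name and the sign flip onto the upper hemisphere are assumptions.

```python
import math


def _sub(x, y):
    return tuple(p - q for p, q in zip(x, y))


def _cross(x, y):
    return (x[1] * y[2] - x[2] * y[1],
            x[2] * y[0] - x[0] * y[2],
            x[0] * y[1] - x[1] * y[0])


def _norm(x):
    return math.sqrt(sum(p * p for p in x))


def scaled_orientation(S1, S2, S3):
    """Unit normal of the spine plane divided by the major radius a."""
    W1, W2, W3 = _sub(S3, S2), _sub(S1, S3), _sub(S1, S2)
    n = _cross(W1, W2)  # length is twice the triangle area
    side_product = _norm(W1) * _norm(W2) * _norm(W3)
    vec = tuple(2 * c / side_product for c in n)
    # Keep the vector in the upper hemisphere (assumed convention: z >= 0).
    return tuple(-c for c in vec) if vec[2] < 0 else vec
```

The major radius is then recovered as the reciprocal of the returned vector's length, and nearly collinear spine points (large a) produce vectors close to the origin, as the text describes.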

4.  Computing  Center 

What remains are the three most difficult parameters, the center
parameters, which are potentially unbounded and highly sensitive to the
value of a. Local tori with small curvature have distant centers which
are difficult to compute accurately. Since triangulation error of the
center is also roughly proportional to a, this final space computes and
accumulates vectors to the center point, scaled again by a. In effect,
this measures each center in units of major radius; thus relative error
is nearly constant. It is not hard to show that this space is now
bounded, since for local tori with large a, their center must be about a
units from the image origin, otherwise they would not fall within the
image. Thus, most tori have centers within one unit from the origin. The
upper limit of this space is determined by small tori at the image edge;
since a is bounded below by r, and r is bounded by physical
considerations, this limit is calculable directly.

We  note  that  although  many  tori  might  map  to  the  same 
scaled  vector,  they  have  already  been  classified  and  separated 
by  the  value  of  a  in  the  prior  parameter  space.  Thus,  the 
nesting  of  these  spaces  is  critical  both  for  adjustment  to  error, 
and  for  disambiguation  of  results. 

The computation is straightforward but a bit messy (however, it is again
invariant to the permutation of the $S_i$):

oS-^^hs^'^s^ 

s 
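Since the printed expression is not recoverable from this copy, the following Python sketch substitutes the textbook circumcenter formula for three points in three-space and then divides the center by the major radius, as the text describes. It is an illustration under that assumption, not the authors' expression, and the function name is hypothetical.

```python
import math


def _sub(x, y):
    return tuple(p - q for p, q in zip(x, y))


def _dot(x, y):
    return sum(p * q for p, q in zip(x, y))


def _cross(x, y):
    return (x[1] * y[2] - x[2] * y[1],
            x[2] * y[0] - x[0] * y[2],
            x[0] * y[1] - x[1] * y[0])


def scaled_center(S1, S2, S3):
    """Return (center / a, a) for the circle through three spine points."""
    u, v = _sub(S2, S1), _sub(S3, S1)
    w = _cross(u, v)
    ww = _dot(w, w)
    if ww == 0.0:
        raise ValueError("collinear spine points: no finite center")
    vw, wu = _cross(v, w), _cross(w, u)
    uu, vv = _dot(u, u), _dot(v, v)
    # Offset from S1 to the circumcenter of the triangle (S1, S2, S3).
    offset = tuple((uu * p + vv * q) / (2 * ww) for p, q in zip(vw, wu))
    center = tuple(s + o for s, o in zip(S1, offset))
    a = math.sqrt(_dot(offset, offset))  # major radius = circumradius
    return tuple(c / a for c in center), a
```

Dividing by a expresses the center in units of major radius, which is what keeps the accumulator bounded for large, nearly straight sections.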


Picking good triples is determined by enforcing a minimum and maximum
distance between image points $P_i$; we call this distance the "radius of
coherence" (ROC). If the ROC is too small, accuracy suffers due to small
triangulation baseline; too large, and most triples do not lie on the
same torus. For the minor radius space, it can be set in accordance with
the expected range of minor radii, which can be determined from the
imaging parameters. Empirically, we have found that a wide range between
minimum and maximum ROC works well for minor r. The two other spaces have
similar considerations; however, the range between minimum and maximum
ROC must be rather narrow or the system becomes overwhelmed with noise
hypotheses. Fortunately, experience shows that a narrow range of ROC
about r works very well for a large range of torus shapes.

Sanity  checks  are  inexpensive  checks  on  the  data  before  it 
is  used  in  a  parameter  transform.  Our  sanity  checks  are  of  the 
following  form.  In  computing  r,  no  points  in  a  planar  neigh¬ 
borhood  are  used;  they  are  easy  to  detect,  since  they  have  very 
small  and  equal  principal  curvatures.  Nor  are  points  near  a 
depth  discontinuity  used,  as  the  surface  approximations  be¬ 
come  inaccurate  there.  Further,  degenerate  quadratics  (with 
i4=0  or  with  imaginary  roots)  are  ignored  .  In  computing 
orientations  and  centers,  spine  points  within  a  pixel  of  each 
other  are  ignored.  These  checks  take  a  minute  percentage  of 
the  computation  times,  but  reduce  the  noise  in  parameter  space 
dramatically. 

In  practice,  the  scaling  of  orientation  vectors  by  Va  results 
in  vectors  too  tightly  clustered  around  the  origin.  This  is 
because  any  torus  large  enough  to  be  seen  in  the  image  as  a 
torus  will  have  a  major  radius  of  at  least  eight  pixels,  approxi¬ 
mately.  Thus,  all  the  activity  in  the  space  happens  in  a  hemi¬ 
sphere  of  radius  Vk;  this  uses  less  than  1%  of  the  space.  If  a 
lower  bound  on  a  is  known  (and  it  usually  can  be  approxi¬ 
mated),  then  the  scaling  function  should  be  of  the  form 
^Aa*c),  with  c  serving  to  shift  small  values  of  a  closer  to  the 
surface  of  the  hemisphere.  Heuristic  choices  of  constants 
based  on  expected  torus  and  image  sizes  can  be  selected;  one 
good  one  maps  the  smallest  torus  onto  the  surface  of  the 
hemisphere,  and  the  largest  torus  that  can  be  fully  seen 
(a=64,  assuming  a  range  image  of  size  256)  into  the  midpoint 
of  the  hemisphere;  the  scaling  function  becomes  54^0+48). 

6.  System  Description 


where 

b— Z^Z?2y  c^ZJjZ)^— Z^Z)^ 
and 

D2=‘W2  W2,  D=W^  Wy 

5.  System  Considerations 

In  practice,  the  above  method  relies  on  the  ability  to  pick 
"good"  triples  of  points  Pt.  Additionally,  the  performance  and 
behavior  of  the  method  can  be  enhanced  by  simple  "sanity 
checks"  on  computed  intermediate  results.  Lastly,  the  orienta¬ 
tion  space  can  improved  by  a  judicious  choice  of  "offset" 
for  the  scaled  orientation  vector.  We  handle  these  in  turn. 


We  briefly  survey  the  complete  system  of  which  this 
transform  is  a  part,  highlighting  those  aspects  most  germane 
to  our  results.  A  more  complete  description  can  be  found  in 
[Kjeldsen  et  al.,  1989]. 

Recognition is structured as a hierarchy of layered and
concurrent  parameter  transforms  [Ballard,  1981][Sabbah, 
1985].  Each  transform  examines  input  data  or  previously 
established  features  and  accumulates  evidence  for  new  feature 
hypotheses  in  an  associated  parameter  space.  Compatibility 
relations  accumulate  evidence  for  or  against  a  hypothesis  on 
the  basis  of  peer  hypotheses.  A  large  number  of  hypotheses 
are  typically  generated.  The  evidence  for  and  against  each  is 
integrated using an iterative refinement process in a dynamically
constructed constraint satisfaction network [Feldman &
Ballard, 1981].

Each parameter space is instantiated as a subnetwork
where  nodes  correspond  to  hypotheses.  The  links  in  the  net¬ 
work  are  (1)  bottom-up  connections,  representing  support 
from  input  data,  and  (2)  lateral  links  between  hypotheses.  The 
former  links  can  be  thought  of  as  votes  generated  by  the 
parameter  transforms  [Hough,  1962].  The  latter  links,  gener¬ 
ated  by  the  compatibility  relations,  can  be  inhibitory,  or  exci¬ 
tatory.  For  example,  surface  hypotheses  generated  from  the 
same  pixels  are  connected  by  an  inhibitory  link,  because  they 
represent  conflicting  interpretations  of  the  same  data. 

Evidence integration works as follows. Each node i computes
an activation level u_i representing a confidence level in
the corresponding feature. At each iterative step t, the activation
level of a node, denoted by u_i(t), is computed as

\[ u_i(0) = 0 \]

\[ u_i(t+1) = u_i(t) + l_i + \sum_j w_{ij}\, u_j(t) - D_i \]

where l_i represents bottom-up support for feature i, and the
summation embodies the collective inhibition and excitation
of conflicting and cooperating hypotheses j. The weight factor
w_ij is negative when hypothesis j conflicts with hypothesis i
and positive when hypothesis j is supporting i. D_i is a decay
term that suppresses spurious hypotheses with little support,
and helps ensure stability.
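
As a rough illustration of this update rule, the following Python sketch iterates u_i(t+1) = u_i(t) + l_i + sum_j w_ij u_j(t) - D_i on a tiny hand-built network. The weights, the bottom-up support values, the decay, the clipping of activations to [0, 1], and the survival threshold are all illustrative assumptions; the text gives no numeric values for any of them.

```python
import numpy as np

def refine(support, weights, decay, steps=20):
    """Iterate u(t+1) = u(t) + l + W u(t) - D, starting from u(0) = 0.

    Activations are clipped to [0, 1] after every step. The clipping is an
    assumption of this sketch, added only to keep the toy example bounded.
    """
    u = np.zeros_like(support, dtype=float)
    for _ in range(steps):
        u = u + support + weights @ u - decay
        u = np.clip(u, 0.0, 1.0)
    return u

# Three hypothetical hypotheses: 0 and 1 conflict, 2 supports 0.
support = np.array([0.30, 0.20, 0.10])            # l_i, bottom-up votes
weights = np.array([[0.0, -0.5, 0.2],
                    [-0.5, 0.0, 0.0],
                    [0.2, 0.0, 0.0]])             # w_ij, lateral links
decay = 0.05                                      # D_i, equal for all nodes

activation = refine(support, weights, decay)
survivors = np.flatnonzero(activation > 0.5)      # hypothetical threshold
```

In this toy network the weaker of the two conflicting hypotheses is driven to zero while the other two reinforce each other, which is the "stable coalition" behavior the text describes.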

A  unit  "survives"  iteration  when  it  has  sufficient  ui  and 
insignificant  inhibition.  It  is  then  passed  to  the  next  parameter 
transform  in  order  to  create  hypotheses  in  higher-level  spaces. 
Units  also  feed-back  to  their  component  features  in  lower- 
level  spaces  and  to  consistent  hypotheses  in  parallel  spaces. 
Thus,  surviving  features  form  stable  coalitions  which  repre¬ 
sent  globally  consistent  interpretations  of  the  scene. 

Additional  features  are  added  to  the  system  by  defining 
parameter  transforms  and  compatibility  relations  which  work 
within  a  well  defined  I/O  structure.  These  generally  make  use 
of  techniques  described  in  [Califano,  1988]  and  [Califano  et 
al., 1988]. Our first transform takes triples of data points and
returns  values  of  r  on  the  basis  of  the  consensus  in  ^  2a.  The 
second  transform  takes  triples  of  the  data  points  that  support  a 
surviving  r  hypothesis,  and  returns  the  parameters  of  the  scaled 
vector  of  section  3.  Finally  data  points  supporting  surviving 
orientation/major  radii  are  used  as  described  in  section  4.  The 
current  system  contains  21  parameter  transforms  for  various 
features. 

The lowest level of the system extracts local features such
as surface approximations or depth discontinuities from the data
for use by the parameter transforms. The local features important
to this work use bicubic interpolations to obtain the least
mean square error fit to dense depth data, then compute directly
from the polynomial coefficients of the approximation
both the surface gradient vector, and the directions and
amounts of principal surface curvature [Sabbah & Bolle,
1986]. Experiments comparing computed values with known
true values indicate that over a wide range of imagery and
circumstance the inaccuracy of the approximations is no more
than 5% from the ideal.
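
The surface-fitting step can be sketched as follows. This is a minimal illustration and not the authors' implementation: it fits a quadratic (rather than bicubic) polynomial to a small depth window by least squares, then reads the gradient and the principal curvatures off the coefficients using the standard formulas for a surface z = f(x, y). The function name, the window size, and the synthetic test surface are assumptions.

```python
import numpy as np

def patch_curvatures(z):
    """Least-squares quadratic fit to a square depth window z[y, x].

    Returns (gradient, (k1, k2)) at the window centre, where k1 >= k2 are
    the principal curvatures of the fitted surface.
    """
    half = z.shape[0] // 2
    y, x = np.mgrid[-half:half + 1, -half:half + 1]
    x, y = x.ravel(), y.ravel()
    # Model: z ~ c0 + c1 x + c2 y + c3 x^2 + c4 x y + c5 y^2
    design = np.column_stack(
        [np.ones_like(x), x, y, x * x, x * y, y * y]).astype(float)
    c, *_ = np.linalg.lstsq(design, z.ravel().astype(float), rcond=None)
    fx, fy = c[1], c[2]
    fxx, fxy, fyy = 2 * c[3], c[4], 2 * c[5]
    g = 1.0 + fx * fx + fy * fy
    gauss = (fxx * fyy - fxy * fxy) / g**2
    mean = ((1 + fy * fy) * fxx - 2 * fx * fy * fxy
            + (1 + fx * fx) * fyy) / (2 * g**1.5)
    root = np.sqrt(max(mean * mean - gauss, 0.0))
    return (fx, fy), (mean + root, mean - root)

# Synthetic 5x5 window: a paraboloid with curvatures 0.10 and 0.02 at the centre.
yy, xx = np.mgrid[-2:3, -2:3]
window = 0.5 * (0.10 * xx**2 + 0.02 * yy**2)
gradient, (k1, k2) = patch_curvatures(window)
```

A planar neighborhood would show up here as two principal curvatures that are both close to zero, which is the condition the sanity checks test for.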

7.  Results 

The  torus  extraction  transforms  have  been  run  on  about  a 
dozen  images.  We  will  present  two  of  the  more  interesting 
cases.  The  first  is  a  depth  map  generated  from  a  CSG  repre¬ 
sentation  of  a  padlock  (figure  3).  The  image  contains  a  single 
torus  segment,  as  well  as  several  other  surfaces.  The  second 
image  is  an  actual  range  image  of  a  knotted  length  of  cable 
(figure  4).  The  cable  forms  a  continuously  varying  tube  of 
constant  cross  section,  which  can  be  reasonably  approximated 
as  piecewise  toroidal. 

In  both  cases  we  used  only  a  small  percentage  of  the 
possible  triples  within  the  ROC;  the  percentage  was  large 
enough  to  give  reasonable  coverage  but  small  enough  to  give 
acceptable running times. Additionally, hypotheses receiving
votes  from  fewer  than  20  triples  were  not  instantiated.  Since 
many  noise  hypotheses  receive  just  a  few  votes,  this  pruning 
helped  cut  down  on  the  time  and  memory  needed  to  support 
them. 
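
The sampling and pruning just described can be sketched in Python as follows. The sketch is a brute-force illustration built on several assumptions: a triple is accepted when every pairwise distance lies between the minimum and maximum ROC, `hit_rate` is applied as an independent random draw per triple, and `transform` is a hypothetical callback that maps a triple to a quantized parameter bucket (or `None` when a sanity check fails). The toy transform, which votes for the radius of the circle through three 2-D points, merely stands in for the real parameter transforms.

```python
import itertools
import random
from collections import Counter

import numpy as np

def vote(points, transform, roc_min, roc_max, hit_rate, min_votes=20, seed=0):
    """Accumulate votes from a random subset of the triples within the ROC."""
    rng = random.Random(seed)
    votes = Counter()
    for triple in itertools.combinations(range(len(points)), 3):
        p = points[list(triple)]
        dists = [np.linalg.norm(p[a] - p[b]) for a, b in ((0, 1), (0, 2), (1, 2))]
        if not all(roc_min <= d <= roc_max for d in dists):
            continue                       # outside the radius of coherence
        if rng.random() > hit_rate:
            continue                       # use only a fraction of the triples
        bucket = transform(p)
        if bucket is not None:             # None marks a failed sanity check
            votes[bucket] += 1
    # Hypotheses with too few votes are never instantiated.
    return {b: n for b, n in votes.items() if n >= min_votes}

def circumradius_bucket(p, bucket=0.5):
    """Toy transform: quantized radius of the circle through three points."""
    a = np.linalg.norm(p[0] - p[1])
    b = np.linalg.norm(p[1] - p[2])
    c = np.linalg.norm(p[2] - p[0])
    cross = ((p[1, 0] - p[0, 0]) * (p[2, 1] - p[0, 1])
             - (p[2, 0] - p[0, 0]) * (p[1, 1] - p[0, 1]))
    if abs(cross) < 1e-9:
        return None                        # collinear points: degenerate
    radius = a * b * c / (2.0 * abs(cross))    # abc / (4 * triangle area)
    return round(radius / bucket) * bucket

# Forty points on a circle of radius 10 (hypothetical test data).
angles = np.linspace(0.0, 2.0 * np.pi, 40, endpoint=False)
pts = np.column_stack([10.0 * np.cos(angles), 10.0 * np.sin(angles)])
result = vote(pts, circumradius_bucket, roc_min=3.0, roc_max=8.0, hit_rate=1.0)
```

Lowering `hit_rate` trades coverage for running time, and raising `min_votes` discards the many weak noise hypotheses before they consume memory.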


Fig.  3:  Depth  map  of  lock.  Fig.  4:  Depth  map  of  knot. 

7.1.  Lock 

Figure  3  shows  a  64x64  plot  of  the  256x256  depth  map  of 
the  lock.  Surface  approximations  were  taken  using  a  5x5 
window  around  each  point.  After  the  sanity  checks,  only  the 
points highlighted in the dithered image in figure 5 were passed
to the minor radius parameter transform. ROC was 9 to 10,
and 50% of the possible triples were used. Ten hypotheses
received votes from more than 20 triples, and so took part in
iteration. After 11 iteration steps a single hypothesis of 16.5
for r survived, which corresponds closely to the apparent
radius of the hasp.

Figure 6 shows the points which supported the winning r.
To find the orientation/major radius, ROC was set to range
from 15 to 16 (approximately r) and 50% of the triples
formed from the spine points were used. 102 hypotheses
received sufficient support to be instantiated.
Fig. 7: Orientation hypotheses plotted in 3-space. Fig. 8: Pixels voting for orientation 1. Fig. 9: Pixels voting for orientation 2.


Figure  7  shows  the  hypotheses  receiving  votes  plotted  in 
3-space  (it  does  not  show  the  number  of  votes  each  received). 
You  can  make  out  the  hemispherical  outlines  of  the  parameter 
space,  as  well  as  a  distinct  cluster  of  hypotheses  in  the  -x  +y 
+z  quadrant.  Two  hypotheses  survived  after  iteration.  The 
points  supporting  them  are  shown  in  figures  8  and  9.  The  first 
captured  the  orientation  of  the  torus  forming  the  top  of  the  hasp 
as  closely  as  we  were  able  to  measure  it  ourselves.  The  major 
radius  was  captured  as  accurately  as  the  parameter  space 
resolution  (bucket  size)  would  allow  (actual  a=M,  computed 
a=37).  The  second  surviving  orientation  hypothesis  "found" 
a  very  large  diameter  torus  (a=523)  on  the  straight  segment  of 
the hasp. As we mentioned in section 3, it is possible to
eliminate hypotheses of very large radius, that is, near-cylinders,
by discarding those very close to the origin of the parameter
space. Since the length of the parameter vector of the torus
on  top  of  the  hasp  is  .65,  and  the  length  of  the  parameter  vector 
for  the  torus  found  on  the  cylinder  was  .1,  the  cutoff  could 
easily  be  adjusted  to  eliminate  such  misinterpretations. 
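
The two vector lengths quoted above are consistent with a scaling function of the form k/(a+c) with k=56 and c=48, which is the choice that maps a=8 to length 1 and a=64 to length 1/2. The short Python check below assumes exactly those constants; the cutoff value of 0.2 is a hypothetical choice made only for illustration.

```python
def scaled_length(a, k=56.0, c=48.0):
    """Length of the orientation parameter vector for major radius a.

    k and c are assumed constants chosen so that a=8 gives 1 and a=64 gives 1/2.
    """
    return k / (a + c)

CUTOFF = 0.2   # hypothetical: vectors shorter than this are treated as near-cylinders

for a in (8, 37, 64, 523):
    length = scaled_length(a)
    verdict = "kept" if length >= CUTOFF else "discarded"
    print(a, round(length, 3), verdict)
```

Under these assumptions the torus on top of the hasp (a=37) has a length of about 0.66 and the spurious torus on the straight segment (a=523) a length of about 0.10, close to the .65 and .1 reported, so any cutoff between the two separates them.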

Both orientation hypotheses now voted into location space.
The  same  ROC  and  hit  rate  were  used.  The  cylindrical  seg¬ 
ment  was  not  able  to  find  any  consistent  location,  and  therefore 
created  no  hypotheses  strong  enough  to  be  considered.  The 
actual  torus  created  25  hypotheses.  Only  one  survived  itera¬ 
tion.  The  surviving  center  was  (30  67.5  -30),  with  the  actual 
center  at  approximately  (31  66  -32).  Thus,  the  system  found 
the  results  to  within  one  bucket  of  available  resolution. 

These runs were done on a Symbolics 3650. With the
parameters set as described the entire recognition took roughly
70 minutes. Computing the surface approximations took 10
minutes. Generating the votes for minor r space took 30
minutes. Generating votes for orientation space took 25 minutes.
Generating votes for location space took 3 minutes. In
all, 30 iterations were needed to prune down the hypotheses.
Total time for all iteration was roughly 3 minutes, including a
rather elaborate trace of system status.

7.2. Knot

The second image is a range image of a knot of coax cable
(figure 4) taken using a laser triangulation range finder
[Technical Arts, 1986]. ROC was the range 4 to 5, hit rate was 50%.
Figure 10 shows the points supporting the winning (and correct,
as far as we can measure) minor radius hypothesis.

These points were passed through the orientation/major radius
parameter transform. Despite the lack of distinct peaks in the
histogram of votes (figure 11), the iterative refinement was
able to find three distinct clusters, representing competing
hypotheses of orientation from different areas of the knot. The
strongest hypothesis in each cluster survived iteration. The
points supporting them are shown in figures 13, 14 and 15.
The light coverage is due to the low hit ratios (10%) we had
to use. The low hit rate also appears responsible for some
regions of the knot not being covered by any hypotheses.

Figure 12 shows a histogram of the votes the three orientation
hypotheses generated in location space. One peak was
generated by the segment in fig. 15, and the other peak is an
overlap of the votes from both fig. 13 and fig. 14. The two
locations which survived iteration correspond well to the
center points we expected from those torus segments.

Running times were somewhat better than those for the
lock test case, except for the time to generate votes for
orientation hypotheses. Here the continuously varying curve
created a very large number of noise hypotheses, and very wide
peaks in the histogram of votes. With the 10% hit rate used
here, roughly 4000 hypotheses were created. Linking them
took well over an hour. With somewhat higher hit rates, 21000
hypotheses were created. Many hours were needed to create
and link them.

Fig. 10: Pixels supporting winning minor radius.

Fig. 11: Knot orientation votes / 2D projection.

Fig. 12: Pixels supporting winning location.

Fig. 13: Pixels supporting orientation hypothesis 1.

Fig. 14: Pixels supporting orientation hypothesis 2.

Fig. 15: Pixels supporting orientation hypothesis 3.


[Califano et al., 1988] A. Califano, R.M. Bolle, and R.W.
Taylor,  “Generalized  neighborhoods:  A  new  approach  to 
complex  feature  extraction,”  IEEE  Conf.  on  Comp.  Vision 
and  Pattern  Recognition,  Nov.  1988. 


7.3. Discussion

These experiments demonstrate the ability to break a continuously
varying curve into piecewise toroidal segments, and to
deal with non-toroidal surfaces and cylindrical segments.
We hope that speeding the implementation and a future port
to faster hardware will improve the coverage, but these results
are very promising.

System parameters must currently be adjusted for each
image.  An  initial  setting  generally  gives  either  impossibly 
long  running  times,  due  to  the  number  of  hypotheses  created, 
or  poor  coverage  of  points  supporting  the  winning  hypotheses. 
The  latter  makes  recognition  in  subsequent  parameter  spaces 
difficult.  ROC  and  hit  rate  can  generally  be  adjusted  for  good 
coverage  in  reasonable  time,  but  it  takes  several  attempts.  A 
faster  implementation  would  allow  us  to  use  all  triples  and  a 
wider  ROC,  especially  in  minor  radius  space.  We  believe  this 
would  almost  eliminate  the  need  to  tune  system  parameters. 
Abstract

All  surfaces  encountered  in  practice  appear  rough  at 
some  level  of  detail.  Existing  shape  recovery  methods  are 
effective  when  applied  to  smooth  diffuse  surfaces  but  pro¬ 
duce  sparse  shape  information  when  applied  to  rough  sur¬ 
faces.  Images  of  rough  surfaces  are  characterized  by  high 
frequency  intensity  variations,  and  it  is  difficult  to  per¬ 
ceive  the  shapes  of  these  surfaces  from  their  images.  The 
shape-from-focus  method  described  in  this  paper  uses  dif¬ 
ferent  focus  levels  to  obtain  a  sequence  of  object  images. 
The  sum-modified-Laplacian  (SML)  operator  is  developed 
to  compute  local  measures  of  the  quality  of  image  focus. 
The  SML  operator  is  applied  to  the  image  sequence,  and 
the  focus  measures  obtained  at  each  image  point  are  used 
to  compute  local  depth  estimates.  We  present  two  algo¬ 
rithms  for  depth  estimation.  The  first  algorithm  simply 
looks  for  the  focus  level  that  maximizes  the  focus  measure 
at  each  image  point.  The  second  algorithm  uses  a  model 
to  interpolate  the  focus  measures  to  obtain  more  accurate 
depth  estimates.  The  algorithms  were  implemented  and 
tested  on  surfaces  of  different  roughness  and  reflectance 
properties. We conclude with a description of a fully automated
shape-from-focus system that has been applied to
industrial  samples. 


1  Introduction 

The  advancement  of  three-dimensional  machine  vision  is 
largely  dependent  on  the  development  of  efficient  and  reli¬ 
able  shape  extraction  methods.  Shape  extraction,  in  turn, 
requires  a  clear  understanding  of  surface  reflectance  and 
image  formation.  Several  extraction  methods,  for  diffuse 
and  specular  surfaces,  have  been  developed  in  the  past. 
The  extraction  problem  associated  with  rough  surfaces, 
however, has not received sufficient attention. All surfaces
encountered in practice are rough at some level of detail. At
that  level,  they  exhibit  high-frequency  spatial  surface  vari¬ 
ations  that  are  often  random  in  nature.  In  many  vision 
applications,  the  spatial  surface  variations  are  compara¬ 
ble  in  dimensions  to  the  resolution  of  the  imaging  system. 
Image  intensities  produced  by  such  surfaces  vary  in  an  un¬ 
predictable  manner  from  one  sensor  element  (pixel)  to  the 
next. Hence, it is difficult to obtain dense and accurate surface
shape information by using existing passive or active
sensing  techniques  such  as  stereo,  shape  from  shading,  and 
structured  light.  Therefore,  a  practical  and  reliable  solu¬ 
tion  to  this  rather  difficult  extraction  problem  is  desirable. 
In  this  paper,  we  develop  a  shape  extraction  technique  that 
uses  focus  analysis  to  recover  dense  depth  maps  of  rough 
textured  surfaces. 

1.1  Background 

Previously,  focus  analysis  has  been  used  to  automatically 
focus  imaging  systems  or  to  obtain  sparse  depth  informa¬ 
tion  from  the  observed  scene.  Horn  [1]  proposed  focus¬ 
ing  imaging  systems  by  using  the  Fourier  transform  and 
analyzing  the  frequency  spectrum  of  the  image.  Tenen- 
baum  [2]  developed  the  gradient  magnitude  maximization 
method  that  uses  the  sharpness  of  edges  to  optimize  fo¬ 
cus  quality.  A  modification  to  this  approach  was  later 
proposed  by  Jarvis  [3].  He  formulated  the  sum-modulus- 
difference  as  the  sum  of  the  first  intensity  differences  be¬ 
tween  neighboring  pixels  along  a  scan-line  and  used  it  as 
a measure of focus quality. Several automatic focusing
algorithms were implemented and tested by Schlag et al. [4].

More  recently,  Krotkov  [5][6]  evaluated  and  compared 
the  performance  of  different  focus  criterion  functions. 
Krotkov  also  proposed  a  method  to  estimate  the  depth 
of  an  image  area.  Pentland  [7]  suggested  estimating  the 
depth  of  image  points  by  evaluating  image  blur  due  to  de- 
focusing.  A  similar  approach  was  applied  to  edge  points 
by  Grossmann  [8].  Darrell  and  Wohn  [9]  have  developed  a 
depth  from  focus  method  that  obtains  an  image  sequence 
of  a  scene  by  varying  the  focus  level,  and  uses  Laplacian 
and  Gaussian  pyramids  to  calculate  depth.  Subbarao  [10] 
suggests the change of intrinsic camera parameters to recover
the depth map of a scene. Ohta et al. [11] and
Kaneda et al. [12] have used images corresponding to different
focus levels to obtain a single image of high focus
quality. 

1.2 Shape from Focus

In  this  paper,  we  develop  a  shape-from-focus  method.  In 
contrast  to  previous  work  in  this  area,  we  avoid  the  follow¬ 
ing  approaches. 




•  We  do  not  attempt  to  estimate  depth  from  a  pair  of 
images  by  evaluating  local  estimates  of  the  blurring 
function.  The  accuracy  of  such  a  method  is  greatly 
dependent  on  the  blurring  model  used.  The  mod¬ 
els  used  thus  far  are  only  approximations  to  the  ac¬ 
tual  physical-optics  model  and  therefore  do  not  ensure 
high  quality  results. 

•  We  do  not  apply  our  method  to  general  scenes.  Depth 
estimation  based  on  focus  analysis  relies  on  the  pres¬ 
ence  of  high  frequency  brightness  variation  in  the 
scene.  General  scenes  often  have  areas  with  little  or 
no  brightness  variation.  For  this  reason,  experiments 
in  the  past  have  only  produced  sparse  depth  informa¬ 
tion. 

Here,  we  restrict  ourselves  to  visibly  rough  surfaces  that 
produce  textured  images  with  high  frequency  intensity 
variations.  We  review  the  image  formation  process  and 
show  that  a  defocused  imaging  system  plays  the  role  of  a 
low-pass  filter.  The  shape-from-focus  method  moves  the 
unknown  object  with  respect  to  the  imaging  system  and 
obtains  a  sequence  of  images  that  correspond  to  different 
levels  of  object  focus.  The  sum-modified-Laplacian  (SML) 
focus  operator  is  developed  to  measure  the  relative  degree 
of  focus  between  images.  The  operator  is  applied  to  the 
image  sequence  to  obtain  a  set  of  focus  measures  at  each 
image  point.  A  model  is  used  to  describe  focus  measure 
variations  due  to  defocusing.  This  model  is  used  to  inter¬ 
polate  between  a  finite  number  of  focus  measures  to  obtain 
accurate  depth  estimates.  Experimental  results  indicate 
that  the  method  is  capable  of  extracting  dense  and  accu¬ 
rate  shape  information  from  a  few  images  of  the  object. 
The  results  demonstrate  appreciable  invariance  to  surface 
roughness  and  reflectance. 
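
To make the pipeline concrete, here is a minimal NumPy sketch of the coarse version of the method: compute a focus measure for every image in the sequence and pick, at each pixel, the stage position that maximizes it. The focus measure below is a sum of absolute second differences over a small window, written in the spirit of a sum-modified-Laplacian; its exact form, the window size, and all function names are assumptions of this sketch rather than the operator as the paper defines it.

```python
import numpy as np

def modified_laplacian(img):
    """|second difference in x| + |second difference in y|, unit step."""
    p = np.pad(img.astype(float), 1, mode="edge")
    centre = p[1:-1, 1:-1]
    dxx = np.abs(2 * centre - p[1:-1, :-2] - p[1:-1, 2:])
    dyy = np.abs(2 * centre - p[:-2, 1:-1] - p[2:, 1:-1])
    return dxx + dyy

def focus_measure(img, half_window=1):
    """Sum the modified Laplacian over a (2N+1) x (2N+1) neighbourhood."""
    ml = modified_laplacian(img)
    n = half_window
    p = np.pad(ml, n, mode="constant")
    out = np.zeros_like(ml)
    for dy in range(2 * n + 1):
        for dx in range(2 * n + 1):
            out += p[dy:dy + ml.shape[0], dx:dx + ml.shape[1]]
    return out

def coarse_depth(stack, positions):
    """Per pixel, return the stage position with the largest focus measure."""
    measures = np.stack([focus_measure(im) for im in stack])
    return np.asarray(positions)[np.argmax(measures, axis=0)]
```

The depth resolution of this coarse estimate is limited to the spacing between stage positions, which is why interpolating between neighboring focus measures is needed for finer estimates.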

1.3  An  Automated  System 

We  have  implemented  a  fully  automated  shape-from-focus 
system.  The  system  is  applicable  to  objects  that  are  a 
few  hundred  microns  in  size.  An  optical  microscope  is 
used  to  image  the  object.  The  translation  stage  of  the 
microscope  is  motorized  to  enable  the  controlled  movement 
of  the  unknown  object  through  the  plane  of  focus  of  the 
imaging system. The stage is translated in increments and
for  each  new  position  an  image  of  the  object  is  obtained. 
The  image  sequence  is  processed  on  a  workstation.  Two 
results  are  produced  by  the  recovery  algorithm.  The  first 
is  a  focused  image  of  the  object  that  is  reconstructed  from 
the  sequence  of  partially  focused  images.  The  second  is  a 
depth  map  of  the  object  surface. 

The  automated  system  has  been  applied  to  a  variety 
of surfaces. Our initial results were obtained using industrial
samples such as via-hole fillings on ceramic circuit
boards. We are currently using the system to recover the
shapes and focused images of a variety of biological samples,
including micro-organisms and chromosomes.


2  Visibly  Rough  Surfaces 

In  the  study  of  reflection,  a  rough  surface  is  defined  as  one 
whose  smallest  spatial  variations  have  dimensions  that  are 
much  larger  than  the  wavelength  of  the  incident  electro¬ 
magnetic  wave.  This  is  the  concept  of  optical  roughness. 
In  this  paper,  we  introduce  the  notion  of  visible  rough¬ 
ness;  a  surface  is  visibly  rough  if  the  dimensions  of  its 
spatial  variations  are  comparable  to  the  viewing  area  of 
individual elements (e.g. pixels) of the sensor (e.g. camera)
used to observe the surface. The surface shown in
Fig. 1 is comprised of a large number of facets*. While the
surface appears to have a smoothly varying global shape,
z(x, y), the orientation of individual facets may deviate
considerably from the mean surface orientation in the
facet  vicinity.  Although  facet  orientations  are  dependent 
on  the  global  shape  of  the  surface  and  on  the  orientations 
of  neighboring  facets,  they  often  exhibit  some  degree  of 
randomness. 


Figure  1:  Surface  roughness  and  sensor  resolution. 

Now  let  us  consider  the  image  of  a  rough  surface  ob¬ 
tained  using  a  finite  resolution  sensor.  The  number  of 
facets  that  contribute  to  the  image  irradiance  at  a  pixel 
location  depends  on  the  magnification  of  the  optical  sys¬ 
tem  used  to  project  the  surface  onto  the  image  plane  of  the 
sensor. We define two levels of magnification: multi-facet
level and facet level. At the multi-facet level, the pixel
width w is very large compared to the facet size w_f (e.g.
w = w_1). In this case, the surface patch projected on a
pixel  may  be  modeled  by  assigning  it  a  mean  orientation 
value  and  a  roughness  which  is  determined  by  the  prob¬ 
ability  distribution  of  its  facet  orientations  [13]  [14].  The 
pixel  intensities  are  continuous  functions  of  the  angle  of  in¬ 
cident  light  and  can  be  expressed  as  a  linear  combination 
of  the  diffuse  lobe  and  specular  lobe  components  [14],  the 
relative  strengths  of  the  two  components  depending  on  the 
reflectance  properties  of  the  facets. 

For facet level magnification, on the other hand, the
pixel width w is comparable to the facet width w_f (e.g.
w = w_2), and only one or a few facets are viewed by each pixel.
As a result of the randomness in facet orientations, image
intensity values are expected to vary drastically and unpredictably
from one pixel to the next. This is true for both
specular as well as diffuse facets as the radiance of both
is dependent on the angle of incident light. Therefore, at
the facet level, a surface produces images that are rich in
texture* and we say that the surface is visibly rough. Why
then do we use facet level measurements when multi-facet
level measurements will provide us with image intensities
that can perhaps be used to recover shape information? In
many practical situations, the desirable resolution of shape
information is unobtainable at the multi-facet level.

* No assumptions are made regarding the size of the facets.
Hence, these facets may or may not represent the micro-facets
defined in [13], [14].


3  Focused  and  Defocused  Images 

In  this  section,  we  briefly  review  the  image  formation  pro¬ 
cess  and  describe  defocused  images  as  processed  versions 
of  focused  images.  Fig.2  shows  the  basic  image  formation 
geometry.  All  light  rays  that  are  radiated  by  the  object 
point  P  and  intercepted  by  the  lens  are  refracted  by  the 
lens  to  converge  at  the  point  Q  on  the  image  plane.  The 
relationship between the object distance o, focal distance
of the lens f, and the image distance i, is given by the
Gaussian lens law:

\[ \frac{1}{o} + \frac{1}{i} = \frac{1}{f} \qquad (1) \]

Figure 2: Formation of focused and defocused images.
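
As a quick numerical check of the Gaussian lens law, 1/o + 1/i = 1/f, the following Python helper solves it for the image distance. The focal distance and object distance used are arbitrary illustrative values in any consistent length unit, not figures from the system described here.

```python
def image_distance(o, f):
    """Solve 1/o + 1/i = 1/f for the image distance i (requires o > f)."""
    if o <= f:
        raise ValueError("object must lie beyond the focal distance")
    return o * f / (o - f)

# Hypothetical values, for example in millimetres.
i = image_distance(o=200.0, f=50.0)
```

Moving the object closer to the lens (smaller o) pushes the image plane farther back (larger i), which is why a fixed sensor sees a given surface point in focus at only one object distance.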

*There are many notions of what is meant by the term texture.
Here, we define texture as a noticeable fluctuation in
the intensities of neighboring image pixels [18]. The textures
produced by rough surfaces may be periodic, nearly periodic,
or random. No assumptions are made regarding the type of
texture.

Each point on the object plane is projected onto a single
point on the image plane, thus causing a clear or focused
image I_f(x,y) to be formed on the image plane. If, however,
the sensor plane does not coincide with the image
plane and is displaced from it by a distance δ, the energy
received from the object by the lens is distributed over a
circular* patch on the sensor plane. Fig. 2 may be used
to establish the relationship between the radius r of the
circular patch and the sensor displacement δ. From Fig. 2
we find that:

\[ r = \frac{\delta R}{i} \qquad (2) \]

where  R  is  the  radius  of  the  lens.  It  is  also  possible  to 
convince  oneself  that  the  radius  r  of  the  circular  patch 
is  independent  of  P's  location  on  the  object  plane.  The 
distribution  of  light  energy  over  the  circular  patch,  or  the 
blurring  function,  can  be  modeled  using  physical  optics 
[17].  Very  often,  a  two-dimensional  Gaussian  function  is 
used to approximate the physical model [7]. Then, the
blurred or defocused image I_d(x,y) formed on the sensor
plane can be described as the result of convolving the focused
image I_f(x,y) with the blurring function h(x,y):

\[ I_d(x,y) = h(x,y) * I_f(x,y) \qquad (3) \]

where:

\[ h(x,y) = \frac{1}{2\pi\sigma_h^2} \exp\left(-\frac{x^2+y^2}{2\sigma_h^2}\right) \qquad (4) \]

where σ_h, the spread parameter, is assumed to be proportional
to the radius r [7]. The constant of proportionality
is dependent on the optics, sampling, etc. We will see
shortly that the value of this constant is not important in
our approach. Note that defocusing is observed for both
positive and negative sensor displacements.
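
The defocus model can be sketched numerically: compute the blur-circle radius from the sensor displacement, turn it into a Gaussian spread, and convolve. In this Python sketch the relation r = R|δ|/i, the truncation of the kernel at three spreads, and all numeric values are assumptions; the convolution is a plain separable NumPy implementation so that the example stays self-contained.

```python
import numpy as np

def blur_radius(delta, lens_radius, image_dist):
    """Blur-circle radius r = R * |delta| / i (similar triangles)."""
    return lens_radius * abs(delta) / image_dist

def gaussian_kernel_1d(sigma):
    """Sampled, normalised 1-D Gaussian truncated at 3 sigma."""
    half = max(1, int(np.ceil(3 * sigma)))
    x = np.arange(-half, half + 1)
    k = np.exp(-x**2 / (2 * sigma**2))
    return k / k.sum()

def defocus(image, sigma):
    """Convolve with a separable 2-D Gaussian of spread sigma.

    The image must be larger than the kernel in both directions; borders
    are zero-padded by np.convolve in 'same' mode.
    """
    k = gaussian_kernel_1d(sigma)
    rows = np.apply_along_axis(np.convolve, 1, image, k, mode="same")
    return np.apply_along_axis(np.convolve, 0, rows, k, mode="same")
```

Because the radius depends only on the magnitude of the displacement, a sensor placed in front of the image plane and one placed the same distance behind it produce the same blur in this model.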

Now let us analyze the defocusing process in the frequency
domain (u,v). If I_F(u,v), H(u,v), and I_D(u,v)
are the Fourier transforms of I_f(x,y), h(x,y), and
I_d(x,y), respectively, we can express eq. 3 as:

\[ I_D(u,v) = H(u,v) \cdot I_F(u,v) \qquad (5) \]

where:

\[ H(u,v) = \exp\left(-\frac{(u^2+v^2)\,\sigma_h^2}{2}\right) \qquad (6) \]

We see that H(u,v) allows low frequencies to pass while
it attenuates the high frequencies in the focused image.
Furthermore, as the sensor displacement δ increases, the
defocusing radius r increases, and the spread parameter σ_h
increases. Hence defocusing is a low-pass filtering process
where the bandwidth decreases with increase in defocusing.
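
The low-pass behavior is easy to verify numerically. The sketch below evaluates H(u,v) = exp(-(u²+v²)σ_h²/2) on the discrete frequency grid of an image, applies it in the Fourier domain, and measures the spectral energy above a cut-off frequency. The Gaussian form of H, the cut-off, and the random test image are assumptions of this illustration.

```python
import numpy as np

def frequency_grid(shape):
    """Angular frequencies (radians per pixel) matching np.fft.fft2."""
    v = 2 * np.pi * np.fft.fftfreq(shape[0])[:, None]
    u = 2 * np.pi * np.fft.fftfreq(shape[1])[None, :]
    return u, v

def transfer_function(shape, sigma):
    """H(u, v) = exp(-(u^2 + v^2) sigma^2 / 2)."""
    u, v = frequency_grid(shape)
    return np.exp(-(u**2 + v**2) * sigma**2 / 2)

def high_frequency_energy(image, sigma, cutoff=np.pi / 4):
    """Spectral energy above `cutoff` after defocusing with spread sigma."""
    spectrum = np.fft.fft2(image) * transfer_function(image.shape, sigma)
    u, v = frequency_grid(image.shape)
    mask = np.hypot(u, v) > cutoff
    return float(np.sum(np.abs(spectrum[mask]) ** 2))

rng = np.random.default_rng(0)
texture = rng.random((64, 64))        # stand-in for a visibly rough surface
energies = [high_frequency_energy(texture, s) for s in (0.0, 1.0, 2.0, 4.0)]
```

The high-frequency energy falls monotonically as the spread grows, which is the property a focus measure exploits: the best-focused image of a textured patch is the one that retains the most high-frequency content.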

From  Fig.2,  it  is  seen  that  a  defocused  image  of  the 
object  can  be  obtained  in  three  ways:  by  displacing  the 
sensor  with  respect  to  the  image  plane,  by  moving  the 
lens,  or  by  moving  the  object  with  respect  to  the  object 
plane.  Moving  the  lens  or  sensor  plane  with  respect  to  one 
another  causes  the  following  problems: 

• The magnification of the system varies, thereby causing
the image coordinates of focused points on the
object to change.

• The area on the sensor plane over which light energy
is distributed varies, thereby causing a variation in
image brightness.

*The shape of the patch also depends on the shape of the
aperture of the imaging system. We are assuming the aperture
to be circular.

These effects are described in detail by Willson and
Shafer [20]. In order to overcome these problems, we propose
to vary the degree of focus by moving the object*
with respect to a fixed configuration of the optical system
and sensor (Fig. 3). This approach ensures that as the
object passes through the plane S, surface points that lie
on S are perfectly focused onto the image plane with the
same magnification. In other words, as the object moves,
the magnification of the imaging system can be assumed to be
constant for the image areas that are perfectly focused.


Figure  3:  Effect  of  object  displacement  on  magnification. 


However,  from  Fig.  3  we  see  that  points  that  lie  out¬ 
side  of  the  plane  S  will  be  projected  onto  the  image  plane 
with  different  magnifications.  In  fact,  the  magnification  of 
the  defocused  object  points  will  depend  on  their  distance 
from the plane S. Note that for small displacements Δd,
magnification  may  be  assumed  to  be  constant.  We  will 
use  this  assumption  while  developing  the  depth  estimation 
algorithm. 

4  Shape  from  Focus:  An  Overview 

The  shape-from-focus  method  is  based  on  the  observations 
made  in  the  previous  sections. 

•  At  facet  level  magnification,  rough  surfaces  produce 
images  that  are  rich  in  texture. 

•  A  defocused  optical  system  plays  the  role  of  a  low- 
pass  filter. 

Fig. 4 shows a rough surface of unknown shape placed
on a translational stage. The reference plane shown corresponds
to the initial position of the stage. The configuration
of the optics and sensor defines a single plane, the
"focused plane,"* that is perfectly focused onto the sensor
plane. The distance d_f between the focused and reference
planes, and the displacement d of the stage with respect
to the reference plane, are always known by measurement.
Consider the surface element, s, that lies on the unknown
surface, S. If the stage is moved towards the focused plane,
the image of s will gradually increase in its degree of focus
(high frequency content) and will be perfectly focused
when s lies on the focused plane. Further movement of the
element s will again increase the defocusing of its image.
If we observe the image area corresponding to s and record
the stage displacement d = d̄ at the instant of maximum
focus, we can compute the height d_s of s with respect to
the stage as d_s = d_f - d̄. In fact, we can use d̄ to determine
the distance of s from the focused plane, sensor
plane, or any other coordinate system defined with respect
to the imaging system. This procedure may be applied
independently to all surface elements to obtain the shape of
the entire surface S.

*Object movement is easily realized in industrial inspection
applications.

*The focused plane is the same as the object plane defined
in the previous section. A different term is introduced here as
the object does not necessarily lie on the focused plane.


[Figure 4; recovered labels: sensor plane, optics]


To  automatically  detect  the  instant  of  "best”  focus,  we 
will  develop  an  image  focus  measure.  In  the  above  discus¬ 
sion,  the  stage  motion  and  image  acquisition  were  assumed 
to  be  continuous  processes.  In  practice,  however,  it  is  not 
feasible  to  acquire  and  process  such  a  large  number  of  im¬ 
ages  in  a  reasonable  amount  of  time.  Therefore,  we  obtain 
only a finite number of images; the stage is moved in
increments of Δd, and an image is obtained at each stage
position (d = n.Δd). By studying the behavior of the
focus measure, we develop an interpolation method that
uses a small number of focus measures to compute accurate
depth estimates. An important feature of the method
is its local nature; the depth estimate at an image point is
computed  only  from  focus  measures  recorded  at  that  point. 
Consequently,  the  method  can  adapt  well  to  variations  in 
texture  type  and  content  over  the  object  surface. 


5  A Focus Measure Operator

To measure the quality of focus in a small image area, we
develop a focus measure operator. The operator must respond
to high frequency variations in image intensity and,
ideally, must produce maximum response when the image
area is perfectly focused. The high frequency content of an
image area can be determined by using the Fourier transform.
However, since Fourier transforms are expensive to
compute and analyze without special purpose hardware,
we seek an alternative method.

A few focus measure operators have been proposed and
used in the past [5]. Generally, the objective has been
to find an operator that behaves in a stable and robust
manner over a variety of images such as images of indoor
and outdoor scenes. Such an approach is essential while
developing automatically focusing imaging systems that
have to deal with general scenes. Bearing in mind that we
are dealing with textured images, we develop an operator
that is particularly well-suited to such images. In the next
section we will evaluate the performance of our operator.

An interesting observation can be made regarding the
application of focus measure operators. Eq. 3 relates a
defocused image to a focused image using the blurring
function. Assume that a focus measure operator o(x,y) is
applied (by convolution) to the defocused image I_d(x,y).
The result is a new image r(x,y) that may be expressed
as:

r(x,y) = o(x,y) * (h(x,y) * I_f(x,y))    (7)

Since convolution is commutative and associative, we can
rewrite the above expression as:

r(x,y) = h(x,y) * (o(x,y) * I_f(x,y))    (8)

Therefore, applying a focus measure operator to a defocused
image is equivalent to defocusing a new image obtained
by convolving the focused image with the focus measure
operator. The focus measure operator only selects the
frequencies in the focused image that will be attenuated
due to defocusing. Since defocusing is a low-pass filtering
process, its effects on the image are more pronounced
and detectable if the image has strong high frequency content.
An effective focus measure operator, therefore, must
high-pass filter the image.

One way to high-pass filter an image is to determine its
second derivative. For two-dimensional images, the Laplacian
may be used:

\nabla^2 I = \frac{\partial^2 I}{\partial x^2} + \frac{\partial^2 I}{\partial y^2}    (9)

where I(x,y) is the image intensity at the point (x,y). In the
frequency domain, applying the Laplacian L(u,v) to the
defocused image I_D(u,v) (eq. 5) gives:

L(u,v) . H(u,v) . I_F(u,v)    (10)

where:

L(u,v) . H(u,v) = -(u^2 + v^2) \exp\left( -\frac{u^2 + v^2}{2} \sigma_h^2 \right)    (11)

Fig. 5 shows the frequency distribution of |L.H| as a
function of the defocusing parameter σ_h. For any given
frequency (u,v), |L.H| varies as a Gaussian function
of the defocusing parameter σ_h. In general, however, the
result would depend on the frequency distribution of the
imaged scene. Though our texture is random, it may be assumed
to have a set of dominant frequencies. Then, loosely
speaking, each frequency is attenuated by a Gaussian function
in σ_h and its width is determined by the frequency.
Therefore, the result of applying the Laplacian operator
may be expressed as a sum of Gaussian functions in σ_h.
The result is expected to be maximum when σ_h = 0, i.e.
when the image is perfectly focused. Since the frequency
distribution of the texture is random, the widths of the
Gaussian functions are also random. Using the central limit
theorem, the result of applying the Laplacian operator to
an image point may be assumed to be a Gaussian function
of the defocusing parameter σ_h.
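
The claim that each frequency contributes a Gaussian in σ_h whose width is set by the frequency can be illustrated with a short sketch of the magnitude of eq. 11. The helper names below are hypothetical, and the half-width formula is simply the solution of exp(-(u²+v²)σ_h²/2) = 1/2.

```python
import math

def laplacian_times_defocus(u, v, sigma_h):
    """Magnitude |L(u,v).H(u,v)| of eq. 11 for one frequency (u, v)."""
    w2 = u * u + v * v
    return w2 * math.exp(-0.5 * w2 * sigma_h ** 2)

def half_width(u, v):
    """sigma_h at which the response drops to half of its sigma_h = 0 value."""
    return math.sqrt(2.0 * math.log(2.0) / (u * u + v * v))

# Illustrative frequencies (assumed values): higher frequencies give a
# larger response at perfect focus and a narrower Gaussian in sigma_h.
for u in (1.0, 2.0, 4.0):
    peak = laplacian_times_defocus(u, 0.0, 0.0)
    print(f"u={u:3.1f}  peak={peak:5.1f}  half-width={half_width(u, 0.0):.3f}")
```

The response is largest at σ_h = 0 for every frequency, so a sum over the dominant frequencies of a texture also peaks at perfect focus.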


Figure 5: The effect of defocusing and second-order
differentiation in frequency domain.

This general behavior is expected irrespective of the focus
measure operator used. The focus measure operator
only selects the frequencies that will play a dominant role
in this process. Our experiments (section 6) as well as
Krotkov's empirical evaluation of various focus measure
operators [5] support the above argument. As seen in [5],
image noise and magnification variations will of course
degrade the performance of any focus measure operator.

We note that in the case of the Laplacian the second
derivatives in the x and y directions can have opposite
signs and tend to cancel each other. An example of such an
instance is illustrated in Fig. 6; the partial derivatives are
equal in magnitude but opposing in sign, i.e. \nabla^2 I = 0. In
the case of textured images, similar instances may occur
frequently and the Laplacian may at times behave in an
unstable manner. We overcome this problem by defining
the modified Laplacian as:

\nabla_M^2 I = \left| \frac{\partial^2 I}{\partial x^2} \right| + \left| \frac{\partial^2 I}{\partial y^2} \right|    (12)

Note that the modified Laplacian is always greater or equal
in magnitude to the Laplacian. In [15] the advantages
of the above modification have been empirically demonstrated.
However, the experiments described in [15] also
indicate that the response of the modified Laplacian is
slightly more stable but not very different from that of
the Laplacian.

Figure 6: A texture instance with zero Laplacian value.

The discrete approximation to the Laplacian is usually
a 3x3 operator. In order to accommodate possible
variations in the size of texture elements, we compute the
partial derivatives by using a variable spacing (step) between
the pixels used to compute the derivatives. Hence,
the discrete approximation to the modified Laplacian is
computed as:

ML(x,y) = | 2 I(x,y) - I(x - step, y) - I(x + step, y) |
        + | 2 I(x,y) - I(x, y - step) - I(x, y + step) |    (13)

Finally, the focus measure at a point (i,j) is computed as
the sum of modified Laplacian values, in a small window
around (i,j), that are greater than a threshold value:

F(i,j) = \sum_{x=i-N}^{i+N} \sum_{y=j-N}^{j+N} ML(x,y)   for ML(x,y) > T_1    (14)

where the parameter N determines the window size used to
compute the focus measure. In contrast to auto-focusing
methods, we typically use a small window of size 3x3 or
5x5, i.e. N = 1 or N = 2. We shall refer to the above focus
measure as the sum-modified-Laplacian (SML). Note that
as a result of the definition of the modified Laplacian and the
use of the threshold T_1, the SML is not a linear operator
and cannot be implemented as a simple convolution. However,
the SML can be computed using a straightforward
algorithm.

6  Evaluating the Focus Measure


We evaluate the SML focus measure by analyzing its behavior
as a function of the distance between the observed
surface and the focused plane. A detailed description of
the experimental set-up is given in a later section. In
the following experiments, texture samples are attached
to a translational stage (Fig. 4) and the distance, d_s, from
each sample to the stage is known by measurement. Images
of the samples are obtained using a microscope and
a 512x512 pixel CCD camera. The complete imaging system
has a physical resolution of approximately 1 μm per
pixel width.
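
The SML measure evaluated in these experiments (the discrete modified Laplacian of eq. 13 summed over a thresholded window as in eq. 14) can be sketched in a few lines. This is an illustrative sketch rather than the implementation used in the experiments: the function names, the list-of-rows image layout, and the test patterns are assumptions, and the caller is assumed to keep the window at least `n + step` pixels away from the image border.

```python
def modified_laplacian(img, x, y, step=1):
    """Discrete modified Laplacian ML(x, y) of eq. 13; img[y][x] is intensity."""
    centre = 2 * img[y][x]
    d_xx = abs(centre - img[y][x - step] - img[y][x + step])
    d_yy = abs(centre - img[y - step][x] - img[y + step][x])
    return d_xx + d_yy

def sml(img, i, j, n=1, step=1, t1=7):
    """Sum-modified-Laplacian F(i, j) of eq. 14 over a (2n+1) x (2n+1) window.

    Only ML values greater than the threshold t1 contribute to the sum.
    """
    total = 0
    for y in range(j - n, j + n + 1):
        for x in range(i - n, i + n + 1):
            ml = modified_laplacian(img, x, y, step)
            if ml > t1:
                total += ml
    return total

# Hypothetical test patterns: a checkerboard (strong texture) and a flat patch.
SIZE = 9
checker = [[255 * ((x + y) % 2) for x in range(SIZE)] for y in range(SIZE)]
flat = [[128] * SIZE for _ in range(SIZE)]

print(sml(checker, 4, 4), sml(flat, 4, 4))
```

The thresholded sum is why the SML cannot be written as a single convolution: the decision `ml > t1` is made per pixel before accumulation.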


Figure  7:  SML  focus  measure  function  computed  for  two 
texture  samples. 



In Fig. 7, the focus measure functions of two samples are
shown. Sample X has high texture content while sample Y
has relatively weaker texture. Both samples are made of a
paste containing resin and tungsten particles. The variable
size of the tungsten particles gives the surfaces a randomly
textured appearance. For each sample, the stage is moved
in increments (Δd) of 1 μm, an image of the sample is
obtained, and the SML focus measure is computed using
an evaluation window size of 10x10 pixels. The vertical
lines in Fig. 7 indicate the known initial distances (d_f - d_s)
of the samples from the focused plane. The focus measures
were computed using parameter values of step = 1 and T_1
= 7. No form of temporal filtering was used to reduce the
effects of image noise, as we intend to use unfiltered focus
measures to estimate the depth of surface points. Though
the measure values are slightly noisy, they peak very close
to the expected peak positions (vertical lines in Fig. 7). We
see that the focus measure function peaks sharply for the
stronger texture and it peaks relatively slowly and with
a lower peak value for the weaker texture. However, the
sharpness of the focus measure function depends not only
on the texture strength but also on the depth of field of the
imaging system. The depth of field, in turn, depends on the
magnification and aperture size of the imaging system. We
will assume that the depth of field is constant for all our
experiments. Note that the focus measure functions for
both samples have Gaussian-like distributions near their
peak values. The fringes of the focus measure functions are
less symmetric as the magnification can vary substantially
from one fringe to the other.

Fig. 8 shows the focus measure computed as a function
of the parameter step for the sample X shown in Fig. 7.
Once again, an evaluation window size of 10x10 and a
threshold value of T_1 = 7 were used. We see that, for
sample X, a maximum measure value is computed at step
= 4. However, it may be noted that the effective size of the
measure evaluation window increases with the step size.
Since we are interested in local depth estimates, step values
of 1 or 2 are usually used. Fig. 9 shows the effect of varying
the threshold T_1, for both focused and defocused images
of sample X. A good value for T_1 is one that produces a
high measure value for the focused image and low measure
values for defocused images. From Fig. 9, we see that, for
sample X, T_1 = 7 appears to be a good choice. However,
from a number of unreported experiments, we find that
though the peak value of the focus measure function varies
with the parameter values, the same parameter values may
be used to obtain sharp, unimodal, focus measure functions
for a large range of texture types and strengths.

Figure 8: SML focus measure and the parameter step.



In [15], the SML focus measure has been compared with
three other measures that have been previously used for
auto-focusing: Tenengrad, variance, and sum-Laplacian
(SL). These experiments indicate that, among these operators,
the SML operator is best suited for measuring the
focus quality of textured images.

7  Sampling the Focus Measure Function

The focus measure function of an image point (x,y) may be
represented as F(x,y,d). Since depth estimation is a local
operation, we focus our attention on a single image point,
bearing in mind that the same estimation method can be
applied to all other image points. The focus measure function
at the image point is F(d). From the previous sections
we know that F(d) has a Gaussian distribution near
the peak, with mean value d̄ and standard deviation σ_p
(Fig. 10). The mean d̄ corresponds to the stage displacement
at which F(d) is maximum, i.e. F(d̄) = F_p. As
the texture content on the surface element increases, F_p
increases and σ_p decreases. Each surface element, therefore,
is expected to have its own F_p and σ_p values.

If we use very small stage displacements (Δd ≈ 0), the
number of images to be obtained and processed is too large
from the perspective of practical implementation. Hence,
we use large displacements to obtain a few images of different
focus levels and use the Gaussian model to interpolate
a small number of focus measures in the peak region
to obtain depth estimates. Computing focus measures
at a finite number of stage displacements is equivalent
to sampling the function F(d); at each displacement
d_i the focus measure F(d_i) is computed to obtain the set
{F(d_i) | i = 1, 2, ..., M}. We show in the following section
that a minimum of three focus measures are needed
to perform the Gaussian interpolation. Since the Gaussian
model is valid only in the peak region of F(d), these three
focus measures must be computed in this region. This is
ensured by using the condition σ_p < Δd < 2σ_p. Note
that all object points are subjected to the same displacements.
Therefore, by applying the above condition to the
image area that has maximum texture content, we can ensure
that a few or many focus measures will be computed
in the ±σ_p range at all other image points.

The value of σ_p also increases with the depth of field
of the imaging system. Therefore, for objects of larger
dimensions also, only a small number of images may be used
by increasing the depth of field.

8  Depth Estimates from Focus Measures

We now describe the estimation of depth of a surface point
(x,y) from the focus measure set {F(d_i) | i = 1, 2, ..., M}.
The parameter d̄ represents the depth of the surface point.
For convenience the notation F_i is used to represent the
focus measure value F(d_i). We present algorithms for two
different depth estimation methods. Each algorithm may
be applied to all points in the image to obtain depth maps.


8.1  Coarse  Resolution  Depth  Estimation 

The  first  algorithm  simply  looks  for  the  displacement  value 
di  that  maximizes  the  focus  measure  and  assigns  that  value 
to  d. 

Algorithm  1 

Step 1: Let k = 1, F_max = 0.

Step 2: If F_k > F_max, then F_max = F_k and d̄ = d_k.

Step 3: If k < M, then k = k + 1, go to step 2. Else
go to step 4.

Step 4: If F_max < T_2, the point (x,y) belongs to the
background. Stop.

This simple algorithm may be used to compute low resolution
depth estimates. The performance of the algorithm
is directly dependent on the selection of Δd. If Δd is
small, a large number of object images are obtained and
the depth maps are more accurate.
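
Algorithm 1 amounts to a thresholded arg-max over the focus measure set at one image point. A minimal sketch follows; the function name, the use of `None` to mark background, and the sample values are assumptions made for illustration.

```python
def coarse_depth(focus, displacements, t2):
    """Algorithm 1 at one image point.

    focus[k] is the focus measure F_k recorded at stage displacement
    displacements[k]. Returns the displacement of the largest focus
    measure, or None when that measure is below the threshold t2
    (the point is then treated as background).
    """
    f_max, depth = 0.0, None
    for f_k, d_k in zip(focus, displacements):
        if f_k > f_max:
            f_max, depth = f_k, d_k
    return depth if f_max >= t2 else None

# Hypothetical focus measures sampled every 100 units of stage travel.
print(coarse_depth([12.0, 40.0, 95.0, 60.0], [0, 100, 200, 300], t2=20.0))
```

The depth resolution of this estimate is limited to the sampling interval Δd, which is the motivation for the interpolation method of the next subsection.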


8.2  Depth  Estimation  by  Gaussian  Interpolation 

The second algorithm uses the Gaussian distribution to
model the peak region of the focus measure function F(d)
and interpolates the computed measure values to obtain
more accurate depth estimates. The following algorithm
uses only three focus measures, namely, F_{m-1}, F_m, and
F_{m+1}, that lie on the largest mode* of F(d), such that
F_m > F_{m-1} and F_m > F_{m+1} (Fig. 10).

Figure 10: Gaussian interpolation of focus measures.

Using the Gaussian model, the focus measure function
may be expressed as:

F = F_p \exp\left\{ -\frac{1}{2} \left( \frac{d - \bar{d}}{\sigma_p} \right)^2 \right\}    (15)

where d̄ and σ_p are the mean and standard deviation of the
Gaussian distribution (Fig. 10). Using the natural logarithm,
we can rewrite eq. 15 as:

\ln F = \ln F_p - \frac{1}{2} \left( \frac{d - \bar{d}}{\sigma_p} \right)^2    (16)

By substituting each of the three measures F_{m-1}, F_m,
and F_{m+1}, and its corresponding displacement value in
eq. 16, we obtain three equations that can be solved for d̄
and σ_p:

\bar{d} = \frac{(\ln F_m - \ln F_{m+1})(d_m^2 - d_{m-1}^2) - (\ln F_m - \ln F_{m-1})(d_m^2 - d_{m+1}^2)}{2 \Delta d \{ (\ln F_m - \ln F_{m-1}) + (\ln F_m - \ln F_{m+1}) \}}    (17)

\sigma_p^2 = - \frac{(d_m^2 - d_{m-1}^2) + (d_m^2 - d_{m+1}^2)}{2 \{ (\ln F_m - \ln F_{m-1}) + (\ln F_m - \ln F_{m+1}) \}}    (18)

Using eq. 15, we can find F_p from σ_p and d̄ as:

F_p = \frac{F_m}{\exp\left\{ -\frac{1}{2} \left( \frac{d_m - \bar{d}}{\sigma_p} \right)^2 \right\}}    (19)

*Due to image noise and variations in magnification, the focus
measure function may be multi-moded with one strong peak
and one or more weak ones.

If F_p is large and σ_p is small, the focus measure function
has a strong peak, indicating high surface texture content
in the vicinity of the image point (x,y). Hence, the values
of F_p and σ_p can be used to segment the observed scene
into regions of different texture content. Fig. 11 shows
the results of Gaussian interpolation applied to the focus
measures computed for a real sample.



Figure  11:  Gaussian  interpolation:  Experimental  result. 

The following algorithm first finds the measures F_{m-1},
F_m, and F_{m+1} that correspond to the strongest peak of
F(d), and then uses these measures to estimate the depth
d̄ by Gaussian interpolation.

Algorithm 2

Step 1: Let k = 3, F_{m-1} = 0, F_m = 0, F_{m+1} = 0, d_m = 0.

Step 2: If F_{k-1} > F_m, F_{k-1} > F_k, and F_{k-1} > F_{k-2},
then:

F_m = F_{k-1}
F_{m-1} = F_{k-2}
F_{m+1} = F_k
d_m = d_{k-1}

Step 3: If k < M, k = k + 1, go to step 2.

Step 4: d_{m-1} = d_m - Δd and d_{m+1} = d_m + Δd.
Determine d̄, σ_p, and F_p using Eqs. 17, 18, and 19.

Step 5: If F_p < T_3 or σ_p > T_4, the image point (x,y)
belongs to background. Stop.

Since the values of F_p and σ_p are only useful for texture
segmentation, their evaluation may be avoided to save
computations.

9  An Automated System

9.1  Implementation

We have implemented a fully automated shape from focus
system for the recovery of microscopic objects. A
photograph of the system is shown in Fig. 12. A Nikon
Alphaphot-2 model microscope is used to image the objects.
Objects can be magnified using objective lenses with
x10, x40, and x100 magnification. The object is illuminated
using bright field illumination where light energy is
focused on the object by the same lenses that are used to
magnify the object. A CCD camera with 512x512 pixels
is mounted on the microscope to obtain digital images of
the object. The z-axis of the microscope stage is driven
by a stepper motor and the position of the stage can be
computer controlled with a resolution and accuracy of 0.02
μm. The shape from focus algorithm is programmed and
executed on a Sun SPARC 2 workstation.

Figure 12: Automated shape from focus system.


The object is placed on the microscope stage and the
appropriate objective lens is used to magnify the object.
The focus measure parameters (T_1 and step) and the stage
displacement (Δd) are provided to the program. The program
automatically increments the stage position,
digitizes and stores an image for each new position, and
uses the image sequence to compute a depth map of the
object. The program also reconstructs a focused image of
the object from the sequence of defocused images. The
reconstruction algorithm uses the estimated depth to locate
and patch together the best focused image areas in the image
sequence.
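
The text does not spell out the reconstruction step beyond locating the best focused areas by depth. One plausible reading, shown here purely as an assumed sketch, is to copy each pixel from the frame acquired closest to its estimated depth; the function name, the data layout, and the choice to leave background pixels at zero are all hypothetical.

```python
def reconstruct_focused(images, displacements, depth_map):
    """Compose a focused image from a sequence of defocused images.

    images[k][y][x] is the intensity recorded at stage displacement
    displacements[k]; depth_map[y][x] is the estimated depth in the same
    units, or None where no estimate exists (background).
    """
    height, width = len(depth_map), len(depth_map[0])
    result = [[0] * width for _ in range(height)]
    for y in range(height):
        for x in range(width):
            d = depth_map[y][x]
            if d is None:
                continue  # background pixel: left at 0 in this sketch
            # Index of the frame acquired closest to the estimated depth.
            k = min(range(len(displacements)),
                    key=lambda i: abs(displacements[i] - d))
            result[y][x] = images[k][y][x]
    return result

# Tiny hypothetical stack: three frames of a 1 x 2 image.
stack = [[[1, 2]], [[3, 4]], [[5, 6]]]
print(reconstruct_focused(stack, [0.0, 10.0, 20.0], [[9.0, None]]))
```

A per-pixel copy is the simplest choice; blending neighbouring frames would be an alternative when the interpolated depth falls between two stage positions.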

9.2  Results 

Prior to automating the shape from focus system, experiments
were conducted to determine the accuracy and feasibility
of the method. The first experiment was conducted
on a steel ball sample that is 1590 μm in diameter. The ball
has a rough surface that gives it a textured appearance. A
camera image of the ball under bright field illumination is
shown in Fig. 13(a). Incremental displacements of Δd =
100 μm were used to obtain 12 images of the ball, and a
5x5 SML operator was applied to the image sequence to
obtain focus measures. Depth maps of the ball, generated
by the coarse resolution and Gaussian interpolation algorithms,
are shown in Fig. 13(b) and 13(c), respectively. The
known size and location of the ball were used to obtain error
maps by subtracting a smooth ball from the two depth
maps. It is difficult to define the accuracy of the method as
it depends on several factors: the surface texture, depth
of field of the imaging system, and the incremental displacement
Δd. The table shown in Fig. 13(d) shows the
error statistics computed from the error maps corresponding
to the two algorithms. A total of 23235 image pixels
lie within the boundary of the ball. The number of depth
values computed by each algorithm depends on the values
selected for the thresholds T_2, T_3, and T_4.
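
The error-map computation described above can be sketched as follows. This is an assumed illustration, not the procedure used to produce Fig. 13(d): the function names are hypothetical, depths are taken as heights above the equatorial plane of the ball, and pixel coordinates are assumed to be in the same units as the radius.

```python
import math

def sphere_height(x, y, cx, cy, radius):
    """Height of the upper sphere surface above its equatorial plane.

    Returns None for points outside the sphere boundary.
    """
    r2 = radius ** 2 - (x - cx) ** 2 - (y - cy) ** 2
    return math.sqrt(r2) if r2 >= 0 else None

def error_statistics(depth_map, cx, cy, radius):
    """Compare a depth map with an ideal smooth sphere.

    depth_map[y][x] is an estimated height or None (no estimate).
    """
    errors = []
    for y, row in enumerate(depth_map):
        for x, d in enumerate(row):
            ref = sphere_height(x, y, cx, cy, radius)
            if d is not None and ref is not None:
                errors.append(d - ref)
    if not errors:
        raise ValueError("no depth estimates inside the sphere boundary")
    n = len(errors)
    return {
        "points": n,
        "mean_error": sum(errors) / n,
        "mean_abs_error": sum(abs(e) for e in errors) / n,
        "max_abs_error": max(abs(e) for e in errors),
    }
```

Pixels rejected as background by the thresholds carry no depth value and are skipped, which is why the number of points can differ between the two algorithms.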

The automated system has been used to recover the
shapes of a variety of industrial samples. Fig. 14 shows a
tungsten paste filling in a via-hole on a ceramic substrate
[16]. The filling is used to establish electrical connections
between different components on a circuit board. Conditions
such as excess filling and lack of filling cause electrical
defects such as short and open circuits. The via-hole
shown in Fig. 14 is approximately 70 μm in diameter and is
not sufficiently filled with tungsten paste. A total of 18
images of the via-hole were obtained using stage position
increments of 4 μm. Some of these images are shown in
Fig. 14(a)-(f). The specular reflectance properties and variable
size of the tungsten particles give the surface a random
texture. The white background is the substrate area
that has weak texture. A depth map of the via-hole was
obtained using the Gaussian interpolation algorithm. A
5x5 median filter was used to remove a few erroneous
depth estimates that resulted from the lack of texture in
some image areas. Fig. 14(g) and Fig. 14(h) show the
reconstructed image and two views of the depth map.
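
The median filtering step can be sketched as below. The border handling (a window truncated at the image edge) and the treatment of missing estimates (ignored in the window and left missing in the output) are assumptions, since the text only states the 5x5 window size.

```python
from statistics import median

def median_filter(depth_map, size=5):
    """Median-filter a depth map with a size x size window.

    depth_map[y][x] is a depth value or None (no estimate). None entries
    are ignored when forming the window and stay None in the output.
    """
    half = size // 2
    height, width = len(depth_map), len(depth_map[0])
    out = [row[:] for row in depth_map]
    for y in range(height):
        for x in range(width):
            if depth_map[y][x] is None:
                continue
            window = [depth_map[j][i]
                      for j in range(max(0, y - half), min(height, y + half + 1))
                      for i in range(max(0, x - half), min(width, x + half + 1))
                      if depth_map[j][i] is not None]
            out[y][x] = median(window)
    return out
```

An isolated outlier inside an otherwise smooth region is replaced by the local median, while smooth regions are left essentially unchanged.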

The object in Fig. 15 is a contamination particle on the
surface of a circuit board. The particle is approximately
250 μm in length and 150 μm in height. Though the particle
surface produces images with very weak texture, the
reconstructed image and the depth map generated by the
system are accurate and detailed.

The  above  experiments  indicate  that  the  Gaussian  in¬ 
terpolation  algorithm  performs  stably  over  a  wide  range  of 
textures.  Errors  in  computed  depth  estimates  result  from 
factors  such  as  image  noise,  Gaussian  approximation  of  the 
SML  focus  measure  function,  and  weak  textures  in  some 
image  areas.  We  are  currently  using  the  system  to  recover 
the  shapes  and  focused  images  of  a  variety  of  microscopic 
biological  samples,  including  micro-organisms  and  chromo¬ 
somes. 

10  Summary 

We  conclude  with  a  brief  summary  of  the  shape  from  focus 
method  presented  in  this  paper. 

•  Most  surfaces  appear  rough  at  some  level  of  magni¬ 
fication.  Visibly  rough  surfaces  produce  highly  tex¬ 
tured  images  and  it  is  difficult  to  recover  the  three- 
dimensional  shapes  of  these  surfaces  using  existing 
passive  and  active  sensing  techniques. 

•  The shape from focus method obtains a sequence
of object images by translating the object through
the focused plane of the imaging system. The
sum-modified-Laplacian operator is applied to the image
sequence to compute a set of focus measures at each
point on the object surface.

•  Using  a  model  for  the  focus  measure  function,  the 
focus  measures  at  each  point  are  interpolated  to  com¬ 
pute  accurate  depth  estimates.  The  local  nature  of 
the  depth  estimation  technique  enables  it  to  adapt  to 
substantial  variations  in  image  texture. 

•  A fully automated shape from focus system has been
developed. The system has been used to recover dense
and accurate depth maps of several microscopic objects.

(a) Camera image.  (b) Depth map: coarse resolution.  (c) Depth map: Gaussian interpolation.

Diameter of test sphere: 1590 μm

                               Coarse      Gaussian interpolation
Number of points               226[?]2     23257
Mean error (μm)                7.861       3.857
Mean absolute error (μm)       30.32       [missing]
Maximum absolute error (μm)    187.80      175.82

(d) Error statistics.

Figure 13: Results: Steel ball.

Abstract

We  address  the  problem  of  obtaining  natural  (intuitive) 
descriptions  of  planar  shapes.  Shape  description  is  a 
major  problem  in  machine  perception,  and  is  the  basis 
for  recognition.  Many  approaches  have  been  suggested, 
but  none  provide  a  complete  and  natural  solution.  In 
this  paper  we  suggest  a  method  for  producing  an  axial 
representation  of  a  shape  based  on  a  hierarchical  decom¬ 
position  of  the  shape  into  its  parts.  The  novelty  of  our 
approach  lies  in  the  combination  of  several  competing 
approaches  and  tools,  into  a  unified  scheme  and  an  ef¬ 
ficient  implementation  producing  natural  descriptions. 
We  use  Smooth  Local  Symmetries  for  the  axial  repre¬ 
sentation  of  parts.  We  also  use  parallel  symmetries  to 
provide  information  on  global  relationships  within  the 
shape.  This  information  is  used  for  parsing  the  shape 
into  a  hierarchy  of  parts.  Currently  we  assume  that  our 
shape  is  a  closed  planar  curve.  Our  approach  uses  both 
region  and  contour  information,  can  handle  shapes  with 
corners,  and  addresses  the  issues  of  local  vs.  global  in¬ 
formation,  the  issue  of  scale  and  the  notion  of  part.  Our 
method  is  computationally  efficient,  parameter  free,  sta¬ 
ble,  and  we  present  results  which  show  that  it  provides 
an  intuitive  shape  description. 

1  Introduction 

Shape  description  is  the  basis  for  recognition,  and  is 
one  of  the  key  problems  in  machine  perception.  Vari¬ 
ous  methods  for  shape  description  have  been  suggested 
through  the  years  of  research  in  machine  and  human 
perception,  but  none  provide  a  complete  and  natural  so¬ 
lution  to  the  problem. 

Many  researchers  have  discussed  the  requirements
of  a  good  shape  description.  Examples  can  be  found
in  [2,  14,  15,  13,  8,  17].  A  good  description  should  be 
rich,  stable  and  invariant  to  changes  in  the  viewing  con¬ 
ditions.  In  addition,  it  should  be  capable  of  describing 
partially  occluded  parts.  These  requirements  lead  to  de¬ 
scriptions  which  are  segmented  and  hierarchical. 

*This  research  was  supported  by  the  Advanced  Research 
Projects  Agency  of  the  Department  of  Defense  and  was  mon¬ 
itored  by  the  Air  Force  Office  of  Scientific  Research  under 
Contract  No.  F49620-90-C-0078.  The  United  States  Govern¬ 
ment  is  authorized  to  reproduce  and  distribute  reprints  for 
governmental  purposes  notwithstanding  any  copyright  nota¬ 
tion  hereon. 


Hoffman  and  Richards  [8,  18]  suggest  a  contour
based  representation.  Following  psychophysical  obser¬ 
vations,  they  segment  the  contour  into  segments,  called 
codons,  at  negative  curvature  minima.  Curvature  max¬ 
ima  and  inflections  are  used  for  internal  part  descrip¬ 
tion.  This  scheme,  being  contour  based,  does  not  incor¬ 
porate  any  region  information.  It  also  does  not  account 
for  any  global  relationships  between  distant  sections  of 
the  curve. 

Many  authors  have  suggested  a  representation  scheme 
based  on  a  skeleton  or  an  axis  around  which  the  shape 
is  locally  symmetric.  Examples  are:  Blum’s  Symmetry
Axis  Transform  (SAT)  [3],  Brooks’s  Generalized  Rib¬ 
bons  [5],  which  are  a  two-dimensional  version  of  Bin- 
ford’s  Generalized  Cylinders  [2],  and  Brady’s  Smooth  Lo¬ 
cal  Symmetries  (SLS)  [4].  These  methods  incorporate 
contour  and  region  information.  An  in-depth  comparison
of  these  three  methods  was  performed  by  Rosenfeld  [20],
and  extended  by  Ponce  [16].  These  methods  are  local
in  nature,  they  are  very  sensitive  to  noise,  and  they  are
expensive  to  compute,  even  when  using  analytical
approximations  to  the  shape.

Leyton  [12]  has  suggested  a  process-grammar  for
shape.  He  argues  that  a  shape  can  be  understood  as  the 
outcome  of  processes  that  formed  it.  After  proving  an 
important  duality  between  curvature  extrema  and  sym¬ 
metry  structure,  he  uses  symmetry  axes  to  infer  process 
history.  This  approach  assumes  all  shapes  are  basically 
circles  which  have  been  deformed  by  protrusions  and
indentations.  No  distinction  is  made  between  parts  and
protrusions,  and  global  relationships  within  the  shape 
are  ignored,  making  this  description  unnatural  in  many 
cases. 

Recently,  Kimia  et  al.  [10]  have  proposed  a  theory 
of  shape  based  on  a  reaction-diffusion  equation.  Their 
method  produces  a  hierarchical  decomposition  of  shape
into  parts  and  protrusions.  The  issues  of  part  and
protrusion,  as  well  as  the  issue  of  scale,  are  addressed.
However,  it  is  a  continuous  process  which  is  computationally
expensive.  In  addition,  their  scheme  reduces  every  part
to  a  circle.  It  provides  only  a  parsing  of  the  shape  into
parts,  not  a  description  of  the  parts.

We  are  interested  in  producing  natural  descriptions  of 
planar  shapes.  Our  ultimate  goal  is  to  achieve  human-like
performance.  In  this  paper  we  suggest  a  method
for  producing  an  axial  representation  of  a  shape,  along 
with  a  hierarchical  decomposition  of  the  shape  into  its
parts.  The  novelty  of  our  approach  lies  in  the  combi¬ 
nation  of  several,  often  competing  approaches  and  tools, 
into  a  unified  scheme  and  an  implementation  producing 
natural  descriptions.  We  use  SLS  for  the  axial  descrip¬ 
tion  of  parts.  We  also  use  parallel  symmetries  [23]  to 
provide  information  on  global  relationships  within  the 
shape.  This  information  is  used  to  parse  the  shape  into 
a  hierarchy  of  parts.  When  conflicts  between  the  local  in¬ 
terpretation  and  the  global  interpretation  arise,  we  gen¬ 
erate  descriptions  based  on  both  interpretations.  Cur¬ 
rently  we  assume  that  our  shape  is  a  closed  planar  curve 
which  we  approximate  using  quadratic  B-splines.  Our 
approach  is  region  and  contour  based,  and  addresses  the 
issues  of  local  and  global  information,  the  issue  of  scale
and  the  notion  of  part.  Our  method  is  computationally 
efficient,  parameter-free,  stable,  and  we  present  results 
which  show  that  it  provides  intuitive  shape  descriptions. 

The  next  section  outlines  the  basic  ideas  of  our  ap¬ 
proach.  Section  3  presents  the  details  of  our  approach 
and  implementation.  Examples  of  the  results  we  ob¬ 
tained  are  presented  in  Section  4  followed  by  some  con¬ 
cluding  remarks  in  Section  5. 

2  Overview  of  the  Approach 

We  suggest  a  hierarchical  approach  for  shape  description, 
combining  local  and  global  information.  We  produce  a 
decomposition  of  the  shape  into  parts  together  with  an 
axial  description  of  these  parts.  We  outline  our  method
here;  the  details  of  our  implementation  are  presented  in
the  next  section.

A  strategy  often  used  for  obtaining  axial  representations
is  to  generate  all  possible  axes,  then  to  select  the
appropriate  ones  according  to  some  criteria  [6,  17].  Some
axes,  unfortunately,  cannot  be  found  until  some  parts  are 
removed.  It  is  necessary  to  use  a  hierarchical  strategy 
instead,  in  which,  at  each  step,  local  and  well  defined 
parts  are  described  and  removed.  Once  these  parts  are 
removed,  the  next  level  parts  can  now  be  described.  This 
process  is  efficient  and  produces  a  decomposition  of  the 
shape  into  its  intuitive  parts  with  a  stable  axial  description
of  these  parts.  One  remaining  problem  with  this
approach  is  that  it  ignores  global  relationships  between 
different  parts  of  the  shape.  We  use  parallel  symmetries, 
introduced  by  Ulupinar  and  Nevatia  [23],  to  detect  such 
global  relationships.  Once  conflicts  between  the  local 
description  and  the  global  relationships  are  detected,  we 
create  a  branch  in  our  decomposition  process,  thus  allow¬ 
ing  multiple  plausible  interpretations,  which  turn  out  to 
be  rare. 

Figure  1  presents  an  outline  of  the  data  flow  of  our 
approach.  Initially,  we  perform  some  preprocessing  on 
the  image  including  edge  detection  and  linking.  We 
then  approximate  the  input  shape  with  approximating 
B-splines.  Since  we  are  interested  in  an  axial  represen¬ 
tation,  we  find  SLS  to  be  a  powerful  descriptive  tool. 
However,  applied  globally  on  the  entire  shape  it  pro¬ 
duces  noisy  and  poor  descriptions.  Therefore,  we  first 
segment  the  contour  at  curvature  sign  changes  into  ini¬ 
tial  local  parts.  We  represent  these  parts  using  SLS.  A 
theorem  proved  by  Leyton  [11]  relates  the  existence  and
uniqueness  of  an  SLS  axis  describing  these  parts  to  the
curvature  extrema  of  their  contour.  This  enables  us  to
efficiently  generate  stable  and  noise  free  descriptions  for
these  local  parts.  In  addition  to  this  local  description,  we
find  global  relationships  within  the  shape,  by  computing
the  parallel  symmetries.

Figure  1:  Outline  of  the  data  flow  in  the  system.

Figure  2:  The  decomposition  of  a  man  shape.  The  original
shape  is  on  the  left.  At  every  step  the  parts  already
explained  are  shaded.  The  skeleton  description  is  also
presented.

Now  that  we  have  the  local  parts  and  the  information 
on  the  global  relationships,  we  decompose  the  shape  into 
parts.  The  decomposition  is  done  hierarchically,  first  re¬ 
moving  the  small  and  well  defined  parts,  and  then  an¬ 
alyzing  the  remaining  shape.  We  generate  the  possible 
parsings  of  the  shape  into  parts,  starting  from  the  initial 
shape  and  producing  the  axial  representation  determined 
by  every  parsing. 

In  order  to  clarify  what  we  mean  by  a  hierarchical
decomposition,  we  present  an  example  here  (we  present
it  in  more  detail  in  Section  4).  Figure  2  presents  the
decomposition  of  a  man  shape.  The  original  shape  is  on 
the  left.  At  every  step  the  parts  which  were  previously 
removed  (explained)  are  shaded.  The  current  shape  not 
yet  explained  is  left  white.  The  skeleton  description
obtained  is  also  presented.

3  Issues  and  Implementation 

In  this  section  we  describe  the  details  of  our  approach 
and  implementation.  We  first  present  the  tools  we  use 
and  the  motivation  behind  our  choice  of  these  tools,  then
how  these  tools  are  combined  into  the  shape  decompo¬ 
sition  process  producing  the  hierarchical  shape  descrip¬ 
tion. 

3.1  B-Spline  Approximation 

Following  Saint-Marc  and  Medioni  [21,  22],  we  use
approximating  B-splines  to  represent  planar  curves.  As
noted  there,  this  representation  is  rich,  compact,  stable 
and  local.  In  the  rest  of  this  paper,  we  ignore  the  prob¬ 
lem  of  obtaining  this  B-spline  representation.  From  now 
on,  we  assume  that  our  input  consists  of  a  single  closed 
planar  curve,  represented  by  B-spline  segments. 

A  B-spline  is  a  piecewise  polynomial  which  is  ex¬ 
pressed  as  a  linear  combination  of  polynomial  basis  func¬ 
tions.  The  coefficients,  or  control  points,  are  the  vertices 
of  the  B-spline  guiding  polygon.  In  our  application  we  use
quadratic  B-splines:  every  curve  segment  is  a  quadratic
polynomial  depending  on  three  adjacent  control  points.

Besides  providing  a  simple  analytical  representation, 
the  following  features  make  this  approximation  attrac¬ 
tive  for  our  application: 

•  The  B-spline  is  easily  manipulated  by  modifying  its 
guiding  polygon. 

•  The  B-spline  is  defined  locally:  changing  the  position
of  a  vertex  of  the  guiding  polygon  does  not  have  a
global  effect  on  the  representation.

•  Quadratic  B-splines  are  continuous  and  smooth. 
Discontinuities  of  the  tangent  could  be  introduced 
by  using  multiple  control  points. 

•  Each  quadratic  spline  has  constant  sign  curvature. 
This  implies  that  zero  crossings  of  curvature  can  oc¬ 
cur  only  at  the  knots  (the  connection  points  between 
two  splines). 

•  Quadratic  splines  have  either  one  or  no  curvature 
extrema.  As  shown  later,  limits  on  the  number  and 
location  of  the  curvature  extrema  simplify  the  com¬ 
plexity  of  our  process  significantly. 
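
The two curvature properties in the list above can be checked numerically. The following is a minimal sketch, assuming a uniform quadratic B-spline (the text does not state the knot spacing); the function names are illustrative and not taken from the paper. With a = P1 - P0 and b = P2 - P1, the derivative of a segment is (1 - t) a + t b and the curvature is (a x b) / |C'(t)|^3, so its sign is fixed by the cross product of the two polygon edges, and its magnitude peaks where the speed |C'(t)| is smallest.

```python
import math
from typing import Optional, Tuple

Point = Tuple[float, float]


def bspline_segment(p0: Point, p1: Point, p2: Point, t: float) -> Point:
    """Point on a uniform quadratic B-spline segment for t in [0, 1]."""
    b0 = 0.5 * (1.0 - t) ** 2
    b1 = 0.5 * (-2.0 * t * t + 2.0 * t + 1.0)
    b2 = 0.5 * t * t
    return (b0 * p0[0] + b1 * p1[0] + b2 * p2[0],
            b0 * p0[1] + b1 * p1[1] + b2 * p2[1])


def curvature(p0: Point, p1: Point, p2: Point, t: float) -> float:
    """Signed curvature of the segment at parameter t."""
    ax, ay = p1[0] - p0[0], p1[1] - p0[1]      # first polygon edge a
    bx, by = p2[0] - p1[0], p2[1] - p1[1]      # second polygon edge b
    dx = (1.0 - t) * ax + t * bx               # C'(t) = (1 - t) a + t b
    dy = (1.0 - t) * ay + t * by
    cross = ax * by - ay * bx                  # C'(t) x C''(t) equals a x b
    return cross / math.hypot(dx, dy) ** 3


def curvature_extremum_param(p0: Point, p1: Point, p2: Point) -> Optional[float]:
    """Parameter of the single curvature extremum, or None if the segment has none."""
    ax, ay = p1[0] - p0[0], p1[1] - p0[1]
    bx, by = p2[0] - p1[0], p2[1] - p1[1]
    if ax * by - ay * bx == 0.0:
        return None                            # collinear edges: zero curvature
    ex, ey = ax - bx, ay - by
    t_star = (ax * ex + ay * ey) / (ex * ex + ey * ey)   # minimizes |C'(t)|
    return t_star if 0.0 <= t_star <= 1.0 else None
```

For the control points (0, 0), (1, 0), (1, 1) the cross product is positive, so the whole segment has positive curvature, and the extremum falls at t = 0.5.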

3.2  Smooth  Local  Symmetries 

Brady  and  Asada  [4]  have  introduced  Smooth  Local  Sym¬ 
metries  (SLS)  as  a  method  for  shape  representation. 
Two  points  on  a  planar  curve  form  a  local  symmetry 
if  the  line  between  them  produces  equal  angles  with  the
normals  to  the  curve  at  the  two  points  respectively.  The 
line  between  the  two  points,  is  known  as  the  cross  sec¬ 
tion.  The  loci  of  the  mid  points  of  the  cross  sections, 
are  the  symmetry  axes.  A  symmetry  axis  together  with
its  cross  sections  is  sometimes  called  an  SLS  ribbon  or 
a  Brady  ribbon.  We  say  that  the  region  of  the  shape 
covered  by  the  ribbon  is  explained  by  the  ribbon. 
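
The local symmetry condition can be written as a small predicate. This is a sketch under stated assumptions: both normals are unit outward normals, and "equal angles" is read as a mirror-image relation, that is, the chord makes angles of equal magnitude and opposite sense with the two normals. The function names and the tolerance are illustrative, not part of the original formulation.

```python
import math
from typing import Tuple

Vec = Tuple[float, float]


def is_local_symmetry(pa: Vec, na: Vec, pb: Vec, nb: Vec, tol: float = 1e-9) -> bool:
    """True if points pa and pb (with unit outward normals na, nb) form a local symmetry."""
    dx, dy = pb[0] - pa[0], pb[1] - pa[1]
    length = math.hypot(dx, dy)
    if length == 0.0:
        return False
    dx, dy = dx / length, dy / length          # unit chord direction from A to B
    # Cosine and sine of the angle from the normal at A to the chord A -> B.
    cos_a = na[0] * dx + na[1] * dy
    sin_a = na[0] * dy - na[1] * dx
    # Same quantities at B, measured against the reversed chord B -> A.
    cos_b = -(nb[0] * dx + nb[1] * dy)
    sin_b = -(nb[0] * dy - nb[1] * dx)
    # Mirror symmetry: equal cosines, opposite sines.
    return abs(cos_a - cos_b) <= tol and abs(sin_a + sin_b) <= tol


def axis_point(pa: Vec, pb: Vec) -> Vec:
    """Midpoint of the cross section; the locus of these points is the symmetry axis."""
    return ((pa[0] + pb[0]) / 2.0, (pa[1] + pb[1]) / 2.0)
```

Any two points of a circle satisfy the predicate. Two points on opposite sides of a straight strip satisfy it only when the chord is perpendicular to the strip; a skewed chord gives equal angles of the same sense, which is the parallel symmetry case rather than a smooth local symmetry.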


(a)  Original  shape.  (b)  All  SLS  axes. 

Figure  3:  Computing  all  SLS  is  expensive  and  does  not 
result  in  a  good  description. 

Two  of  the  major  problems  with  using  SLS  are  sensi¬ 
tivity  to  noise  and  the  computational  difficulty  of  recov¬ 
ering  the  SLS  axis.  Even  in  the  case  of  analytical  curves, 
it  is  generally  impossible  to  find  analytical  solutions  for 
the  symmetry  axis.  Brady  and  Asada  [4]  have  overcome 
these  problems  by  approximating  the  curve  with  circular 
arcs.  Circular  arcs,  however,  cannot  guarantee  the 
continuity  of  the  approximation. 

Our  curve  representation  is  based  on  quadratic  B- 
splines  which  are  continuous.  Since  we  are  unable 
to  solve  analytically  for  the  SLS  of  two  quadratic  curve 
segments  (this  requires  finding  the  roots  of  a  polynomial 
of  degree  six),  we  first  find  the  axis  termination  points
using  search,  and  then  interpolate  for  the  rest  of  the  axis. 
The  complexity  of  the  algorithm  for  computing  the  SLS 
of  the  whole  shape  is  $O(n^2)$,  where  n  is  the  number  of
quadratic  curve  segments  in  the  spline  approximation. 
This  is  similar  to  the  circular  arc  case,  but  with  a  larger 
constant.  For  a  complete  description  of  the  algorithm, 
refer  to  [19]. 

3.3  Parts 

We  believe  that  segmentation  into  parts  is  the  key  to  any 
shape  description  used  by  higher  tasks,  such  as  recogni¬ 
tion.  However,  a  well  known  problem  is  that  although  a 
good  description  may  help  segmentation,  the  segmenta¬ 
tion  may  be  needed  for  obtaining  a  good  description. 

Computing  the  SLS  within  the  whole  shape  results  in 
many  axes  which  are  ambiguous  or  not  relevant  for  a 
natural  representation.  For  example,  two  very  distant 
segments  may  also  create  a  symmetry  axis,  which  may 
even  lie  outside  the  shape.  Figure  3  shows  an  example 
of  an  airplane  shape  and  all  of  its  SLS  axes.  Grouping 
the  axes  and  selecting  the  significant  ones  is  a  difficult 
and  computationally  expensive  problem  (for  an  example 
of  such  an  implementation  refer  to  [6]).  Therefore,  we 
suggest  that  the  shape  be  segmented  into  “natural”  parts 
prior  to  the  computation  of  the  SLS.  The  SLS  are  then 
computed  within  the  parts  only.  This,  of  course,  forces 
us  to  address  the  issue  of  defining  parts. 

The  Definition  of  a  Part 

Hoffman  and  Richards  [9],  and  more  recently  Bieder- 
man  [1]  argue  that  a  segmentation  into  parts  occurs  at 
the  negative  minima  of  curvature.  This  corresponds  to 
the  human  intuition  about  parts,  and  is  based  on  the
fact,  known  as  transversality  [7],  by  which,  in  general,
concavities  arise  when  two  convex  volumes  are  arbitrarily
joined.  Kimia  et  al.  [10],  however,  note  that  although
parts  are  bounded  by  curvature  minima,  the  converse
does  not  necessarily  hold.  See  for  example  Figure  4:
We  perceive  the  two  shapes  to  have  different  structure,
one  being  composed  of  two  parts  and  a  large  blob,  and
the  other  is  composed  of  four  similar  parts.  However,
segmenting  at  curvature  minima  results  in  four  parts  in
both  cases.

(a)  Two  parts  and  a  blob.  (b)  Four  parts.

Figure  4:  Segmenting  at  curvature  minima  results  in  4
parts  in  both  cases.

We  also  recognize  the  importance  of  the  transversality 
principle  for  the  perception  of  parts.  A  positive  curva¬ 
ture  curve  section  bounded  by  negative  curvature  curve 
sections  would  imply  a  possible  part  in  the  process  of 
interpreting  the  shape.  We  overcome  the  above  problem 
raised  by  Kimia  et  al.  by  using  a  hierarchical  decompo¬ 
sition,  in  which  the  smaller  parts  are  removed  before  the 
larger  ones  are  analyzed.  Note  that  in  contrast  to  Hoff¬ 
man  and  Richards,  we  believe  that  segmenting  exactly  at 
negative  curvature  minima  does  not  always  correspond 
to  the  intuition  of  the  part's  termination.  The  final
delineation  of  each  part  (the  exact  termination  points)  is
based  on  the  symmetry  axis  describing  it. 

Describing  Parts 

A  major  drawback  of  the  descriptive  power  of  the  SLS 
is  that,  in  general,  the  local  symmetries  are  not  unique.
This  may  lead  to  several  ribbons,  giving  different  expla¬ 
nations  to  the  same  regions  of  the  shape.  The  Symmetry- 
Curvature  Duality  Theorem,  proposed  and  proved  by 
Leyton  [11],  relates  the  curvature  of  the  curve  and  the 
axis  of  symmetry  describing  its  shape.  The  theorem 
guarantees  the  existence  and  uniqueness  of  an  SLS  axis 
describing  parts  with  one  positive  curvature  extremum.
For  any  additional  positive  curvature  extremum  within 
the  part,  there  exists  an  additional  axis  going  into  it. 

Given  a  shape,  we  first  segment  the  contour  into  sec¬ 
tions  bounded  by  consecutive  negative  curvature  curve 
sections.  These  sections  are  considered  parts.  We  com¬ 
pute  the  axes  of  symmetry  of  all  parts.  From  Leyton’s 
theorem,  we  now  know  the  origins  of  the  axes,  and  we 
also  know  on  which  side  of  the  ribbon  every  quadratic 
segment  may  be.  Therefore,  we  have  very  few  compar¬ 
isons  to  make  and  we  can  recover  the  SLS  axes  explaining 
each  part  very  efficiently.  Figure  5  shows  an  example  of 
a  shape,  and  its  initial  segmentation  into  parts  with  their 
axial  description. 
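
The initial segmentation can be sketched on the spline representation. Since each quadratic segment has constant curvature sign, a closed contour reduces to a cyclic list of signs, and the candidate parts are the maximal runs of positive segments bounded on both sides by negative segments. The sketch below is illustrative only: the function name is hypothetical, and treating zero-curvature segments as boundaries is an assumption.

```python
from typing import List, Sequence


def positive_runs(signs: Sequence[int]) -> List[List[int]]:
    """Maximal cyclic runs of positive-curvature segments, as lists of segment indices.

    `signs` holds one curvature sign per spline segment of a closed contour.
    Returns an empty list when there is no sign change (a blob has no parts).
    """
    n = len(signs)
    if n == 0 or all(s > 0 for s in signs) or all(s <= 0 for s in signs):
        return []
    # Start right after a non-positive segment so that no run wraps around the end.
    start = next(i for i in range(n)
                 if signs[i] <= 0 and signs[(i + 1) % n] > 0)
    runs: List[List[int]] = []
    current: List[int] = []
    for k in range(1, n + 1):
        i = (start + k) % n
        if signs[i] > 0:
            current.append(i)
        elif current:
            runs.append(current)
            current = []
    return runs
```

For the sign sequence [1, 1, -1, 1, -1, -1, 1] this yields two candidate parts: segment 3 alone, and segments 6, 0, 1, which wrap around the start of the list.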

Special  Cases:  Terminations,  Bends,  T-shapes 

Parts  which  have  more  than  one  positive  curvature 
extremum  within  them  are  initially  described  with  more 
than  one  SLS  axis.  We  have  identified  three  special  cases,
which  occur  frequently,  where  a  more  intuitive  single  axis
can  be  found  to  describe  these  parts.  Refer  to  Figure  6
for  examples  of  these  cases  which  we  have  intuitively
named:  Termination,  Bend,  and  T  (or  Mushroom).
Figure  6a  presents  the  SLS  axial  description  of  the
corresponding  shapes.  Figure  6b  presents  our  preferred
description.

Figure  5:  A  shape  and  the  axial  representation  of  its
parts.

(a)  SLS  axes  (the  arrows  point  to  positive  curvature
maxima)

(b)  Selected  axis  for  description

Figure  6:  Multiple  extrema:  Termination,  Bend,  T-shape

Once  a  part  with  two  maxima  of  curvature  is  detected,
the  three  possible  interpretations  are  tested  and  the  best
is  chosen  as  the  part's  description.  In  our  current
implementation  the  interpretations  are  compared  in  terms  of
the  variance  of  their  cross  section  function.  The  results
in  Figure  6b  and  in  Section  4  demonstrate  that  the  cor¬ 
rect  decision  is  made  by  using  this  simple  criterion. 
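
A minimal sketch of such a comparison follows. It assumes that the cross section function is the cross-section length sampled along the candidate axis and that the interpretation with the smallest variance is preferred; the text names the criterion but does not spell out either detail, so both are assumptions, as are the names used here.

```python
from statistics import pvariance
from typing import Dict, List


def best_interpretation(cross_sections: Dict[str, List[float]]) -> str:
    """Name of the candidate whose sampled cross-section lengths vary the least."""
    return min(cross_sections, key=lambda name: pvariance(cross_sections[name]))


# Hypothetical samples for the three candidate descriptions of one part.
candidates = {
    "termination": [1.0, 1.1, 0.9, 1.0],
    "bend": [1.0, 2.0, 3.0, 0.5],
    "T": [0.2, 3.0, 0.1, 2.0],
}
chosen = best_interpretation(candidates)
```

With these made-up samples the nearly constant width of the first candidate wins, which matches the intuition that a good axis has slowly varying cross sections.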

Removing  Parts 

To  produce  a  hierarchical  decomposition,  we  need  a
mechanism  for  removing  parts  from  the  shape  (see  Sec¬ 
tion  3.5).  Given  a  part,  it  is  not  clear  how  one  removes 
it  “gracefully”.  The  direct  method  of  simply  connecting 
the  two  minima  does  not  always  produce  the  intuitively 
correct  result.  By  removing  control  points  of  the  guid¬ 
ing  polygon  we  are  able  to  remove  parts  efficiently  and 
locally.  Figure  7  demonstrates  the  performance  of  the 
above  removing  procedure  on  a  shape  and  on  its  guiding 
polygon  (refer  to  [19]  for  more  details). 

Figure  7:  Removing  parts:  the  shape  and  its  guiding
polygon.

(a)  A  snake  shape.  (b)  Description  using  only  local
information.  (c)  Description  using  global  relationships.

Figure  8:  Global  relationships  are  important  for  natural
descriptions.


3.4  Representing  Global  Relationships 

In  Sections  3.2  and  3.3,  we  have  presented  SLS  as  a 
method  for  describing  local  parts.  Leyton’s  theorem  pro¬ 
vides  the  basis  for  the  efficient  computation  of  the  axes 
in  these  parts.  Unfortunately,  this  description  provides 
only  local  information.  It  is  clear  that  this  cannot  suf¬ 
fice  for  an  intuitive  description.  Consider  for  example 
the  “snake”  in  Figure  8a:  Using  only  the  local  informa¬ 
tion  the  snake  would  be  described  as  a  composition  of  7 
different  parts  (Figure  8b).  We,  however,  would  prefer 
to  capture  the  global  relationship  and  describe  the  snake 
as  a  single  entity  using  the  more  intuitive  symmetry  axis, 
as  in  Figure  8c. 

We  have  found  parallel  symmetries,  suggested  by 
Ulupinar  and  Nevatia  [23],  to  be  very  useful  in  capturing
global  relationships  within  planar  shapes.  Let
$C_i(s) = (x_i(s), y_i(s))$,  for  $i = 1, 2$,  be  two  parametric
planar  curves,  and  let  $\theta_i(s)$  be  their  tangent
orientation.  $C_1(s)$  and  $C_2(s)$  are  said  to  be  parallel  symmetric
if  there  exists  a  continuous  monotonic  function  $f(s)$  such
that  $\theta_1(s) = \theta_2(f(s))$.  The  symmetry  axis  is  the  loci  of
the  mid  points  of  the  cross  sections  between  $C_1(s)$  and
$C_2(f(s))$,  for  all  $s$  where  $f(s)$  exists.  The  symmetry  axis
together  with  the  cross  sections  is  a  parallel  ribbon.  We 
say  that  the  part  of  the  shape  covered  by  the  ribbon  is 
explained  by  the  ribbon.  Figure  8c  is  an  example  of  a 
parallel  symmetry  axis,  in  this  case  the  two  curves  are 
sections  of  the  contour  of  the  same  object.  As  shown  by 
Saint-Marc  and  Medioni  [21,  22],  the  detection  of  parallel
symmetries  between  quadratic  B-splines  is  computation¬ 
ally  very  efficient.  Using  their  algorithm,  we  compute  all 
elementary  parallel  ribbons.  From  these  elementary  rib¬ 
bons,  we  select  the  parallel  ribbons  which  are  significant 


for  the  description  of  the  shape.  We  shall  often  refer  to 
these  significant  parallel  ribbons  as  global  ribbons.  The 
selection  is  done  based  on  some  simple  and  intuitive  fil¬ 
tering  and  grouping  criteria  (see  [19]  for  details). 
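
To make the definition of a parallel symmetry concrete, the sketch below pairs points of two sampled curves that share a tangent orientation and returns the midpoints of the resulting cross sections. It is not the algorithm of Saint-Marc and Medioni [21, 22]; it is an illustration that assumes each curve is given as samples (x, y, theta) with theta strictly increasing, so that matching by orientation defines a monotonic correspondence. All names are hypothetical.

```python
import math
from bisect import bisect_left
from typing import List, Optional, Tuple

Sample = Tuple[float, float, float]   # (x, y, tangent orientation theta)


def point_at_orientation(curve: List[Sample], theta: float) -> Optional[Tuple[float, float]]:
    """Point of `curve` whose tangent orientation is theta, by linear interpolation."""
    thetas = [s[2] for s in curve]
    if not thetas or theta < thetas[0] or theta > thetas[-1]:
        return None                    # no correspondence exists for this orientation
    j = bisect_left(thetas, theta)
    if thetas[j] == theta:
        return curve[j][0], curve[j][1]
    (x0, y0, t0), (x1, y1, t1) = curve[j - 1], curve[j]
    w = (theta - t0) / (t1 - t0)
    return x0 + w * (x1 - x0), y0 + w * (y1 - y0)


def parallel_axis(c1: List[Sample], c2: List[Sample]) -> List[Tuple[float, float]]:
    """Midpoints of the cross sections joining points of equal tangent orientation."""
    axis = []
    for x, y, theta in c1:
        match = point_at_orientation(c2, theta)
        if match is not None:
            axis.append(((x + match[0]) / 2.0, (y + match[1]) / 2.0))
    return axis


def arc(radius: float, n: int) -> List[Sample]:
    """Quarter circle traversed counter-clockwise, sampled at n + 1 points."""
    out = []
    for k in range(n + 1):
        phi = (math.pi / 2.0) * k / n
        out.append((radius * math.cos(phi), radius * math.sin(phi), phi + math.pi / 2.0))
    return out
```

For two concentric quarter circles of radius 1 and 2, `parallel_axis(arc(1.0, 8), arc(2.0, 8))` returns points on the circle of radius 1.5, the expected axis of the curved strip between them.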

Once  such  a  global  ribbon  (significant  parallel  ribbon) 
is  found,  it  is  important  to  record  its  relationships  with 
the  local  parts.  A  local  ribbon  (part)  may  continue  the 
global  ribbon,  thus  the  two  ribbons  support  each  other 
in  the  parsing.  A  local  ribbon  may  also  conflict  with 
the  global  ribbon,  when  both  ribbons  give  a  different 
explanation  of  the  same  region  of  the  shape.  In  this  case 
they  cannot  be  used  together  to  describe  the  shape. 

3.5  Shape  Decomposition 

In  this  section  we  describe  the  decomposition  of  the 
shape,  a  planar  closed  curve,  into  a  hierarchy  of  parts, 
based  on  the  size  of  the  parts  and  on  global  relationships. 

The  shape  is  decomposed  using  a  recursive  procedure: 
Given  the  current  shape,  we  compute  all  local  parts,  as
defined  in  Section  3.3,  and  all  the  global  ribbons,  as  de¬ 
fined  in  Section  3.4.  Every  local  part  is  represented  by 
an  SLS  ribbon  (axis  and  cross  sections),  and  every  global 
ribbon  is  represented  by  a  parallel  ribbon  and  pointers  to 
its  local  continuations  and  local  conflicts.  We  now  create 
a  new  shape,  generated  from  the  current  shape  by
removing  its  smallest  parts  in  parallel.  In  case  of  a  conflict  with
a  global  ribbon  we  generate  the  two  possible  interpreta¬ 
tions,  the  first  ignoring  the  global  relationships,  and  the 
second  considering  the  global  relationships.  (We  do  not 
attempt  to  decide  which  is  the  correct  interpretation. 
This  decision  usually  requires  higher  level  knowledge.) 
Under  the  second  interpretation  a  new  shape  is  gener¬ 
ated  from  the  current  shape  by  removing  its  smallest 
parts,  this  time  using  the  information  on  the  global  re¬ 
lationships.  The  global  information  is  used  by  assigning 
high  size  values  to  parts  which  are  related  to  the  global 
ribbon,  either  continuing  or  conflicting  with  the  global 
axis.  Therefore,  these  parts  are  removed  only  if  there  are 
no  local  parts  which  are  unrelated  to  the  global  ribbon. 
Note  that  the  second  interpretation  is  generated  only  if 
the  removed  parts  are  different  from  the  removed  parts 
in  the  “all  local”  case.  The  number  of  interpretations 
is,  therefore,  bounded  by  n  +  1  where  n  is  the  number
of  global  ribbons  in  the  shape.  However,  in  most  cases 
there  are  no  significant  global  relationships  within  the
shape,  resulting  in  unique  parsing  of  the  shape.  The 
process  is  continued  recursively  on  each  branch. 
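
One step of this branching can be sketched as follows. The sketch covers only the choice of which parts to remove; recomputing the parts of the reduced shape needs the geometry and is left out. Two points are assumptions: the "high size values" are realized by considering parts related to a global ribbon only when no unrelated part exists, and "smallest parts" is taken to mean parts within a relative tolerance `rel_tol` of the minimum size. That tolerance is hypothetical; the text does not say how parts of similar size are grouped.

```python
from dataclasses import dataclass
from typing import FrozenSet, List, Sequence


@dataclass(frozen=True)
class Part:
    name: str
    size: float            # e.g. the length of the part's axis
    global_related: bool   # continues or conflicts with a global ribbon


def parts_to_remove(parts: Sequence[Part], use_global: bool,
                    rel_tol: float = 0.1) -> FrozenSet[str]:
    """Names of the smallest parts, optionally protecting parts tied to a global ribbon."""
    candidates = list(parts)
    if use_global:
        unrelated = [p for p in parts if not p.global_related]
        if unrelated:                      # related parts go only when nothing else is left
            candidates = unrelated
    if not candidates:
        return frozenset()
    smallest = min(p.size for p in candidates)
    return frozenset(p.name for p in candidates if p.size <= smallest * (1.0 + rel_tol))


def interpretations(parts: Sequence[Part]) -> List[FrozenSet[str]]:
    """Removal sets for the 'all local' branch and, if different, the global branch."""
    local = parts_to_remove(parts, use_global=False)
    result = [local]
    with_global = parts_to_remove(parts, use_global=True)
    if with_global != local:
        result.append(with_global)
    return result
```

For a snake-like shape whose bends are tied to a global ribbon, the local branch removes the bends first while the global branch removes the unrelated head; for a shape with no global ribbons a single interpretation is returned.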

Since  we  only  remove  parts,  the  complexity  of  the 
shape,  in  terms  of  the  number  of  control  points  of  the 
guiding  polygon,  reduces  from  level  to  level,  and  the  pro¬ 
cess  is  bound  to  terminate  rapidly.  The  recursive  pro¬ 
cedure  is  stopped  at  one  of  the  following  atomic  cases: 
The  shape  is  empty  (in  the  previous  step  all  parts  had 
about  the  same  size),  the  shape  is  a  positive  curvature 
blob  (no  zero  crossings  of  curvature),  or  the  shape  has 
exactly  two  zero  crossings  of  curvature  (a  bean  shape). 

The  axes  of  the  parts  removed  are  accumulated  down 
the  decomposition  hierarchy.  Therefore,  at  every  atomic 
level  we  have  a  skeleton  representation  of  the  shape, 
given  by  the  single  parsing  of  the  shape,  defined  by  the 
path  from  the  original  shape  to  that  atomic  state. 

Figure  9:  The  decomposition  of  a  man  shape.  The  original
shape  is  on  the  left.  The  skeleton  description  is  on
the  right  (see  text).

Figure  10:  The  decomposition  of  a  snake  shape.  The
original  shape  on  the  left.  The  two  possible  skeleton
descriptions  are  on  the  right  (see  text).

Figure  11:  The  decomposition  of  airplane  shapes.

Figure  12:  The  stability  of  the  process  (see  text).

4  Experimental  Results


We  have  applied  our  method  to  several  shapes.  For  each, 
we  present  the  decomposition,  which  is  the  output  of  our 
description  process.  The  original  shape  is  on  the  left.  At 
every  node  the  parts  which  were  previously  removed  are 
shaded.  The  current  shape,  not  yet  explained,  is  left 
white.  As  explained  earlier,  in  Section  3.5,  the  axial 
representations  of  the  parts  removed  are  accumulated. 
Therefore,  every  interpretation  generates  a  skeleton  de¬ 
scription  of  the  shape.  That  skeleton  description  is  pre¬ 
sented  at  the  end  of  the  decomposition  on  the  right. 

The  first  example  is  the  man  shape  (Figure  9).  Since 
there  are  no  significant  global  ribbons  in  this  shape  and 
in  its  sub  shapes,  there  are  no  branches  in  the  interpre¬ 
tation  process.  There  is  a  unique  decomposition  of  the 
shape  into  a  hierarchy  of  parts.  Note  that,  since  in  our 
current  implementation  we  define  the  size  of  the  part  to 
be  the  length  of  the  axis,  at  the  last  step  the  legs  and 
the  torso  are  removed  in  parallel,  leaving  the  final  blob 
at  the  last  stage. 

The  next  example  is  the  snake  shape  (Figure  10). 
This  is  an  example  of  the  importance  of  global  relation¬ 
ships.  Simply  segmenting  at  minima  of  curvature  (e.g. 
Richards  and  Hoffman  [18]),  or  interpreting  the  shape 
as  a  result  of  a  history  of  protrusions  and  indentations 
(e.g.  Leyton  [12]),  results  in  an  unintuitive  (but  possi¬ 
ble)  description,  as  in  the  top  interpretation  path.  Using 
the  global  ribbon,  we  are  able  to  produce  the  intuitive 
description  on  the  bottom  path. 

In  the  next  two  examples,  the  sketches  of  airplanes 
in  Figure  11,  there  is  once  again  no  significant  global 
information  conflicting  with  the  local  one.  Therefore, 


they  each  have  a  unique  decomposition  leading  to  the 
skeleton  description. 

The  next  example  illustrates  the  inherent  stability  of 
our  hierarchical  approach.  In  Figure  12a  we  present  the 
decomposition  of  an  unspecified  shape.  We  have  taken 
this  shape  and  added  several  bumps  to  its  curve.  A  good 
shape  description  mechanism  should  be  stable  enough  to 
produce  a  description  which  is  similar  to  the  original  and 
yet  rich  enough  to  show  the  differences.  As  Figure  12b 
shows,  our  algorithm  performs  well  on  this  example,  by 
detecting  and  removing  all  the  small  bumps  in  the  first 
step,  after  which  the  process  is  identical  to  the  original 
case.  A  higher  level  task  interpreting  these  results  could 
easily  examine  the  similarities  and  differences  between 
the  two  shapes. 

All  the  examples  we  have  presented  so  far  were  based 
on  synthetic  data.  We  have  also  applied  our  method 
to  shapes  obtained  from  real  images  of  model  airplanes 
taken  on  a  light  table.  The  results  are  presented  in
Figure  13.  Note  also  that  the  “F16”  shape  of  Figure  13d
has  sharp  corners  (i.e.  discontinuities  of  the  tangents). 
These  corners  are  simply  represented  using  multiple  con¬ 
trol  points  in  the  guiding  polygon.  They  are  considered 
as  any  other  curvature  extrema,  and  do  not  require  any 
special  treatment  in  the  decomposition  process. 

5  Concluding  Remarks 

We  have  presented  a  powerful  method  for  obtaining  nat¬ 
ural  descriptions  of  planar  shapes.  Our  method  produces 
an  axial  representation  of  a  shape,  and  a  discrete  hierarchical
decomposition  of  the  shape  into  its  parts.  We  use
Smooth  Local  Symmetries  for  the  axial  representation  of
parts.  We  also  use  parallel  symmetries  to  provide
information  on  global  relationships  within  the  shape.  This
information  is  used  to  parse  the  shape.

(a)  The  decomposition  of  an  “F14”  shape.

(b)  The  decomposition  of  an  “F106”  shape.

(c)  The  decomposition  of  an  “F4”  shape.

(d)  The  decomposition  of  an  “F16”  shape.

Figure  13:  The  decomposition  of  airplane  shapes  obtained
from  real  data.

Our approach is both region- and contour-based, combines
local and global information, can handle corners,
and addresses the issue of scale and the notion of part.
Our method is computationally efficient, parameter-free
and stable. We have presented results which show that it
provides intuitive shape descriptions for various shapes.

Currently  we  assume  that  our  shape  is  a  closed  curve. 
We  do  not  solve  the  figure-ground  problem.  It  is  clear 
that  for  many  applications  this  problem  has  to  be  solved 
prior  to  shape  description.  In  addition,  due  to  artifacts 
of  the  spline  representation,  our  current  implementation 
may  not  be  stable  on  straight  lines.  This  is  because  when 
approximating  a  straight  line,  the  spline  approximation 
may  fluctuate  between  positive  and  negative  curvature 
segments.  This  may  be  solved  by  ignoring  such  changes 
of  the  sign  of  the  curvature  if  the  absolute  values  are  very 
small. We also consider only positive curvature parts,
although intuitive descriptions sometimes require capturing
negative curvature parts. For example, our method
will not work well on the famous example of a rectangle
with a sharp indentation. These problems and others
are subject to further research.
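
The remedy suggested above for straight lines (ignoring sign changes of the curvature when its absolute value is very small) can be illustrated with a minimal Python sketch. The function name `curvature_sign_runs` and the threshold `eps` are hypothetical and chosen for illustration only; the text gives no value for the threshold and this is not the implementation used for the results above.

```python
def curvature_sign_runs(curvatures, eps):
    """Split sampled curvature values into runs of constant sign.

    Samples with |k| < eps carry no sign of their own: they are absorbed
    into the current run, so tiny oscillations do not create new segments.
    Returns a list of (sign, first_index, last_index) tuples.
    """
    runs = []      # each run is [sign, start, end]
    current = 0    # sign of the run being built (0 = not yet known)
    for i, k in enumerate(curvatures):
        sign = 0 if abs(k) < eps else (1 if k > 0 else -1)
        if sign == 0 or sign == current:
            if not runs:
                runs.append([current, i, i])
            else:
                runs[-1][2] = i
        elif current == 0 and runs:
            # Leading near-zero samples adopt the first significant sign.
            runs[-1][0] = sign
            runs[-1][2] = i
            current = sign
        else:
            runs.append([sign, i, i])
            current = sign
    return [tuple(r) for r in runs]
```

With `eps = 0` every sign flip starts a new segment; with a small positive `eps` the near-zero fluctuations typical of a spline approximating a straight line are merged into the neighboring segment.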

References 

[1]  I.  Biederman.  Recognition  by  components.  Psychological 
Review,  94:115-147,  1987. 

[2]  T.  O.  Binford.  Visual  perception  by  computer.  In  IEEE 
Conference  on  Systems  and  Controls,  Miami,  FL,  1971. 


[3] H. Blum. A Transformation for Extracting New Descriptors
of Shape. MIT Press, Cambridge, MA, 1967.

[4]  M.  Brady  and  H.  Asada.  Smoothed  local  symmetries 
and  their  implementation.  The  International  Journal  of 
Robotics  Research,  3(3):36-61,  1984. 

[5]  R.  A.  Brooks.  Symbolic  reasoning  among  3-D  mod¬ 
els  and  2-D  images.  Artificial  Intelligence,  17:285-348, 
1981. 

[6]  J.  H.  Connell  and  M.  Brady.  Generating  and  gener¬ 
alizing  models  of  visual  objects.  Artificial  Intelligence, 
31(2):159-183,  1987. 

[7]  V.  Guillemin  and  A.  Pollack.  Differential  Topology. 
Prentice  Hall,  1974. 

[8]  D.  D.  Hoffman.  Representing  Shapes  for  Visual  Recogni¬ 
tion.  PhD  thesis,  Massachusetts  Institute  of  Technology, 
May  1983. 

[9]  D.  D.  Hoffman  and  W.  A.  Richards.  Parts  of  recognition. 
Cognition,  18:65-96,  1985. 

[10]  B.  B.  Kimia,  A.  Tannenbaum,  and  S.  W.  Zucker.  Toward 
a  computational  theory  of  shape:  An  overview.  In  Pro¬ 
ceedings  of  European  Conference  on  Computer  Vision, 
pages  402-407,  Antibes,  France,  1990. 

[11]  M.  Leyton.  Symmetry-Curvature  duality.  Computer  Vi¬ 
sion,  Graphics  and  Image  Processing,  38:327-341,  1987. 

[12] M. Leyton. A process-grammar for shape. Artificial
Intelligence, 34:213-247, 1988.

[13] D. Marr. Vision. W. H. Freeman and Co., San Francisco,
CA, 1982.

[14]  D.  Marr  and  K.  Nishihara.  Representation  and  recog¬ 
nition  of  the  spatial  organization  of  three-dimensional 
shapes.  Proceedings  of  the  Royal  Society  of  London, 
B(200):269-294,  1977. 

[15]  R.  Nevatia.  Machine  Perception.  Prentice  Hall,  1982. 

[16]  J.  Ponce.  On  characterizing  ribbons  and  finding  skewed 
symmetries.  Computer  Vision,  Graphics  and  Image  Pro¬ 
cessing,  52:328-340,  1990. 

[17] K. Rao. Shape Description from Sparse and Imperfect
Data. PhD thesis, University of Southern California,
December 1988. IRIS Technical Report 250.

[18]  W.  A.  Richards  and  D.  D.  Hoffman.  Codon  constraints 
on  closed  2d  shapes.  Computer  Vision,  Graphics  and 
Image  Processing,  31:265-281,  1985. 

[19]  H.  Rom  and  G.  Medioni.  Hierarchical  decomposition  and 
axial  representation  of  shape.  In  Proceedings  of  the  SPIE 
conference  on  Geometric  Methods  in  Computer  Vision, 
San  Diego,  California,  July  1991. 

[20]  A.  Rosenfeld.  Axial  representations  of  shape.  Com¬ 
puter  Vision,  Graphics  and  Image  Processing,  33:156- 
173,  1986. 

[21]  P.  Saint-Marc  and  G.  Medioni.  B-spline  contour  rep¬ 
resentation  and  symmetry  detection.  In  Proceedings  of 
European  Conference  on  Computer  Vision,  pages  604- 
606,  Antibes,  France,  1990. 

[22]  P.  Saint-Marc  and  G.  Medioni.  B-spline  contour  rep¬ 
resentation  and  symmetry  detection.  Technical  Report 
IRIS-262,  University  of  Southern  California,  Los  Ange¬ 
les,  California,  February  1990. 

[23]  F.  Ulupinar  and  R.  Nevatia.  Inferring  shape  from  con¬ 
tour  for  curved  surfaces.  In  Proceedings  of  International 
Conference  on  Pattern  Recognition,  pages  147-154,  At¬ 
lantic  City,  NJ,  1990. 




Reconstructing  Surfaces  from  Unstructured  3D  Points* 


P.  Fua  and  P.  Sander 

P. Fua: SRI International, 333 Ravenswood Avenue, Menlo Park, CA 94025, USA
P. Sander: INRIA Sophia-Antipolis, 2004 Route des Lucioles, 06565 Valbonne Cedex, France


Abstract 

Most  active  and  passive  range  finding  tech¬ 
niques  yield  unstructured  and  generally  noisy 
3D  points.  In  order  to  build  useful  world  rep¬ 
resentations,  one  must  be  able  to  remove  spu¬ 
rious  data  points  and  group  the  remaining  into 
meaningful  surfaces. 

In  this  paper,  we  propose  an  approach  based 
on  fitting  local  surfaces.  Differential  proper¬ 
ties  of  these  surfaces  are  first  used  iteratively 
to  smooth  the  points,  and  then  to  group  them 
into  more  global  surfaces  while  eliminating  er¬ 
rors. 

We  present  results  on  complex  indoor  and  out¬ 
door  scenes  using  stereo  data  as  our  source  of 
3D  information. 

1  Introduction 

To  reconstruct  object  surfaces,  one  can  start  with  a  num¬ 
ber  of  measuring  techniques,  for  example  laser  rangefind¬ 
ing,  stereo  or  3D  scanners,  which  all  provide  raw  in¬ 
formation  about  the  location  of  points  in  space.  These 
points,  however,  often  form  potentially  noisy  “clouds”  of 
data  instead  of  the  surfaces  one  expects. 

Deriving the surfaces from such data is a difficult task
because:

•  the  3D  points  may  form  a  very  irregular  sampling 
of  the  space, 

•  they  may  have  been  produced  by  several  sensors  or 
derived  from  several  viewpoints  so  that  it  becomes 
impossible  to  work  only  in  the  imaging  plane  of  any 
one  sensor, 

•  several  surfaces  can  overlap  —  the  2  1/2  D  hypoth¬ 
esis  required  by  simple  interpolation  schemes  is  not 
necessarily  valid, 

•  the  sensors  and  algorithms  make  mistakes  that  must 
be  properly  dealt  with. 

In this paper, we address the problem of determining
surfaces from a set of points in space, where the
points are assumed to be nonregular samples from underlying
imaged surfaces. Most existing approaches to this
problem, such as deformable superquadrics [Terzopoulos
and Metaxas, 1990], modal representations [Horowitz
and Pentland, 1991] or 3D deformable surfaces [Cohen et
al., 1991], assume that all data points belong to a single
object to which a model can be fit. In other words, they
assume that the data is already segmented, even though
it is well known that segmentation is hard when the data
is noisy and originates from multiple objects in a scene.

*Support for this research was partially provided by ESPRIT
P2602 (VOILA) and ESPRIT BRA 3001 (INSIGHT)
and a Defense Advanced Research Projects Agency contract.

To  overcome  this  problem,  we  propose  using  local 
3D  surfaces  to  smooth  the  data  points  which  are  then 
grouped into global surfaces. We proceed by fitting
a  local  quadric  patch  to  a  small  neighborhood  of  each 
3D  point  and  using  the  estimated  surfaces  to  iteratively 
smooth  the  raw  data.  We  then  use  these  local  surfaces  to 
define  binary  relationships  between  points:  points  whose 
local  surfaces  are  consistent  are  considered  as  related, 
i.e.,  sampled  from  the  same  underlying  surface.  Given 
this  relation,  we  can  impose  a  graph  structure  upon  our 
data  and  define  the  surfaces  we  are  looking  for  as  sets 
of  points  forming  connected  components  of  the  graph. 
The  surfaces  can  then  be  interpolated  using  simple  tech¬ 
niques  such  as  Delaunay  triangulation.  In  effect,  we  are 
both  segmenting  the  data  set  and  reconstructing  the  3D 
surfaces. 

In  the  next  section,  we  introduce  our  fitting  procedure 
(mathematical  details  are  left  to  App.  A)  and  describe 
our  implementation  for  real  3D  data.  We  then  demon¬ 
strate  our  technique  using  stereo  depth  images  corre¬ 
sponding  to  complex  indoor  and  outdoor  scenes.  Our 
3D  data  is  produced  by  a  stereo  algorithm  that  has  been 
developed  in  previous  work  [Fua,  1991a;  Fua,  1991b], 
and  is  briefly  described  in  App.  B.  Note,  however,  that 
closely related methods have also been applied to magnetic
resonance imagery [Sander and Zucker, 1990] and
laser rangefinder images [Ferrie et al., ].

2 From Points in Space to Global
Surfaces

Our  goal  is  to  determine  explicit  object  surfaces  from 
“clouds”  of  unstructured  3D  points.  In  this  section,  we 
describe  the  implementation  of  our  method  and  show 
how  it  allows  us  to  go  all  the  way  from  raw  data  sets  to 
triangulated maps. The procedure consists of the four
steps described in the following subsections.

1.  Iterative  smoothing  of  the  points  by  local  surface 
fitting. 

2.  Resampling  and  merging  of  the  points  on  a  regular 
3D  grid. 

3.  Computation  of  an  adjacency  graph  and  clustering 
of  connected  components  into  surfaces. 

4.  Triangulation  of  the  clusters. 

We take the local surfaces to be quadrics because they
allow curvature computation while having few enough
parameters to allow for a reasonably stable fitting process.

Our  method  is  inherently  local  and  parallel;  the  pro¬ 
cedures  have  been  designed  with  a  SIMD  architecture 
in  mind  and  have  been  implemented  on  a  Connection 
Machine.

2.1  Local  Surfaces 

We  fit  local  surfaces  by  iteratively  fitting  quadric  patches 
around  every  data  point  and  then  using  them  to  move 
the  points  themselves,  as  shown  in  Figure  1. 

For  our  algorithm  to  be  effective  with  real  data,  it 
must  be: 

•  Orientation  independent:  we  do  not  assume  any 
particular  orientation  of  the  surfaces,  and  the  result 
should  not  be  affected  by  rotations  of  the  objects  of 
interest. 

•  Insensitive to outliers: in the neighborhood of any
point, the possibility of finding points that are in
gross error or belong to more than one surface always
exists and must be addressed, since least-squares
techniques are notoriously sensitive to such problems.

Orientation. To achieve orientation independence
around a point $P_0 = (x_0, y_0, z_0)$, we use all the points
in a spherical neighborhood and estimate the orientation
of the local tangent plane. To initialize, we simply fit a
plane; thereafter we use the quadric fit of the previous
iteration. Given this orientation, we define a reference
frame whose origin is $P_0$ itself and whose z axis is perpendicular
to the plane; we fit a quadric of the form

$$z = \mathrm{quad}(x, y) = ax^2 + bxy + cy^2 + dx + ey + f , \qquad (1)$$

by minimizing a least squares criterion $\epsilon$,

$$\epsilon = \sum_i w_i \, (z_i - \mathrm{quad}(x_i, y_i))^2 , \qquad (2)$$

where the $(x_i, y_i, z_i)_{1 \le i \le n}$ are the n neighbors of $P_0$ and
the $w_i$ are associated weights. We then transform $P_0$ as
follows:

$$P_0 \rightarrow (0, 0, f = \mathrm{quad}(0, 0)) \qquad (3)$$

expressed in the local reference frame.

In effect, we are approximating the 2nd order Taylor
expansion of the surface around $P_0$. Since we iterate
the estimates, this procedure is a form of relaxation (see
Appendix A) where the amount of smoothing increases
with the number of iterations and the size of the neighborhoods.
However, because we use quadric patches as
opposed to planar ones, the procedure preserves curvature
and does not smooth out relevant features. For typical
stereo data sets (≈ 100,000 points), the algorithm
converges within five to ten iterations.
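
As an illustration of Equations 1 to 3, the following minimal Python sketch fits the quadric by weighted least squares and moves the central point onto it. It assumes the neighbors have already been expressed in the local reference frame (origin at the point, z axis along the estimated normal); the function names and the use of normal equations with a small Gaussian elimination routine are choices made here for illustration, not the Connection Machine implementation described in the text.

```python
def solve_linear(a, b):
    """Solve the square system a x = b by Gaussian elimination with pivoting."""
    n = len(b)
    m = [list(row) + [rhs] for row, rhs in zip(a, b)]  # augmented matrix
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            factor = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= factor * m[col][c]
    x = [0.0] * n
    for r in range(n - 1, -1, -1):
        tail = sum(m[r][c] * x[c] for c in range(r + 1, n))
        x[r] = (m[r][n] - tail) / m[r][r]
    return x


def fit_quadric(neighbors, weights):
    """Weighted least-squares fit of z = a x^2 + b xy + c y^2 + d x + e y + f.

    `neighbors` is a list of (x, y, z) tuples in the local frame and
    `weights` the matching w_i of Equation 2. Returns [a, b, c, d, e, f].
    """
    ata = [[0.0] * 6 for _ in range(6)]
    atz = [0.0] * 6
    for (x, y, z), w in zip(neighbors, weights):
        row = [x * x, x * y, y * y, x, y, 1.0]
        for i in range(6):
            atz[i] += w * row[i] * z
            for j in range(6):
                ata[i][j] += w * row[i] * row[j]
    return solve_linear(ata, atz)


def smooth_point(neighbors, weights):
    """Equation 3: the central point moves to (0, 0, f) in the local frame."""
    f = fit_quadric(neighbors, weights)[5]
    return (0.0, 0.0, f)
```

At least six neighbors in general position are needed for the system to be solvable; a production version would also transform the result back into the absolute frame.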

Outliers. To deal with outliers, we define a metric
$d_{quad}$ that measures whether or not two points appear
to belong to the same surface. We take $d_{quad}$ to be

$$d_{quad}(P_1, P_2, \mathrm{quad}_1, \mathrm{quad}_2) = \max(dist_1, dist_2) \qquad (4)$$

where

$dist_1 = |z_1 - \mathrm{quad}_2(x_1, y_1)|$, expressed in the reference frame of $\mathrm{quad}_2$,
$dist_2 = |z_2 - \mathrm{quad}_1(x_2, y_2)|$, expressed in the reference frame of $\mathrm{quad}_1$.

We depict $d_{quad}$ graphically in Figure 2. It is zero when
the two points belong to the same local surface and increases
when their respective local surfaces become inconsistent.
It can therefore be used to discount outliers
by computing the weighting factor $w_i$ of Equation 2 at
iteration t to be:

$$w_i^t = \frac{1}{1 + (dist_i / \sigma)^2} \qquad (5)$$

$$dist_i = d_{quad}(P_0, P_i, \mathrm{quad}_0^{t-1}, \mathrm{quad}_i^{t-1})$$

where the $(\mathrm{quad}_i^{t-1})_{0 \le i \le n}$ are the quadrics that had been
computed at the previous iteration and $\sigma$ is an estimate
of the variance of the process generating the data points.
Note that for processes such as stereo compilation or
laser range finding, $\sigma$ can actually be estimated.
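
The metric of Equation 4 and the weights of Equation 5 can be sketched in a few lines of Python. One simplification is assumed here and is not in the text: both quadrics are taken to be expressed in a single shared frame, whereas the paper evaluates each distance in the reference frame of the other point's quadric. The function names are hypothetical.

```python
def quad_value(coeffs, x, y):
    """Evaluate z = a x^2 + b xy + c y^2 + d x + e y + f."""
    a, b, c, d, e, f = coeffs
    return a * x * x + b * x * y + c * y * y + d * x + e * y + f


def d_quad(p1, quad1, p2, quad2):
    """Equation 4, under the shared-frame assumption stated above.

    p1 and p2 are (x, y, z) tuples; quad1 and quad2 are the coefficient
    lists of their local surfaces.
    """
    dist1 = abs(p1[2] - quad_value(quad2, p1[0], p1[1]))
    dist2 = abs(p2[2] - quad_value(quad1, p2[0], p2[1]))
    return max(dist1, dist2)


def weight(dist, sigma):
    """Equation 5: 1 for consistent points, decaying as dist grows past sigma."""
    return 1.0 / (1.0 + (dist / sigma) ** 2)
```

A neighbor lying exactly on the central point's surface keeps full weight, one at distance `sigma` is halved, and gross outliers contribute almost nothing to the next fit.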

In  this  manner,  as  the  algorithm  progresses,  the  points 
that  are  on  the  same  surface  as  Pq  gain  influence  while 
the  others  are  increasingly  discounted.  The  discrimina¬ 
tory  power  of  the  algorithm  is  substantially  increased  by 
fitting  the  quadrics  several  times  at  every  iteration  and 
updating  the  weights  without  moving  the  data  points. 
We  illustrate  this  behaviour  in  Figure  3.  The  two  noisy 
hemispheres  are  smoothed  without  being  merged  and  the 
points  between  them  are  left  as  outliers;  such  a  result 
would  be  difficult  to  achieve  with  a  simple  2  1/2D  inter¬ 
polation  scheme. 

Figure 1: Local surface fitting: the neighbors of the circled point are used to fit a quadric surface onto which
the point is then projected.

Figure 2: (a) The distance between two points is taken to be the maximum value of the distance of one
point to the local surface corresponding to the other, represented by the arrows. The dotted lines
represent the axes of the reference frames in which the computations are performed. (b) For this
metric, point B is “close” to A but C is not, even though their Euclidean distances are comparable.

Implementation. The main obstacle to performing
the relaxation smoothing is the very large amount of
data that has to be taken into account. A dense 512x512
depth map represents over 250,000 points and as many
local surfaces, and the problem obviously gets worse as
more and more views are being merged. One solution
is to merge the points as they are acquired [Szeliski,
1990]. In this work, however, we are interested more
in the behaviour of the smoothing algorithm than in the
possible ways of reducing the amount of computation.
We therefore start with all the data points and assume
that, in the z direction, at most a limited number of
surfaces can overlap. This is typically true for ground
level scenes if z is the vertical direction because only a
finite (and usually small) number of objects are stacked
on top of one another. Given this assumption, we can
create a cube-shaped data structure by quantizing the x
and y coordinate axes and stacking the 3D points with
the same quantized x, y values in columns ordered by increasing
z values. In this way, we can guarantee that all
neighbors can be found in a cubic neighborhood around
every point in the data structure; this allows efficient
4-connected (NEWS) access to the neighbors. We can also
reduce the z dimension of the cube data structure and
make efficient use of the Connection Machine processors.
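
The column data structure described above can be sketched as follows. This is a serial, dictionary-based illustration under assumed names (`build_columns`, `step_x`, `step_y`); the paper's version lives in a Connection Machine array with NEWS access, which this sketch does not attempt to reproduce.

```python
import math
from collections import defaultdict


def build_columns(points, step_x, step_y):
    """Group 3D points by quantized (x, y) cell and sort each column by z.

    Returns a dict mapping (ix, iy) grid indices to the list of points in
    that cell, ordered by increasing z.
    """
    columns = defaultdict(list)
    for x, y, z in points:
        key = (math.floor(x / step_x), math.floor(y / step_y))
        columns[key].append((x, y, z))
    for column in columns.values():
        column.sort(key=lambda p: p[2])
    return dict(columns)


def column_height(columns):
    """Largest number of stacked points, i.e. the z extent the cube needs."""
    return max((len(c) for c in columns.values()), default=0)
```

Neighbors of a point are then found by visiting the adjacent `(ix, iy)` keys, and `column_height` shows how the assumption of few overlapping surfaces keeps the z dimension of the structure small.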

2.2  Resampling 

When  the  smoothing  is  done,  as  shown  in  Figure  3(c), 
the  data  points  still  form  an  irregular  sampling  of  the 
underlying  surfaces  that  is  ill  suited  for  the  generation 
of  a  map.  However,  to  every  point  is  associated  a  local 
surface  defined  by  the  quad  function  of  equation  1.  In 
order  to  produce  meaningful  triangulations,  we  need  a 
more regularly spaced set of vertices. We therefore pick
spatial step sizes $\delta_x$, $\delta_y$ and $\delta_z$ along the X, Y and Z
axes of an absolute reference frame. For points whose local
surface patch has a normal within 45 degrees of the Z
axis, we use the following updating scheme

$$x \rightarrow x_0 = \delta_x \, \mathrm{mod}(x, \delta_x)$$

$$y \rightarrow y_0 = \delta_y \, \mathrm{mod}(y, \delta_y)$$

$$z \rightarrow \mathrm{quad}(x_0, y_0).$$

For points whose normals are closer to the X or Y axes,
we use the same method but permute the roles of x, y and
z. Furthermore, after updating, several points may have
two identical coordinates and, provided that their third
ones are close enough, be merged. If these points have
been seen in different views, we retain that information
for later use as will be discussed in Section 3.1. In this
manner, points are “resampled” on a regular 3D grid as
can be seen in Figure 3(d).

In effect we are replacing a large number of irregularly
spaced 3D points by a smaller set of regularly spaced
ones and their local surfaces: we are achieving both data
organization and compression. This turns out to be an
effective way to merge data-points coming from several
views or sensors.

2.3 Clustering

To cluster the isolated 3D-points into more global entities,
we again use the metric of Equation 4 to define a
“same surface” relationship $\mathcal{R}$ as follows:

$$pt_1 \, \mathcal{R} \, pt_2 \iff \begin{cases} d_{quad}(pt_1, pt_2) < max_q \\ d_{eucl}(pt_1, pt_2) < max_d \end{cases} \qquad (6)$$

where $d_{eucl}$ is the Euclidean distance and $max_d$ and $max_q$
are two thresholds. In other words, two points are assumed
to belong to the same surface if their local fits are
consistent with one another.

Figure 3: (a) Two superposed and noisy hemispheres. (b) The top of the closest parts of the spheres. (c)
Smoothed spheres after several iterations. (d) Resampled points. (e) (f) Two possible segmentations
of the data points for two different values of the $max_q$ parameter of Equation 6. In both cases
the clusters of points have been triangulated using a 2D Delaunay triangulation in the horizontal
plane and in (f) one of them has been shaded.


The data set equipped with the relationship $\mathcal{R}$ can now
be viewed as a graph whose connected components are
the surfaces we are looking for. In practice, there may
be erroneous points in the original range data, resulting
in situations like the one shown in Figure 4, where legitimate
clusters are weakly linked. In such cases, we
have found that removing all points that do not have a
minimum number of neighbors allows us to throw away
the gross errors and generate meaningful clusters. In the
case of the two hemispheres of Figure 3, depending on
the value of the $max_q$ threshold of Equation 6, the algorithm
either finds the two obvious clusters or separates
the hemispheres themselves from their flat bases because
of the sudden change of orientation at the junctions.
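
The clustering step (build the graph of Equation 6, discard weakly supported points, extract connected components) can be sketched as below. The predicate `related` stands for the "same surface" test and is supplied by the caller; the name `cluster_points`, the `min_neighbors` parameter and the brute-force pairwise neighbor search are assumptions made for this illustration, since the text does not give a threshold value or a search strategy.

```python
from collections import deque


def cluster_points(n_points, related, min_neighbors=1):
    """Connected components of the "same surface" graph.

    `related(i, j)` returns True when points i and j satisfy Equation 6.
    Points with fewer than `min_neighbors` related points are dropped
    before the components are extracted. Returns sorted index lists.
    """
    neighbors = {
        i: [j for j in range(n_points) if j != i and related(i, j)]
        for i in range(n_points)
    }
    kept = {i for i, nbrs in neighbors.items() if len(nbrs) >= min_neighbors}
    clusters, seen = [], set()
    for start in sorted(kept):
        if start in seen:
            continue
        seen.add(start)
        queue, component = deque([start]), []
        while queue:  # breadth-first traversal of one component
            i = queue.popleft()
            component.append(i)
            for j in neighbors[i]:
                if j in kept and j not in seen:
                    seen.add(j)
                    queue.append(j)
        clusters.append(sorted(component))
    return clusters
```

In the situation of Figure 4, a stray point related to one point of each surface has few neighbors; raising `min_neighbors` removes it and the two surfaces come out as separate clusters.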


2.4  Triangulating 

The  clusters  we  have  generated  so  far  are  collections  of 
points  and  their  associated  local  surfaces.  For  many 
applications,  such  as  robotics  or  graphics,  it  is  impor¬ 
tant  to  be  able  to  unambiguously  interpolate  the  sur¬ 
faces.  Delaunay  triangulation  is  an  excellent  way  of  do¬ 
ing  this  and,  furthermore,  lets  us  compute  shaded  mod¬ 
els  of  our  data  sets.  Alternatively,  because  our  data 
points are now grouped into sets of points that belong
to the same global surface, we could use a number
of the other published methods [Cohen et al., 1991;
Horowitz and Pentland, 1991; Terzopoulos and Metaxas,
1990].

For  applications  in  which  the  global  surfaces  can  be 
projected  onto  a  plane  without  losing  their  topology,  we 
do  so  and  use  2D  Delaunay  triangulation  in  that  plane. 
This  is  typically  the  case  for  mobile  robotics  when  one 
needs  to  model  the  world  as  a  ground  level  surface  plus 
obstacles.  If  this  ground  level  surface  has  no  overhangs, 
it can be projected onto the horizontal plane. The triangulations
of Figure 3(e) and (f) have been computed
in this manner.

Figure 4: Two legitimate surfaces can be weakly linked by a few outliers.

For  applications  where  the  surfaces  can  have  any  ori¬ 
entation,  we  perform  a  full  3D  Delaunay  triangulation 
that  yields  the  convex  hull  of  the  points  and  use  the 
faces  of  the  tetrahedrons  as  our  surface  primitives. 

In  both  cases  we  retain  only  the  triangles  whose  orien¬ 
tation  is  consistent  with  the  local  surface  patches  of  the 
vertices.  In  the  current  implementation,  we  test  only 
the  consistency  of  the  surface  normals  with  the  triangle 
orientation, as shown in Figure 5.
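
A minimal Python sketch of this consistency test is given below. The cosine threshold `min_cos` and the use of the absolute value of the dot product (so that the winding order of the triangle does not matter) are assumptions introduced here; the text only states that the vertex normals must be consistent with the triangle orientation.

```python
import math


def sub(u, v):
    return tuple(a - b for a, b in zip(u, v))


def dot(u, v):
    return sum(a * b for a, b in zip(u, v))


def cross(u, v):
    return (u[1] * v[2] - u[2] * v[1],
            u[2] * v[0] - u[0] * v[2],
            u[0] * v[1] - u[1] * v[0])


def triangle_is_consistent(vertices, normals, min_cos=0.5):
    """Keep a triangle only if every vertex normal agrees with its plane.

    `vertices` are three (x, y, z) points, `normals` the local surface
    normals estimated at those points (not necessarily unit length).
    """
    v0, v1, v2 = vertices
    n = cross(sub(v1, v0), sub(v2, v0))
    n_len = math.sqrt(dot(n, n))
    if n_len == 0.0:
        return False  # degenerate (zero-area) triangle
    for vn in normals:
        vn_len = math.sqrt(dot(vn, vn))
        if vn_len == 0.0:
            return False
        if abs(dot(n, vn)) / (n_len * vn_len) < min_cos:
            return False
    return True
```

Applied to every face produced by the 2D or 3D Delaunay triangulation, such a filter removes faces that bridge unrelated surfaces, since their normals disagree with the local patches at the vertices.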

When  the  3D  data  is  sufficiently  clean,  as  will  be 
shown  in  the  results  section,  this  heuristic  is  sufficient 
to  remove  the  spurious  triangles  and  “hollow  out”  the 
triangulation.  However,  in  more  difficult  cases,  more 
powerful  constraints  should  be  brought  to  bear: 

•  Visibility  constraint:  If  a  point  is  visible  in  a  given 
view  but  appears  to  be  behind  one  of  the  trian¬ 
gles,  one  of  the  two  must  be  spurious.  Since  the 
resampling  operation  described  in  section  2.2  pro¬ 
vides  such  information,  it  ought  to  be  used. 

•  Supporting evidence: Given an estimate of the variance
of the process generating the points, the $\sigma$ of
Equation 5, one can estimate how much data supports
the existence of each triangle.

These extensions will be the subject of future work.

3  Reconstructing  3D  Surfaces  from 
Stereo  Data 

In  this  section,  we  show  how  our  technique  can  be  used 
to  reconstruct  3D  surfaces  using  stereo  depth-maps.  We 
first  reconstruct  natural  scenes  by  merging  several  maps 
and  then  demonstrate  the  potential  of  our  algorithm  on  a 
complex 3D scene containing stacked-up man-made objects.

3.1 Merging Results Computed from Several
Viewpoints

The stereo algorithm we use [Fua, 1991a; Fua, 1991b]
produces semi-dense maps that are 2 1/2D representations
of the world. They are necessarily incomplete and,
in  particular,  cannot  account  for  the  occluded  parts  of 
the  scene,  such  as  the  back  of  the  rocks  in  Figure  6(a). 


To reconstruct the scene more completely, we have
used 5 sets of stereo-pairs corresponding to camera positions
between those of Figures 6(a) and 6(c). After having
registered the stereo results with one another [Zhang and
Faugeras, 1990], we can use our technique to generate
clusters of triangulated 3D points. The largest one, depicted
in Figure 7, accounts for all the large rocks. The
erroneous data points have been discarded as outliers.

In  Figure  8,  we  use  two  stereo  pairs  to  reconstruct  the 
ground  surface.  Note  that  the  raw  points  correspond¬ 
ing  to  the  tree  and  the  background  wall,  as  well  as  the 
erroneous  matches,  have  been  segmented  out. 

In  the  two  examples  above,  the  depth  maps  were  quite 
dense  and  the  final  surface  of  interest  could  be  projected 
onto  the  horizontal  plane  to  compute  a  2D  Delaunay 
triangulation.  In  the  next  subsection,  we  show  a  more 
difficult  case  for  which  none  of  these  conditions  holds. 

3.2  Reconstructing  a  Complex  3D  Scene 

In  Figure  9,  we  show  three  objects  that  have  been 
stacked  on  a  turntable.  By  rotating  the  table,  we 
have  produced  12  stereo  triplets  and  their  correspond¬ 
ing  depth  maps  such  as  the  one  of  Figure  9  (c).  Be¬ 
cause  the  objects  are  not  very  textured,  the  individual 
maps  are  neither  very  dense  nor  very  precise;  however 
by  merging  them  using  our  smoothing/resampling  pro¬ 
cedure  we  can  generate  the  set  of  points  shown  in  Figure 
9(d)  and  (e)  that  correctly  captures  the  geometry  of  the 
main  structures.  Unfortunately,  there  also  are  a  few  er¬ 
roneous  points  that  seem  to  float  in  space  that  must  be 
removed.  The  shaded  models  of  Figure  10  have  been  ob¬ 
tained  by  computing  a  3D  Delaunay  triangulation  of  the 
points  and  retaining  only  the  faces  whose  orientation  is 
consistent with the local surfaces, as depicted in Figure
5. The surfaces corresponding to the three objects are
clearly separated but still exhibit a few “holes” due to
the lack of stereo data.

4  Conclusion 

In this paper we have presented a method for estimating
explicit surfaces from given data consisting of a nonregular
sampling of the surfaces. The data was assumed
to be in the form of a general “cloud of points” with no particular
structure, and we made no a priori assumptions
about the form of the underlying surfaces, other than
smoothness. We have shown that we can effectively reconstruct
global surfaces under such difficult conditions.
However, when additional assumptions or knowledge can
be brought to bear, one should obviously do so and our
technique can then be used as a source of information
about the local surface geometry.

Figure 5: (a) A valid triangle; the normals at the vertices are consistent with the triangle orientation. (b)
An invalid triangle.

The  algorithm  consists  of  four  sequential  steps: 
smoothing  of  the  points  by  iterative  local  surface  fit¬ 
ting;  resampling  the  smoothed  points  onto  a  regular  grid; 
computation  of  an  adjacency  graph  of  the  points  with 
clustering  of  the  connected  components;  and  triangu¬ 
lation  of  the  clusters.  The  algorithm  is  well-suited  to 
parallel  implementation  and  the  examples  shown  were 
produced  on  the  Connection  Machine. 

The  experiments  shown  in  this  paper  used  depth  data 
acquired  from  stereopsis  by  a  mobile  robot,  although  we 
have  successfully  tested  much  of  the  same  code  on  3D 
biomedical scanner images as well. It must be noted
that when the quality of the data degrades, the critical
step becomes the grouping one. In extreme cases,
the heuristics described in this paper are too simple and
may fail. To push the method forward, it will be necessary
to develop more sophisticated ones. In fact, generating
the triangulations of Section 2.4 could be recast as
the problem of finding the best description of the data in
terms of a set of triangles knowing the normals and curvatures
at every vertex. This problem can be handled
within the framework provided by Minimum Description
Length encoding [Rissanen, 1987; Leclerc, 1989;
Fua and Hanson, 1989]. Such a framework should provide
us with a sound theoretical basis for future work.

Appendices 

A  The  mathematics  of  recursive 
fit-and-update 

In  this  appendix,  we  give  some  of  the  details  of  the 
method  of  fitting  local  surfaces.  A  comprehensive  analy¬ 
sis  of  the  mathematical  properties  of  the  algorithm  is  not 
given,  but  will  appear  in  a  future  publication.  Also,  to 
simplify  the  presentation,  we  reduce  the  problem  by  one 
dimension  and  consider  fitting  local  quadratic  curves  to 
2D  data  points. 


Given m initial data points $\{(x_i, y_i^0),\ i = 1, \ldots, m\}$,
we assume that the iterations k take place on a
fixed regular grid so that $x_i^0 = x_i^1 = \cdots = x_i^k = x_i$.
We further assume that the fits are
local and over the same size neighborhoods of n
points so that the neighborhood of $(x_i, y_i^k)$ consists
of $\{(x_{-\frac{n-1}{2}}, y^k_{i-\frac{n-1}{2}}), \ldots, (0, y_i^k), \ldots, (x_{\frac{n-1}{2}}, y^k_{i+\frac{n-1}{2}})\}$. The
initial fits to the data points determine coefficients $C_i^0 = (a_i^0, b_i^0, c_i^0)$
of the “best-fitting” quadratic

$$y_i^0(t) = a_i^0 t^2 + b_i^0 t + c_i^0$$

at $(x_i, y_i^0)$.

We now look at the relationship between successive
iterations k and k − 1. The quadratic fit at $(x_i, y_i^k)$ is
the least-squares solution of the overdetermined system
$A_i^k C_i^k = P_i^k$, and for simplicity we consider here that it is
given by

$$C_i^k = [A_i^{kT} A_i^k]^{-1} A_i^{kT} P_i^k .$$

Note that $A_i^k$ is composed only of powers of the
$x_j$ and is hence independent of k, and since the $\{x_j\}$ in
each neighborhood are the same, it is independent of i
as well and we write just A. Thus,

$$\begin{pmatrix} a_i^k \\ b_i^k \\ c_i^k \end{pmatrix} = (A^T A)^{-1} A^T \begin{pmatrix} y^k_{i-\frac{n-1}{2}} \\ \vdots \\ y^k_{i+\frac{n-1}{2}} \end{pmatrix}$$

where $(a_i^k, b_i^k, c_i^k) = C_i^k$. Now, the data point $y_i^k$ is updated
to $y_i^{k+1} = c_i^k$, and introducing this updating into
the above equation gives, at the next iteration,


\ 

( ytV=i  \ 

1  =  (A'rA)-‘A'^ 

3 

/ 

^  *^*4^  j 

\ 
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(d)  (e)  (f) 

Figure  6:  (a)  (b)  A  stereo  pair  showing  a  set  of  rocks,  (c)  Another  image  taken  from  a  completely  different 
viewpoint,  (d)  The  stereo  map  derived  by  matching  (a)  and  (b).  The  black  areas  correspond  to 
textureless  areas  for  which  no  depth  was  computed.  Elsewhere,  the  lighter  colors  correspond  to  the 
greater  distances  (e)  Wireframe  representation,  (f)  Shaded  representation.  Note  that  the  parts  of 
the  rocks  visible  in  both  (a)  and  (b)  are  correctly  reconstructed,  but  that  the  others  are  not. 

Figure 7: (a) Triangulated ground surface for the rocks of Figure 6. (b) (c) Shaded views. Note that the backs of the foreground rocks of Figure 6(a) are now clearly visible.

Now, when only the third component of $C_i^k$ is considered, we get

$$c_i^k = L\,P_i^k,$$

where L, the third row of the matrix $B = (A^T A)^{-1} A^T$, is a 1 × n matrix. Now, we see that considering all the data points at iteration k + 1, we get

$$\begin{pmatrix} c_1^{k+1} \\ \vdots \\ c_m^{k+1} \end{pmatrix} = \begin{pmatrix} L & & \\ & \ddots & \\ & & L \end{pmatrix} \begin{pmatrix} P_1^{k+1} \\ \vdots \\ P_m^{k+1} \end{pmatrix},$$

where each L matrix occupies a 1 × n block of the large matrix and $P_i^{k+1}$ holds the updated neighborhood values $y_j^{k+1} = c_j^k$. By shifting the L matrices appropriately, the right-hand side can be written more economically as the product of an m × m matrix and an m-vector,

$$\begin{pmatrix} c_1^{k+1} \\ \vdots \\ c_m^{k+1} \end{pmatrix} = M \begin{pmatrix} c_1^{k} \\ \vdots \\ c_m^{k} \end{pmatrix},$$

where M is an m × m banded matrix of width n. Thus we see that with a bit of work the results of the quadratic fit at iteration k can be derived from the results of the initial fit, in turn obtained from the given data observations.
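The reduction to a banded matrix can be checked numerically. The following is a minimal sketch, not the authors' implementation: it assumes a neighborhood of n = 2h + 1 uniformly spaced points with local abscissae t = -h, ..., h, and it leaves the h points at each end of the data unchanged (a hypothetical boundary rule, since the text does not state one). All function names are illustrative.

```python
def center_weights(h):
    """Row L: weights giving the fitted quadratic's value at t = 0.

    Assumes abscissae t = -h..h (an assumption of this sketch). Then
    sum(t) = sum(t^3) = 0 and the normal equations give
    c = sum_j L_j * y_j with L_j = (S4 - S2 * t_j^2) / (S0 * S4 - S2^2).
    """
    ts = range(-h, h + 1)
    s0 = float(len(ts))
    s2 = float(sum(t ** 2 for t in ts))
    s4 = float(sum(t ** 4 for t in ts))
    det = s0 * s4 - s2 * s2
    return [(s4 - s2 * t * t) / det for t in ts]


def smooth_once(y, h):
    """One iteration, point by point: replace y_i by the local fit value c_i."""
    L = center_weights(h)
    out = list(y)
    for i in range(h, len(y) - h):
        out[i] = sum(w * y[i - h + j] for j, w in enumerate(L))
    return out


def banded_matrix(m, h):
    """m x m matrix M with L shifted along the diagonal (band width 2h + 1).

    Boundary rows are identity rows, matching smooth_once (hypothetical choice).
    """
    L = center_weights(h)
    M = [[0.0] * m for _ in range(m)]
    for i in range(m):
        if i < h or i >= m - h:
            M[i][i] = 1.0
        else:
            for j, w in enumerate(L):
                M[i][i - h + j] = w
    return M


def matvec(M, y):
    return [sum(a * b for a, b in zip(row, y)) for row in M]


y0 = [0.0, 1.0, 0.5, 2.0, 1.5, 3.0, 2.0, 4.0, 3.5, 5.0]
h, k = 2, 3

direct = list(y0)
for _ in range(k):
    direct = smooth_once(direct, h)

M = banded_matrix(len(y0), h)
via_matrix = list(y0)
for _ in range(k):
    via_matrix = matvec(M, via_matrix)   # y^k = M^k y^0
```

Both routes produce the same values, which is the point of the derivation: k iterations of local fitting amount to applying a fixed banded matrix k times to the initial data.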

B  Area  Based  Stereography 

In  this  section,  we  briefly  describe  the  correlation-based 
stereo  algorithm  used  to  produce  the  3D  data  we  need. 
For  a  more  complete  description  we  refer  the  interested 
reader  to  previous  publications  [Fua,  1991a;  Fua,  1991b]. 

Most correlation-based algorithms attempt to find points of interest on which to perform the correlation. This approach is justified when only limited computing resources are available, but with modern hardware architectures and massively parallel computers it becomes possible to perform the correlation over all image points and retain only matches that appear to be “valid.”

To generate a dense and accurate depth map, one must then interpolate these measures in such a way as to preserve depth discontinuities. To do so, we model the world as made of smooth surfaces separated by depth discontinuities that generate changes in grey level intensities due to changes in orientation and surface material.

Given a pair of images and corresponding camera models, the computation of the depth map consists of four steps.

1. Rectification. The images are reprojected onto the same image plane so that all epipolar lines become parallel. This makes the parallel implementation of the correlation algorithm much simpler because the exact same operations are performed at every pixel.

2. Matching. Correlation scores are computed by comparing a fixed window in the first image to a shifting window in the second. The second window is moved in the second image by integer increments and an array of correlation scores is generated for integer disparity values. To compute the disparity with subpixel accuracy, we fit a second degree curve to the correlation scores in the neighborhood of the maximum and compute the optimal disparity by interpolation.

As shown by Nishihara [Nishihara and Poggio, 1983], the probability of a mismatch goes down as the size of the window and the amount of texture increase. However, using large windows leads to a loss of accuracy and the possible loss of important image features. We choose to consider as acceptable only those results for which we get the same result by reversing the roles of the two images, thereby greatly reducing the probability of error even when using very small windows (down to 3x3 for textured outdoor scenes). In fact the density of such consistent matches in a given area of the image appears to be an excellent indicator of the quality of the stereo matching. An occasional “false positive” (a pixel for which the same erroneous disparity is measured when matching both from left to right and right to left) may occur, but we have never encountered a situation that gave rise to a large clump of such errors.

Figure 8: (a) (b) The two left images of a pair of stereo views. (c) The depth map computed using the first one. (d) (e) Two shaded views. (f) A texture mapped view.


3. Merging across Hierarchy. We perform the matching at several levels of resolution with identical window sizes, which is conceptually equivalent to matching at one level with windows of different sizes [Kanade and Okutomi, 1990] but computationally more efficient. We then pick the disparity computed at the highest level of resolution for which an acceptable disparity with respect to our consistency test can be found.

This is a departure from traditional hierarchical implementations that make use of the results generated at low resolution to guide the search at higher resolutions. While these are good methods for reducing computation time, they assume that the results generated at low resolution are more reliable, even if less precise, than those generated at high resolution. This is a questionable assumption, especially in the presence of occlusions.

4. Interpolation. Finally, the dense depth map w is computed by fitting a piecewise smooth surface to the correlation-based depths $w_0$. Assuming that depth discontinuities are more likely to occur where the image intensity gradient is large, we use a conjugate gradient method [Szeliski, 1990; Terzopoulos, 1986] to minimize the following criterion:

$$C = \iint \left[ c\,(w - w_0)^2 + \lambda_x \left(\frac{\partial w}{\partial x}\right)^2 + \lambda_y \left(\frac{\partial w}{\partial y}\right)^2 \right] dx\,dy,$$

where c is zero if the correlation has failed and proportional to the normalized correlation score otherwise, and $\lambda_x$, $\lambda_y$ are inversely proportional to the image gradients in the x and y directions. This interpolation scheme produces dense depth maps that preserve depth discontinuities without having to explicitly specify their location.
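Two pieces of the matching step lend themselves to a short illustration: the second-degree fit around the best integer disparity, and the test that keeps a match only if reversing the roles of the two images gives the same result. The sketch below works on a single scanline and is not the authors' implementation; the function names, the use of `None` for failed matches, and the sign convention (a left pixel x with disparity d matches right pixel x - d) are assumptions made for the example.

```python
def subpixel_peak(scores):
    """Interpolated disparity from integer-disparity correlation scores.

    scores[d] is the score at integer disparity d. A parabola is fitted
    through the best score and its two neighbors; returns None when the
    best score sits at either end of the array.
    """
    d = max(range(len(scores)), key=lambda i: scores[i])
    if d == 0 or d == len(scores) - 1:
        return None
    s_m, s_0, s_p = scores[d - 1], scores[d], scores[d + 1]
    denom = s_m - 2.0 * s_0 + s_p
    if denom == 0.0:
        return float(d)
    # Vertex of the parabola through (-1, s_m), (0, s_0), (1, s_p).
    return d + 0.5 * (s_m - s_p) / denom


def consistent(disp_lr, disp_rl):
    """Keep left-to-right disparities confirmed by the right-to-left match.

    disp_lr[x]: integer disparity of left pixel x (match at right pixel x - d).
    disp_rl[x]: integer disparity of right pixel x.
    None marks a pixel without a match.
    """
    out = []
    for x, d in enumerate(disp_lr):
        ok = False
        if d is not None:
            xr = x - d
            if 0 <= xr < len(disp_rl):
                ok = disp_rl[xr] == d
        out.append(d if ok else None)
    return out
```

For example, scores sampled from a parabola peaking at 2.3 are interpolated back to 2.3, and a left-image disparity survives only where the right-image match at the corresponding pixel reports the same value.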

This algorithm uses simultaneously the grey level information present in a single image and the stereo information present in a pair. In fact, as suggested by Moravec [Moravec, 1981] and many others, more than two images can and should be used whenever practical.

However, the smoothing scheme described above operates in the image plane and makes the 2 1/2-D assumption of all such interpolation techniques. In particular, it cannot deal with occlusions and hidden faces. To recover the full 3D structure of objects and merge range data computed from several viewpoints we need to invoke the 3D smoothing procedure of Section 2.
Abstract

Inference of 3-D shape from 2-D contours in a single image is an important problem in machine vision. Often, techniques to solve this problem examine each surface in the scene separately, whereas our perception of their shapes clearly depends on the interplay between them as well. In this paper, we describe a technique that attempts to recover the shapes of all the surfaces of an object simultaneously, though it is limited to objects made of zero-Gaussian curvature surfaces. Our technique is based on an analysis of three kinds of symmetries defined in the paper and the constraints that derive from them, and from other boundaries. Results on some complex examples are shown.

1  Introduction 

One of the basic goals of mid-level vision is to recover the local orientations of the surfaces of the objects in a scene. Of the many cues available to aid in this process, we believe that the shape of the 2-D contour itself is the most important and robust one. Of course, inferring shape from contour is highly ambiguous and cannot be done without making some assumptions. The goal in shape from contour methods is to minimize the number of needed assumptions and to achieve results consistent with human perception. For shape from contour analysis, the only ground truth is really in human perception, for even if the given contour was obtained from a real object, it could also have been obtained from any number of other objects.

Early work on inferring 3-D structure from a 2-D shape was focused on analysis of line drawings of polyhedra [Huf71], [Clo71], [Mac73], [Kan81], [KK83]. In the 1980s, several techniques for non-polyhedral shapes were proposed (e.g. [BT81], [Wei88], [BY84], [Ste81], [XT87], [HB88], [PCM89], [Nal87]). One characteristic of most of these methods is that they examine only a single surface in the scene at a time, whereas our perception of a surface can be strongly influenced by our perception of the entire object.

In earlier papers, we have examined the recovery of 3-D surface shape of a variety of curved surfaces cut by planes ([UN90a], [UN90b], [UN91b]). Our techniques rely on observed symmetries in the image and the analysis depends on the interplay between constraints imposed by the curved surface and the planes cutting it. We showed successful shape recovery for the following kinds of surfaces: zero-Gaussian curvature surfaces [UN90a], surfaces of straight homogeneous generalized cylinders [UN90b] and surfaces of planar, right, constant cross-section generalized cylinders [UN91b].

Figure 1: (a) an object consisting of multiple planar and curved surfaces and (b) the front part of the object in isolation.

Complex objects, however, are composed of a number of curved surfaces and planar patches. Our perception of each of these surfaces is affected by the presence of the others. For example, consider the object in figure 1 (a), which appears to be a composite of two objects, one in front of the other. If the front object is viewed in isolation as in figure 1 (b), the interpretation for its top surface is ambiguous (it may be planar or not). However, in the context of the whole object (a), this ambiguity disappears (the said surface must be planar). This is an example of how a remote surface changes the perception of some other surface drastically. In general, even if the perception of a surface is not affected by the neighboring surfaces this drastically, its perception would still be affected by small amounts to make the whole object more consistent (surfaces obeying inter-surface constraints).

This paper explores 3-D surface inference by including interplay between many surfaces that may comprise a complex object. Our technique is limited to a combination of zero-Gaussian curvature (ZGC) surfaces and planar surfaces. However, our method uses constraints similar to those used in our earlier work on shape from contour for ZGCs [UN90a]. The previous work was limited to analysis of a ZGC surface cut by two parallel planes. In this paper, we first develop techniques that allow analysis of a ZGC surface cut by non-parallel planes and then show how multiple ZGC surface shapes can be recovered simultaneously while influencing each other.

To accomplish this, we have found it more convenient to change some of the representations used in our earlier work as well as to devise an additional form of symmetry.

In section 2 we define three kinds of symmetries and discuss the occurrence of these symmetries in the context of planar and ZGC surfaces. In section 3 we discuss the constraint equations that are used in the shape recovery. In section 4 the representation of the surfaces and single surface recovery is discussed. In section 5 combined shape recovery of multiple surfaces is discussed in detail. In section 6 we discuss an implementation of the shape recovery algorithms and present some results.

Our method assumes that clean, closed boundaries are given (or can be extracted from the real image). We do not address the issue of separating object boundaries from surface markings, or other perceptual grouping operations here, though we believe that constraints required for shape inference by our methods will aid in the perceptual organization process itself. Such research is currently being pursued in our laboratory separately.

We assume orthographic projection throughout the paper unless specifically mentioned otherwise (in a separate paper, we have shown how many of the constraints for orthographic projection can be transformed to the case of perspective projection [UN91a]).

2  Surfaces  and  Symmetries 

In this paper we concentrate on shape from contour for objects composed of planar and zero Gaussian curvature surfaces. A Zero Gaussian Curvature (ZGC) surface is one where the Gaussian curvature (the product of the maximum and minimum principal curvatures) of the surface is zero everywhere. Cylinders and cones are examples of ZGC surfaces. These surfaces are also called developable surfaces since they can be generated from a piece of paper by rolling and/or bending without cutting. We feel that ZGC surfaces comprise a large and useful class and that they represent a natural step up in complexity from the study of planar surfaces that have dominated previous work in the field. Lines of minimum curvature for a ZGC surface, also called rulings, are straight, i.e., it is possible to embed straight lines on a ZGC surface along these rulings.

2.1  Symmetries 

We define three types of symmetries, which we call parallel symmetry, line-convergent symmetry and skew symmetry. Parallel and line-convergent symmetries are mainly found in curved surfaces; skew symmetry is usually an indicator of planar surfaces. In our previous work [UN90a] we provided a detailed description of parallel and skew symmetries. Here, we discuss line-convergent symmetry in detail.

For curves to be symmetric, certain point-wise correspondences between two curves must exist. We will call the lines joining the corresponding points on the curves the lines of symmetry, the locus of the mid points of these lines the axis of symmetry, and the curves forming the symmetry the curves of symmetry.

2.1.1 Line-Convergent Symmetry

Two image curves $C_1$ and $C_2$ are line-convergent symmetric if the tangents of $C_1$ and $C_2$, at the corresponding points, intersect along a line, say l, on the image plane. This is shown in figure 2. Parallel symmetry may be thought of as a limiting case of line-convergent symmetry where the line of intersection is at infinity. We show later, in section 2.2, that this symmetry is found in curves obtained by cutting a ZGC surface with two non-parallel planes. A parallel symmetry also turns into line-convergent symmetry under perspective projection [UN91a]. It is also present for limbs of straight homogeneous generalized cylinders [UN90b].

Figure 2: Two line-convergent symmetric curves.

2.2  Symmetries  in  Surfaces 

The symmetries discussed in the previous sections are present in ZGC and planar surfaces. Symmetries also provide strong information about the type of the surface. In [UN90a] we showed that if a closed contour composed of non-limb edges has a skew symmetry, then the contour has to be planar under the assumption of general viewpoint and if the correspondence is static with respect to changing viewpoint. A ZGC surface cut by parallel planes produces parallel symmetry. Moreover, we showed that a figure bounded by one parallel symmetry and one skew symmetry with straight lines of symmetry must be a ZGC surface (assuming general viewpoint in both cases). Line-convergent symmetry is produced when a ZGC surface is intersected by non-parallel planes.

Theorem 1 Curves $C_1$ and $C_2$, obtained by cutting a ZGC surface S by two non-parallel planes $\Pi_1$ and $\Pi_2$, project as line-convergent symmetric curves such that the lines joining the corresponding points of the image curves are the projections of the rulings of S, and the line l formed on the image plane by joining the intersection points of the tangent lines of the line-convergent symmetric curves is the projection of the 3-D intersection line of the planes $\Pi_1$ and $\Pi_2$.

Proof The above theorem is visualized in figure 3. The key to the proof of this theorem is that the tangent plane of the ZGC surface S, plane T in figure 3, is the same along the rulings of S. Therefore, both the tangent lines, $t_1$ and $t_2$, of the curves $C_1$ and $C_2$ at points $P_1$ and $P_2$ are on plane T. Also the tangent line $t_1$ is on plane $\Pi_1$ and $t_2$ is on $\Pi_2$. Therefore the intersection of $t_1$ and $t_2$ is necessarily at the intersection point of the three planes $\Pi_1$, $\Pi_2$ and T. For other rulings the same argument repeats with a different T plane, and all the tangent line intersections take place along the line l, the intersection line of planes $\Pi_1$ and $\Pi_2$. Hence, on the image plane too the intersection of the tangents takes place on the projection of the line l.
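Theorem 1 can be checked numerically on a simple case. The sketch below assumes a unit cylinder with rulings along the z axis, cut by the planes z = 0 and z = 2 + a x, and viewed orthographically after a rotation by phi about the x axis; these choices and the function names are illustrative and do not come from the paper. The two planes meet along the 3-D line x = -2/a, z = 0, which projects to the image line x = -2/a, so every tangent intersection should have that image abscissa.

```python
import math


def project(v, phi):
    """Orthographic image of a 3-D point or direction after rotating by phi about x."""
    x, y, z = v
    return (x, y * math.cos(phi) - z * math.sin(phi))


def cross2(a, b):
    return a[0] * b[1] - a[1] * b[0]


def intersect(p1, d1, p2, d2):
    """Intersection of the 2-D lines p1 + s*d1 and p2 + u*d2."""
    r = (p2[0] - p1[0], p2[1] - p1[1])
    s = cross2(r, d2) / cross2(d1, d2)
    return (p1[0] + s * d1[0], p1[1] + s * d1[1])


def tangent_intersections(a=0.5, phi=0.7, thetas=(0.3, 0.9, 1.7, 2.5)):
    """Image intersections of the tangents at corresponding points (same ruling)."""
    points = []
    for th in thetas:
        c, s = math.cos(th), math.sin(th)
        P1, t1 = (c, s, 0.0), (-s, c, 0.0)              # curve on plane z = 0
        P2, t2 = (c, s, 2.0 + a * c), (-s, c, -a * s)   # curve on plane z = 2 + a x
        points.append(intersect(project(P1, phi), project(t1, phi),
                                project(P2, phi), project(t2, phi)))
    return points


pts = tangent_intersections()
```

With a = 0.5 the intersections all fall on the image line x = -4 at different heights, as the theorem predicts. Values of theta with sin(theta) = 0 are avoided because the two projected tangents are then parallel.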

Note that the converse of this theorem, that line-convergent symmetry curves must come from non-parallel planar cuts of ZGC surfaces, is not valid. However, we believe that it is reasonable to infer that line-convergent symmetry curves are planar cross sections of ZGC surfaces, if they are terminated by line segments (corresponding to the rulings).

Figure 3: Formation of the line-convergent symmetry with a ZGC surface and two non-parallel planes.

3  Constraints  on  Surface  Shape 

We will solve the recovery of shape from contour problem as a constrained minimization problem. The constraints discussed here are the building blocks of constraints and error terms of the minimization. The constraints are originally stated in gradient space. However, the gradient space is not uniform, i.e., a constant shift at the center of the gradient space corresponds to a larger vector difference in 3-D than the same shift somewhere farther away from the center. Therefore, the uniformity of the constraint function implies that the error returned by the function, when not satisfied exactly, depends on the 3-D vector differences rather than the differences in gradient space. A gradient (p, q) corresponds to a 3-D vector v = (p, q, 1). The projection of v on the unit sphere is given by:

$$v_s = \frac{(p, q, 1)}{\sqrt{p^2 + q^2 + 1}}$$

The vector $v_s$ is the normalized (i.e., $|v_s| = 1$) form of the vector v; it depends only on the orientation of v and carries no length information. The equation for $v_s$ shows that a constant shift in the parameters p and q of v has less significance, i.e., affects the components of $v_s$ less, as p and q get larger. The normalized constraint error functions are formulated to compensate for this behavior of gradient space. The drawback of this normalization is that linear constraint functions are no longer linear.
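The non-uniformity of gradient space is easy to see numerically. The following minimal sketch (function names are illustrative) maps a gradient (p, q) to the unit vector along (p, q, 1) and measures how far that unit vector moves when p is shifted by the same amount near the origin of gradient space and far from it.

```python
import math


def unit_vector(p, q):
    """Unit vector along (p, q, 1), the 3-D vector associated with gradient (p, q)."""
    n = math.sqrt(p * p + q * q + 1.0)
    return (p / n, q / n, 1.0 / n)


def shift_effect(p, q, dp=0.1):
    """Euclidean distance between the unit vectors of (p, q) and (p + dp, q)."""
    a = unit_vector(p, q)
    b = unit_vector(p + dp, q)
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


near = shift_effect(0.0, 0.0)   # shift applied at the center of gradient space
far = shift_effect(5.0, 0.0)    # same shift applied far from the center
```

Here `near` is more than an order of magnitude larger than `far`, which is why an error measured directly in gradient space would weight steep orientations too heavily.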


3.1  Shared  Boundary  Constraint  (SBC) 

This constraint relates the orientations of the two surfaces on opposite sides of an edge. The planar version of this constraint has been used since the early days of polyhedral scene analysis [Mac73]. Shafer et al. [SKK83] first extended it to the case of intersection of curved surfaces. In [UN90a] we also provide a detailed description and derivation of this constraint. Here we just give the normalized version of this constraint, which is a non-linear equation. The normalized shared boundary constraint is applied between the gradients $(p_1, q_1)$ and $(p_2, q_2)$ of two surfaces at the point of intersection, and a 2-D vector (x, y), which is the tangent of the curve formed by intersection of the surfaces. The constraint is:

$$SBC(p_1, q_1, p_2, q_2, x, y) = \frac{\left((p_2 - p_1)\,x + (q_2 - q_1)\,y\right)^2}{\left((p_2 - p_1)^2 + (q_2 - q_1)^2 + 1\right)\left(x^2 + y^2\right)}$$
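As a small illustration, the normalized shared boundary constraint can be written as a function returning an error that is zero when the constraint holds. This sketch assumes the constraint reads ((p2 - p1) x + (q2 - q1) y)^2 divided by ((p2 - p1)^2 + (q2 - q1)^2 + 1)(x^2 + y^2), as reconstructed from the damaged equation, and uses the argument order (p1, q1, p2, q2, x, y); the function name is illustrative.

```python
def sbc(p1, q1, p2, q2, x, y):
    """Normalized shared boundary error for two surface gradients and an
    image tangent (x, y). Zero when the gradient difference is
    perpendicular to the tangent."""
    dp, dq = p2 - p1, q2 - q1
    num = (dp * x + dq * y) ** 2
    den = (dp * dp + dq * dq + 1.0) * (x * x + y * y)
    return num / den
```

The error does not change when the tangent (x, y) is rescaled, so only the direction of the edge in the image matters.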


3.2  Orthogonality  Constraint 

Certain properties or symmetries invoke the assumption of orthogonality in 3-D. The assumption of orthogonality was first studied for skew symmetric planar contours by Kanade [Kan81]. We will assume orthogonality between the axis of parallel symmetry and the lines of parallel symmetry. For a ZGC surface, this is equivalent to slicing the surface along rulings to obtain thin skew symmetric planar strips and assuming that these strips are orthogonally symmetric in 3-D. In [Ulu91] we provide a detailed discussion of this constraint. Here we give the normalized version only. The normalized constraint O( ) is a function of a surface gradient (p, q) and two image directions $(x_1, y_1)$ and $(x_2, y_2)$ that are hypothesized to be orthogonal in 3-D.

i(Pi,9i.i)n(P2.92,i)p 


3.3  Equality  Constraint 

This constraint is applied when two gradients are hypothesized to be equal. Since the gradient space is not uniform, the Euclidean distance between the vectors $(p_1, q_1)$ and $(p_2, q_2)$ is not a normalized and uniform error measure. Therefore, the square of the sine of the angle between the 3-D unit vectors corresponding to the gradients $(p_1, q_1)$ and $(p_2, q_2)$ is used:

$$Eq(p_1, q_1, p_2, q_2) = 1 - \frac{\left((p_1, q_1, 1) \cdot (p_2, q_2, 1)\right)^2}{|(p_1, q_1, 1)|^2\,|(p_2, q_2, 1)|^2} \qquad (4)$$
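Equation (4) is one minus the squared cosine of the angle between the vectors (p1, q1, 1) and (p2, q2, 1), that is, the squared sine of that angle. A minimal sketch, with an illustrative function name:

```python
def equality_error(p1, q1, p2, q2):
    """Squared sine of the angle between (p1, q1, 1) and (p2, q2, 1)."""
    dot = p1 * p2 + q1 * q2 + 1.0
    n1 = p1 * p1 + q1 * q1 + 1.0   # squared length of (p1, q1, 1)
    n2 = p2 * p2 + q2 * q2 + 1.0   # squared length of (p2, q2, 1)
    return 1.0 - dot * dot / (n1 * n2)
```

Identical gradients give zero error, and gradients whose 3-D vectors are perpendicular, such as (1, 0) and (-1, 0), give the maximum error of 1. The same shift of 0.1 in p costs far less at p = 5 than at p = 0, which reflects the normalization the section describes.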


4 Recovering ZGCs Cut by Non-Parallel Planes

In order to be able to include the contributions of the constraints from each surface and the inter-surface constraints into a pool of constraints, an appropriate parameter representation for each surface is necessary. The parameterization we use is described below.


4.1 Parameterization of Surfaces

For planar surfaces the gradient space representation (p, q) of the surface normal of the plane is used. This is the natural and most versatile (for our purposes) representation for planar surfaces.

In the following we discuss the parameterization and computation of local surface normals for the case of ZGC surfaces cut by non-parallel planes. As discussed in section 2.2, a ZGC surface cut by parallel planes produces parallel symmetric curves and a ZGC surface cut by non-parallel planes produces line-convergent symmetric curves. The non-parallel cut case is the more general one.

In order to compute line-convergent symmetry between two curves on a general ZGC surface we must try all possible monotonic point correspondences between the curves. This is a very costly search. However, for the case of cylindrical and conic surfaces, the computation of line-convergent symmetry is much simpler. Most of the ZGC surfaces that we encounter in our environment are in fact cylindrical or conic surfaces. Moreover, we can always segment a general ZGC surface into cylindrical and conic sections (mostly at the inflection points) and process each section individually with appropriate constraints applied along the lines of segmentation. For conic surfaces, correspondence of line-convergent symmetry is restricted to be along the lines that pass through a common point, the apex of the cone.

To recover the local surface normals for ZGCs cut by non-parallel planes, we need to decide which cross section curve is to be made orthogonal to the rulings. This is because the cross section planes are not parallel to each other.

Figure 4: The parameters of a ZGC surface and the constraints on the gradient (p, q) of the surface along the ruling r.

A ZGC surface cut by non-parallel planes has five degrees of freedom. Of the five degrees of freedom, four are the gradients $(p_t, q_t)$ and $(p_b, q_b)$ of the top and the bottom planes cutting the ZGC surface, and the fifth is the u parameter discussed below.

We can model a conic surface by using any 3-D axis that goes through the apex of the cone. We use the 3-D axis that projects as the 2-D axis of the straight edges of the cone. If the image direction of the axis is $(a_x, a_y)$ then the 3-D direction of the axis in gradient space representation is $(u a_x, u a_y)$ where u is a free variable, and it is the fifth parameter of the ZGC representation.


4.2 Recovering ZGCs Cut by Two Planes

Consider figure 4, and let $(p_t, q_t)$ be the gradient of the cross section plane that is chosen to be made orthogonal to the surface. The gradient (p, q) of the surface along the ruling r is given by the combination of the two linear constraints given below. The first one is the shared boundary constraint, given in section 3.1, between the gradients (p, q) and $(p_t, q_t)$ using the tangent (x', y') of the intersection curve at the point where the curve touches the ruling r. The equation of the constraint is $(p - p_t, q - q_t) \cdot (x', y') = 0$. This constraint is shown in the gradient space by the line labelled ⊥ (x', y') in figure 4.

The second constraint is that the 3-D gradient $(p_r, q_r, 1)$ of the ruling r must be orthogonal to the 3-D gradient (p, q, 1) of the surface along ruling r, that is, $(p, q, 1) \cdot (p_r, q_r, 1) = 0$. In figure 4 this constraint is shown by the line labelled ⊥ $(p_r, q_r)$, which is the orthogonal line of the gradient $(p_r, q_r)$, i.e., the gradient of the set of the directions that are orthogonal to $(p_r, q_r)$ in 3-D. Note that this line is also orthogonal to the 2-D direction of the image of the ruling.

The gradient $(p_r, q_r)$ of the ruling r is obtained by reconstructing the axis line $(a_x, a_y)$, and the line between points (x, y) and $(x_p, y_p)$ in 3-D (i.e., computing the z coordinates of these image points). Since the gradient of the axis line is $(u a_x, u a_y)$, fixing $z_p = 0$, the z coordinate $z_{tip}$ of the point $(x_{tip}, y_{tip})$ is given by

$$z_{tip} = \frac{x_{tip} - x_p}{u\,a_x} \qquad (5)$$

The z coordinate of the point (x, y) is computed using the gradient $(p_t, q_t)$ as:

$$z = p_t (x - x_p) + q_t (y - y_p) \qquad (6)$$

Then the gradient $(p_r, q_r)$ of the ruling is given by:

$$(p_r, q_r) = \frac{(x_{tip} - x,\; y_{tip} - y)}{z_{tip} - z} \qquad (7)$$

Figure 5: Constraints on the orientation of the cutting planes of a ZGC surface.

Figure 6: The telephone example.

These formulas are exactly the same for ZGCs cut by parallel planes. In that case the result is independent of which parallel symmetry curve is used.
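The computation in this section can be sketched in two small functions: one that follows equations (5) to (7) to obtain the ruling gradient, and one that solves the two linear constraints, $(p - p_t, q - q_t) \cdot (x', y') = 0$ and $(p, q, 1) \cdot (p_r, q_r, 1) = 0$, for the surface gradient (p, q). This is an illustrative sketch rather than the authors' code; the function and argument names are assumptions, and the sketch follows equation (5) in dividing by $u a_x$, so it does not handle an axis that is vertical in the image.

```python
def ruling_gradient(x, y, xp, yp, xtip, ytip, u, ax, pt, qt):
    """Gradient (pr, qr) of the ruling through image point (x, y)."""
    z_tip = (xtip - xp) / (u * ax)            # equation (5), with z_p fixed to 0
    z = pt * (x - xp) + qt * (y - yp)         # equation (6)
    dz = z_tip - z
    return ((xtip - x) / dz, (ytip - y) / dz)  # equation (7)


def surface_gradient(pt, qt, tx, ty, pr, qr):
    """Solve for (p, q) from the two linear constraints:

        (p - pt) * tx + (q - qt) * ty = 0    (shared boundary, tangent (tx, ty))
        p * pr + q * qr + 1 = 0              (orthogonal to the ruling)
    """
    det = tx * qr - ty * pr
    if abs(det) < 1e-12:
        raise ValueError("the two constraint lines are parallel")
    rhs1 = pt * tx + qt * ty
    rhs2 = -1.0
    p = (rhs1 * qr - ty * rhs2) / det
    q = (tx * rhs2 - rhs1 * pr) / det
    return p, q
```

For instance, with the apex at image point (3, 4), the reference point at the origin, axis direction (0.6, 0.8) and u = 2, equation (5) gives z_tip = 2.5; a contour point (1, 0) on a cutting plane of gradient (0.5, 0) then has z = 0.5 and its ruling has gradient (1, 2).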

Compared to ZGCs cut by parallel planes, for a ZGC surface cut by non-parallel planes there are two additional unknowns, which are the gradient parameters of the second cutting plane. Also, there are two additional constraints on the orientation of the planes cutting the ZGC surface. These constraints are not needed to compute the local surface normals of a ZGC surface. In fact the gradient of the second plane (i.e., the plane that is not chosen to be made orthogonal to the surface) is not needed at all for that purpose. However, for the multiple surface recovery algorithm the gradient of the second plane is needed, and so are these constraints. Here we state the constraints; they will be used in section 5.1.

The first one is a shared boundary constraint: for a ZGC surface S with line-convergent symmetry, let $(p_t, q_t)$ be the gradient of the top plane, let $(p_b, q_b)$ be the gradient of the bottom plane, and let the intersection line have direction $(l_x, l_y)$ on the image plane. Since the top and the bottom planes actually intersect each other along the line l in 3-D, we have the shared boundary constraint as:

$$SBC(p_t, q_t, p_b, q_b, l_x, l_y) = 0 \qquad (8)$$

For the second constraint, consider figure 5, and let $(p_r, q_r)$ be the local surface gradient of the surface S along ruling r. This constraint enforces that the 3-D lines $t_1$ and $t_2$ be on the same tangent plane having gradient $(p_r, q_r)$. The constraint is:

$$t_1 \times r = t_2 \times r \qquad (9)$$

5  Combined  Shape  Recovery 

Many objects of interest consist of several curved surfaces. Here the recovered 3-D individual surfaces must be in agreement with the neighboring surfaces, i.e., surfaces sharing a common boundary. We describe a technique for such integrated multiple surface recovery for objects consisting of planar and ZGC surfaces. Figures 1 and 6 show some sample objects.

The shape of all the surfaces is recovered simultaneously by finding appropriate values for the parameters of each surface. The values of the surface parameters are computed by solving the following constrained minimization problem:

$$\min E_i \quad \text{subject to} \quad E_x = 0 \qquad (10)$$

where $E_i$ stands for error terms resulting from internal constraints of each surface and $E_x$ are the external terms, that is, the constraints obtained by intersection of surfaces.

5.1  Internal  Constraints 

The internal constraints are the constraints obtained from the regularity assumptions of each surface. In general they have the following form:

$$E_i = \sum w_p E_p + \sum w_o E_o + \sum w_c E_c \qquad (11)$$

where each w is a weight, $E_p$ is the error term for the orthogonality constraint of the planes, $E_o$ is the error term for the orthogonality constraint of the ZGC surfaces and $E_c$ is the error term for the implicit constraint of the parameters of ZGC surfaces. These error terms are described in more detail in the following.

$E_p$ is the error term for the orthogonality constraint of
planes. If a planar surface has a skew symmetry, then
this is the orthogonality function of the lines and axis of
skew symmetry as given in equation 3, where $(x_1, y_1)$ and $(x_2, y_2)$
are the image directions of the lines of symmetry and the
axis of symmetry, and $(p, q)$ is the gradient of the plane.
$w_p$ is the weight of $E_p$ and is proportional to the total
length of the contour enclosing the surface. The formula
used for $w_p$ is $w_p = \sqrt{l_c}$, where $l_c$ is the total length of
the curve enclosing the surface. If the surface does not
have skew symmetry, $E_p$ is zero.

$E_o$ is the error term for the orthogonality of ZGC surfaces.
As in our previous work on ZGC surfaces [UN90a],
we choose to make the directions of the rulings orthogonal
to the tangents of the chosen parallel or line-convergent
symmetry curve in 3-D.

$E_o = \sum_i O(p_i, q_i, x'_i, y'_i, x_{tip} - x_i, y_{tip} - y_i)$  (12)

Here $O(\cdot)$ is the orthogonality constraint given in equation 3 and,
for $i \in [0, N - 1]$, at the $i$-th location: $(p_i, q_i)$ is the local
surface normal represented in gradient space, $(x'_i, y'_i)$ is the
tangent of the line-convergent symmetry curve that is
chosen to be orthogonal to the rulings, $(x_i, y_i)$ is the location
of the point where the ruling meets the line-convergent
symmetry curve, and $(x_{tip}, y_{tip})$ is the location of
the apex of the cone. These are denoted as $(p, q)$, $(x', y')$,
$(x, y)$, $(x_{tip}, y_{tip})$ respectively in figure 4. However,
computing $E_o$ requires computation of local normals at each
point on the surface at each iteration of the minimization,
which is very time consuming. Also, since it is a huge
nonlinear expression, it creates many local minima around
the true minimum, in which the minimization routine gets
stuck. Therefore, based on our experiments, we use an
approximation of $E_o$, $\hat{E}_o$, such that $\hat{E}_o$ is quadratic and the
error given by $\hat{E}_o$ is very similar to the one given by $E_o$.
$\hat{E}_o$ has the following form:

$\hat{E}_o = E_{ax} + E_t$  (13)

where $E_{ax} = \cos^2(\alpha)$, and $\alpha$ is the angle between the
gradient $(p_t, q_t)$ of the plane containing the parallel (or
line-convergent) symmetry curve that is chosen to be made
orthogonal and the direction of the image axis $(a_x, a_y)$ of
figure 4. The motivation for $E_{ax}$ is the observation
that the minimum of $E_o$ generally occurs when $(p_t, q_t)$ of
figure 4 lies along a line in $p$-$q$ space that passes through the
origin and is parallel to the image axis. $E_t = (u_{init} - u)^2$,
where $u$ is the u-parameter of the ZGC surface and $u_{init}$
is set at initialization by minimizing the orthogonality
error $E_o$ given in equation 12. In our implementation
we tried both $E_o$ and $\hat{E}_o$. It turns out that $\hat{E}_o$ performs
better (in the sense of stability) because it is a simpler
error function and creates fewer local minima. One can
use $\hat{E}_o$ as an initializer for $E_o$, but in our experiments
another run of minimization with $E_o$ after $\hat{E}_o$ was not
necessary. $w_o$ is the weight of the orthogonality term
and is proportional to the total length of the perimeter
of the surface, $w_o = \sqrt{l_o}$, where $l_o$ is the total length of
the contour enclosing the surface.

$E_c$ is the error term for implicit constraints of the
parameters of ZGC surfaces. Let $(p_t, q_t)$ and $(p_b, q_b)$ be
the gradients of the planes containing the two parallel
(or line-convergent) symmetry curves of the ZGC
surface. If the ZGC surface has a parallel symmetry,
then $(p_t, q_t)$ should be equal to $(p_b, q_b)$; therefore
$E_c = Eq(p_t, q_t, p_b, q_b)$, where $Eq()$ is given in equation 4.
If the ZGC surface has a line-convergent symmetry, then
$E_c$ is the sum of the constraints given in equations 8
and 9. $w_c$ is the weight and is inversely proportional to
the eccentricity of the parallel (or line-convergent)
symmetry curves. If the parallel (or line-convergent)
symmetry curves are highly eccentric, i.e., they are almost
straight, then the weight of this constraint is low. The
formula is $w_c = 1/ecc$, where $ecc$ is the eccentricity of
the total cross section curve (the eccentricity of a curve
is computed by using the scattering matrix of the curve).

5.2  External  Constraints 

External constraints are the inter-surface restrictions
imposed by each surface on neighboring surfaces. External
constraints have the following form:

$E_x = \sum w E_{pp} + \sum w E_{pz} + \sum w E_{zz}$  (14)

where $w$ is the weight of each constraint and is computed
from the length of the curve produced
by the intersection of the surfaces. $E_{pp}$, $E_{pz}$ and $E_{zz}$ are
the error terms for the shared boundary constraint between
planes, between planes and ZGC surfaces, and between
ZGC surfaces, respectively. In detail, the individual error
terms are:

Epp  is  the  error  of  the  shared  boundary  constraint  be¬ 
tween  the  gradients  of  the  two  intersecting  planes  as 
given  in  equation  2. 

$E_{pz}$ is the error term for the shared boundary constraint
between a plane and a ZGC surface. There are two
possibilities: the intersection is along a ruling of the ZGC
surface, or the intersection is along a parallel (or
line-convergent) symmetry of the ZGC surface. If the
intersection is along the ruling of the ZGC surface, then $E_{pz}$
is the shared boundary constraint as given in equation 2
between the gradient of the plane and the local surface
normal of the ZGC surface at the ruling of intersection.
If the intersection is along one of the parallel (or
line-convergent) symmetry curves, then:

$E_{pz} = Eq(p, q, p_t, q_t)$  (15)

where $(p_t, q_t)$ are the parameters of the ZGC surface
that give the gradient of the plane containing the
intersection curve, and $(p, q)$ is the gradient of the planar surface.

$E_{zz}$ is the error term for the shared boundary constraint
between two ZGC surfaces. There are various
ways two ZGC surfaces may intersect each other. Here
we only handle the intersections that produce a planar
intersection curve. There are two types of such
intersections: along the rulings of the ZGC surfaces or along
the parallel (or line-convergent) symmetry of the ZGC
surfaces. If the intersection is along the rulings of the
ZGC surfaces, then the shared boundary constraint given
in equation 2 is applied between the local surface
normals of the ZGC surfaces at the ruling of intersection. If
the intersection is along the parallel (or line-convergent)
symmetry curves, then let $(p_1, q_1)$ and $(p_2, q_2)$ be the
gradients of the planes containing the intersection curve
in the representations of the first and the second
intersecting ZGC surfaces. The error term is:

$E_{zz} = Eq(p_1, q_1, p_2, q_2)$  (16)

When two ZGC surfaces intersect each other along
their parallel symmetry (or line-convergent symmetry)
curves, how orthogonal both surfaces can be made
depends on how parallel their image axes are. Therefore we
form a new orthogonality error term $E_{on}$ for the
intersecting ZGC surfaces to replace their original
orthogonality error terms (the $E_o$'s). Let $\alpha$ be the angle between the
image axes of these surfaces, and let $E_{o1}$ and $E_{o2}$ be the error
terms for the orthogonality of the intersecting ZGC
surfaces. Then the new combined orthogonality error term
is:

$E_{on} = \cos^2(\alpha)(E_{o1} + E_{o2}) + \sin^2(\alpha)(E_{o1} E_{o2})$  (17)

$E_{on}$ emphasizes the orthogonality of both of the ZGC
surfaces when the image axes are almost parallel to each
other, and it emphasizes the orthogonality of either of
the ZGC surfaces when the image axes are almost
orthogonal to each other.
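The behaviour of the combined term at the two extremes can be checked numerically. The following is a minimal sketch, assuming equation 17 is read with squared cosine and sine factors; the function name and the sample error values are illustrative and not part of the original implementation.

```python
import math

def combined_orthogonality(alpha, e_o1, e_o2):
    """Blend two orthogonality errors by the angle (radians) between image axes.

    Assumed reading of equation 17:
    cos^2(alpha) * (E_o1 + E_o2) + sin^2(alpha) * (E_o1 * E_o2).
    """
    c2 = math.cos(alpha) ** 2
    s2 = math.sin(alpha) ** 2
    return c2 * (e_o1 + e_o2) + s2 * (e_o1 * e_o2)

# Parallel image axes: the errors add, so both surfaces must be orthogonal.
parallel = combined_orthogonality(0.0, 0.2, 0.3)

# Orthogonal image axes: the product vanishes as soon as either error is zero,
# so satisfying orthogonality for one surface is enough.
perpendicular = combined_orthogonality(math.pi / 2, 0.0, 0.3)
```

With parallel axes the term reduces to the sum of the two errors; with orthogonal axes it reduces to their product.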

5.3  Solving  Constraint  Equations 

The total error function $E$ is minimized using a constrained
minimization technique, where $E_x$ consists of “must-satisfy”
external constraints and $E_i$ consists of
assumption-driven error terms as defined earlier. To solve this
constrained minimization (where the constraints are
nonlinear), the problem is converted into a minimization
form as follows:

$\lim_{\lambda \to \infty} \min E = \lim_{\lambda \to \infty} \min (E_i + \lambda E_x)$  (18)

That is, $E$ is minimized for successively larger values
of $\lambda$, thus emphasizing $E_x$ more at each minimization
cycle. At the end, the $E_x$ constraints are satisfied almost
exactly and the $E_i$'s are minimized to the extent possible.
In our implementation we increased $\lambda$ from 1 to 100 in
exponential steps of 3.5 (that is, $\lambda$ = 1, 3.5, 12.25, etc.).
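The penalty schedule described above can be sketched on a toy problem. This is only an illustration under stated assumptions: the two quadratic error functions, the step-size rule, and the choice to stop once the next $\lambda$ would exceed 100 are all hypothetical stand-ins, not the error terms or settings of the actual system.

```python
def gradient_descent(grad, x, step, iters=2000):
    """Plain fixed-step gradient descent on a list of floats."""
    for _ in range(iters):
        g = grad(x)
        x = [xi - step * gi for xi, gi in zip(x, g)]
    return x

# Hypothetical stand-ins: E_i = (a - 2)^2 + (b - 1)^2 and E_x = (a - b)^2.
def grad_e_i(x):
    a, b = x
    return [2.0 * (a - 2.0), 2.0 * (b - 1.0)]

def grad_e_x(x):
    a, b = x
    return [2.0 * (a - b), -2.0 * (a - b)]

def penalty_minimize(x0, lam=1.0, factor=3.5, lam_max=100.0):
    """Minimize E_i + lam * E_x for lam = 1, 3.5, 12.25, ... (assumed <= lam_max)."""
    x = list(x0)
    schedule = []
    while lam <= lam_max:
        def grad(x, lam=lam):
            return [gi + lam * gx for gi, gx in zip(grad_e_i(x), grad_e_x(x))]
        # Step size from the largest curvature of this toy problem (2 + 4 * lam).
        x = gradient_descent(grad, x, step=1.0 / (2.0 + 4.0 * lam))
        schedule.append(lam)
        lam *= factor  # each cycle starts from the previous solution
    return x, schedule

solution, lambdas = penalty_minimize([0.0, 0.0])
```

Each cycle is warm-started from the previous one, so the constraint residual shrinks as $\lambda$ grows while the internal error stays as small as the constraint allows. In the toy problem the result approaches a = b = 1.5, the constrained minimum.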


Figure 7: The segmented surfaces, and the symmetries
computed for each surface. The skew symmetries of planar
surfaces are shown by crosses; the long line is the axis of
symmetry and the short one is the direction of the lines
of symmetry. Parallel and line-convergent symmetries
are shown by their curved axes only.

For the minimization, a gradient descent algorithm is
used. The set of parameters of the surfaces ($(p, q)$'s and
$u$'s) minimizing $E$ is taken as the solution set and used
to reconstruct the local surface gradients.

Initial  values  of  the  parameters  of  E,  i.e.,  the  param¬ 
eters  of  all  the  surfaces  involved  in  E,  are  computed  by 
an  initializer.  The  initializer  starts  with  an  arbitrary 
ZGC  surface,  and  sets  its  parameters  as  if  it  is  an  iso¬ 
lated  surface.  Then,  the  initializer  sets  the  parameters 
of  the  neighboring  surfaces  by  keeping  them  consistent 
with the first surface, and the neighbors of these surfaces
are processed progressively until the parameters of
all  the  surfaces  are  initialized. 

6  Implementation  and  Results 

For the results shown in this section the following
implementation is used. The inputs to the program are
segmented curves represented as lists of points that define
the contour of each object. However, we do not assume
that the input curves are noise free. These segmented
curves are grouped into closed regions using continuity.
Each closed region is taken to correspond to an object
surface. Next, we find symmetries among segments of a
surface. The details of symmetry finding are presented in
[Ulu91].

The surfaces containing a parallel or line-convergent
symmetric segment pair are treated as curved and the others
are treated as planar. For curved surfaces, the curves
joining parallel symmetric curves are checked for
straightness to confirm that the surface is a ZGC.

Some surfaces are combinations of various curved surfaces
and there is no distinctive boundary between them.
This  is  the  case  for  the  curved  surfaces  of  the  object 
in  figure  6.  Such  surfaces  contain  more  than  one  par¬ 
allel  (or  line-convergent)  symmetry  and  they  are  seg¬ 
mented  into  smaller  surfaces  containing  only  one  parallel 
(or  line-convergent)  symmetry.  Figure  7  shows  the  seg¬ 
mented  surfaces  and  the  symmetries  (skew,  parallel  or 
line-convergent)  for  each  surface. 

The  constraints  for  each  surface  and  inter-surface  con¬ 
straints  including  the  ones  for  the  newly  formed  intersec¬ 
tions  are  extracted  forming  the  error  function  E.  Then 
E  is  minimized  by  the  constraint  minimization  technique 
discussed  in  section  5.3. 

Figure 8: The needle and the shaded images obtained
from the computed surface normals for the objects in 6.

Figure 8 shows the final results for the objects in figures
1 (a) and 6. This process takes approximately 2
minutes for an object on a Symbolics 3645 computer running
LISP. The computed surface normals are shown
by needle diagrams, as needles sticking to the surface
in the direction of the local surface normals. For planar
surfaces, a small coordinate frame, with a triangle
at the base, is used to better show the computed surface
normal. We also provide the shaded images of the
objects computed by using the surface normals, a Lambertian
reflectance model and a point light source. Note
that, for the bottom object in figure 8, the middle surface
is initially classified as a curved surface due to the
parallel symmetry it has. However, it is perceived as a
planar surface, affected by the planarity of the top surface,
as discussed in the Introduction. The final result
of the minimization, in fact, shows the middle surface
as planar within the error bounds of the minimization. We
believe that the computational results are in agreement
with human perception, though we have not attempted
a quantitative comparison.
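The shading step can be illustrated with a short sketch. It assumes the common convention that a surface with gradient $(p, q)$ has normal direction $(-p, -q, 1)$, and it treats the light as a distant source with a fixed unit direction; the function names, the albedo parameter, and the light direction are illustrative assumptions, not details taken from the implementation above.

```python
import math

def normal_from_gradient(p, q):
    """Unit surface normal for gradient (p, q), using the (-p, -q, 1) convention."""
    norm = math.sqrt(p * p + q * q + 1.0)
    return (-p / norm, -q / norm, 1.0 / norm)

def lambertian_intensity(p, q, light, albedo=1.0):
    """Brightness of a Lambertian patch lit from the unit direction `light`."""
    n = normal_from_gradient(p, q)
    cos_incidence = sum(ni * li for ni, li in zip(n, light))
    # Patches facing away from the light receive no illumination.
    return albedo * max(0.0, cos_incidence)

# A patch facing the light is fully lit; a patch tilted by 45 degrees is dimmer.
frontal = lambertian_intensity(0.0, 0.0, (0.0, 0.0, 1.0))
tilted = lambertian_intensity(1.0, 0.0, (0.0, 0.0, 1.0))
```

Evaluating this function at every recovered gradient gives a shaded image of the kind described, up to the choice of light direction.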

7  Conclusion 

We  have  described  a  technique  for  recovering  3-D  shape 
of  objects  consisting  of  zero-Gaussian  curvature  (and 
planar)  surfaces.  Our  method  incorporates  the  con¬ 
straints  imposed  by  all  the  surfaces  simultaneously.  As 
shown by an example, this can result in fixing the shapes
of some surfaces which are otherwise ambiguous.

We have attempted to show the accuracy of our results
by comparison with human perception. For shape
from  contour  analysis,  the  only  ground  truth  is  really  in 
human  perception,  for  even  if  the  given  contour  was  ob¬ 
tained  by  a  real  object,  it  could  have  been  also  obtained 
by  any  number  of  other  objects  as  well.  Our  system  has 
a  certain  notion  of  preference  that  is  based  partly  on  geo¬ 
metrical  analysis  and  partly  on  perceptual  observations. 

The current system assumes that clean and complete
boundaries are given. This cannot be expected to be the
case for real objects in complex environments. Coping
with these difficulties will require incorporation of some
perceptual organization techniques. We believe that our
methodology  will  help  in  this  step  too,  as  we  are  able  to 
provide  strong  constraints  that  hypotheses  of  a  percep¬ 
tual  organization  system  can  be  tested  against.  This  is 
the  topic  of  separate,  current  research  in  our  laboratory. 
Abstract


Line drawings provide an effective means of communication
about the geometry of 3-D objects. An understanding
of how to duplicate the way humans interpret
line drawings is extremely important in enabling
man-machine  communication  with  respect  to  images,  di¬ 
agrams,  and  spatial  constructs.  In  particular,  such  an 
understanding  could  be  used  to  provide  the  human  with 
the  capability  to  create  a  line-drawing  sketch  of  a  polyhe¬ 
dral  object  which  the  machine  can  automatically  convert 
into  the  intended  3-D  model. 

A  recently  published  paper  (Marill  1991)  presented  a 
simple  optimization  procedure  supposedly  able  to  du¬ 
plicate  human  judgement  in  recovering  the  3-D  “wire 
frame”  geometry  of  objects  depicted  in  line  drawings. 
Marill  provided  some  impressive  examples,  but  no  theo¬ 
retical  justification  for  his  approach.  In  this  paper  we  in¬ 
troduce  our  own  work  by  first  critically  examining  Mar- 
ill’s  algorithm.  We  provide  an  explanation  for  why  Mar- 
ill’s  algorithm  was  able  to  perform  as  well  as  it  did  on  the 
examples  he  presented,  discuss  its  weaknesses,  and  show 
very  simple  examples  where  it  fails.  We  then  provide  an 
algorithm  that  improves  on  Marill’s  results.  In  partic¬ 
ular,  we  show  that  an  effective  objective  function  must 
favor  both  symmetry  and  planarity — Marill  deals  only 
with  the  symmetry  issue.  By  modifying  Marill’s  objec¬ 
tive  function  to  explicitly  favor  planar-faceted  solutions, 
and  by  using  a  more  competent  optimization  technique, 
we  were  able  to  demonstrate  significantly  improved  per¬ 
formance  in  all  of  the  examples  Marill  provided  and  those 
additional ones we constructed ourselves. Finally, we
examine some questions relevant to the implications of this
work  for  understanding  the  human  ability  to  interpret 
line  drawings. 


*  The  work  reported  here  was  partially  supported  by  the  Defense 
Advanced  Research  Projects  Agency.  We  gratefully  acknowledge 
the  valuable  discussions  with  Aaron  Bobick  and  Tom  Strat  regard¬ 
ing  both  the  content  and  organization  of  this  paper. 


1  Introduction 

The interpretation of line drawings has been an important
focus for research in machine vision since the field’s
inception. There seems to be little question that human
subjects can easily recover 3-D models from the
2-D line drawings depicting many classes of objects. One
such class of special interest has been called the “blocks
world.” This class consists primarily of polyhedral solids
in three dimensional Euclidean space and the projections
of the visible edges of these objects onto a 2-D plane
(which we call the line drawing). Given a single line
drawing of a blocks world scene, normal human subjects
will usually arrive at the same 3-D interpretation, even
though there may be a very large number of possible 3-D
objects that could have produced the given drawing.

Beginning with the work of Guzman in 1968, there
has been a concerted effort by vision researchers to develop
an algorithmic procedure that could duplicate human
performance in interpreting line drawings, at least
with respect to blocks world objects. A significant body
of work in this area was produced by such prominent scientists
as Clowes (1971), Huffman (1971), Waltz (1972),
Mackworth (1973), Kanade (1980), Draper (1981), and
Sugihara (1982, 1984). However, the problem as originally
formulated, devising a procedure for recovering
psychologically plausible 3-D models from line drawings,
remains unsolved.

The earliest work by Guzman was heuristic in nature,
failed in many cases where humans had no trouble in
finding appropriate interpretations, and did not actually
return a 3-D model, but rather partitioned the scene into
separate polyhedral objects. Clowes, Huffman, Waltz,
Mackworth, and Kanade formalized and extended the
work of Guzman, but did not solve the original problem.
They were (usually) able to label the edges of the line
drawing to correctly reflect a consistent 3-D interpretation
if one existed, or could assert that the drawing did
not correspond to a realizable blocks world scene.
Mackworth and Kanade explicitly exploited the planarity of
the faces of blocks world and “Origami” objects (by
employing a “gradient space” representation) to accomplish
a form of semi-quantitative recovery. In addition to
consistent edge labeling, they could also constrain the
relative orientation of the faces of the target 3-D model. The
labels could describe the edges as being convex, concave,
occluding edges, etc., but still, for the general case, no
explicit 3-D model was returned (without introducing
additional constraints) and the algorithms would make
occasional errors.^1

In a series of papers, Sugihara reformulated the realizability
and recovery problems for line drawings of polyhedra
(both with and without hidden lines removed) in
purely algebraic terms. He required as input a specification
of the vertices defining each of the individual
planar faces of the polyhedra, and also required that the
implied line drawing be a general-position projection of
the polyhedra. With this approach he succeeded in providing
an algebraic criterion as a necessary and sufficient
condition for a line drawing to represent a physically
realizable polyhedral object. He could also constrain the
space of feasible solutions, and obtain a unique solution
if enough additional constraints were provided. These
additional constraints were obtained from information
beyond that provided by the line drawing (e.g., shading
or texture information). Sugihara’s work was an important
advance, but again, it fell short of the original goal.
It will rarely be the case that a unique reconstruction is
implied by the line drawing, and thus the primary objective
of duplicating human performance in this regard
is not met.^2

Our motivation for writing this paper was supplied,
in part, by a recent publication authored by T. Marill
(1991). He refocused on the original problem of
human interpretation of single line drawings as
three-dimensional structures; he did not restrict his universe
to blocks world objects nor did he demand that the line
drawings be complete. The surprising thing about his
work was that he used an optimization approach involving
(seemingly) an almost trivial objective function, and
the simplest possible gradient descent algorithm to find
a solution, and yet provided examples of reconstructed
objects that were, intuitively, extremely good. (Figure 1,
Examples A through I, shows the line drawings used in
Marill’s experiments.) However, his paper provided no
justification for why the algorithm should work, and thus
no basis for judging its generality or insight into how it
could be improved (should this be desirable).

^1 Gradient space, originally conceived of by James Clerk
Maxwell in 1864 (see (Whitely 1986)) and rediscovered by D. A.
Huffman, provides only necessary conditions for planar realizability
of general polyhedral objects with hidden lines removed, and
thus consistent edge labeling is possible for impossible blocks world
and Origami objects. Further, the labeling/recovery algorithms
were not always competent to find an existing solution.

^2 There were some other problems of lesser significance for our
purposes in this paper. For example, the algebraic formulation was
sensitive to computation round-off errors, and digitization errors
in specifying the line drawing; a realizable object could be rejected
because of such minor numeric inaccuracies. Sugihara dealt with
this problem by adding an optimization step to his algorithm which
could find a feasible reconstruction if the input drawing was an
almost correct specification.

In this paper we introduce our own work by first critically
examining Marill’s algorithm. We provide an explanation
for why Marill’s algorithm was able to perform
as well as it did on the examples he presented, discuss
its weaknesses, and show very simple examples where it
fails (Figure 1, Examples J through N). We then provide
an algorithm that improves on Marill’s results for
all nine of his examples, and also successfully deals with
the simple cases where Marill fails. Finally, we examine
some questions relevant to the implications of this
work for understanding the human ability to interpret
line drawings.

We  see  the  work  described  in  this  paper  as  being  of 
both  theoretical  and  practical  interest.  The  practical 
utility of this work is its relevance to man-machine communication
about 3-D structures via line drawings; in
particular, providing the human with the capability to
create a line drawing sketch of a polyhedral object which
the machine can automatically convert into the intended
3-D model. Deficiencies in providing a complete theory
are not fatal, since auxiliary information can always be
supplied interactively to resolve ambiguities, but the underlying
theory should reduce this “side communication”
to a minimum.

2  Marill’s  Algorithm 

Marill’s  algorithm  consists  of  two  components,  an  objec¬ 
tive  function  and  a  gradient  descent  optimization  proce¬ 
dure  for  finding  a  local  minimum  of  this  objective  func¬ 
tion.  The  objective  function  is  simply  the  standard  de¬ 
viation  of  all  of  the  angles  (SDA)  in  the  recovered  3-D 
object  with  respect  to  their  common  mean.  Marill  calls 
the  minimization  of  the  SDA  the  MSDA  principle. 

The  input  line  drawing  is  specified  as  a  set  of  points 
(vertices)  and  lines;  each  point  is  represented  by  an  (x,  y) 
coordinate  pair,  and  each  line  is  represented  by  an  inte¬ 
ger  pair  corresponding  to  the  sequence  numbers  of  the 
two  points  it  joins.  The  representation  of  the  recovered 
3-D  object  involves  supplying  a  third  (z)  coordinate  for 
each of the originally specified points. This is called the
orthographic  extension  of  the  line  drawing,  and  is  actu¬ 
ally  a  wire  frame  rather  than  a  solid  object. 

To  evaluate  the  objective  function  for  a  given  proposed 
solution,  every  pair  of  lines  terminating  on  a  point  (as 
defined  in  the  input  specification)  is  considered  to  form  a 
separate  angle.  Thus,  if  five  lines  terminate  on  the  same 
point,  every  potential  3-D  solution  contains  ten  angles 
at  this  point  which  contribute  to  the  objective  function. 
Note  that  the  intersection,  between  two  lines  that  hap¬ 
pen  to  cross  at  intermediate  points  of  their  extent  in  the 
line  drawing,  is  not  treated  as  a  vertex,  and  does  not 
contribute  to  the  objective  function  (even  if  the  lines  lie 
in the same plane in the 3-D reconstruction). Similarly,
two distinct vertices can have the same (x, y) coordinates
in  the  line  drawing,  but  then  the  line  segments  terminat¬ 
ing  on  the  distinct  vertices  do  not  interact  to  form  angles 
(even  if  the  vertices  coincide  in  the  3-D  reconstruction). 

Thus, given a line drawing with n vertices, each possible
orthographic extension is represented as a z-vector
having n components; the corresponding angles and SDA
are computed to evaluate the proposed solution. Marill
uses a gradient-descent technique to search for a best answer,
recognizing that this is simply a heuristic and that
this approach will only find a single local minimum of
his objective function. The input object has all of its z
values initially set to zero; i.e., it is a flat object lying in
the (x, y) plane. At each stage in the search, the SDA of
the current z-vector is computed and the program then
looks at the children of the current vector. These 2n
children are all of the vectors one step size away from
the current vector, and are formed by both adding and
subtracting a specified value (Δz) to each of the n components
of the current z-vector in turn. The value of the SDA
is computed for each of these 2n children, and the child
with the minimum SDA is selected as the new current
vector. This process is repeated until no improvement
in the SDA is obtained, and the resulting z-vector is returned
as the solution for the first of three rounds of
gradient descent. Each additional round uses a smaller
Δz and begins with the result of the preceding round.
Marill experimentally found effective values of Δz for his
three rounds to be 1, 0.5, and 0.1.
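The search just described can be sketched in a few lines. This is a minimal illustration written from the description above, not Marill’s code: the function names, the use of the population standard deviation, the `max_steps` safeguard, and the rhombus test drawing are all assumptions introduced here.

```python
import itertools
import math
import statistics

def angles(points, lines, z):
    """Angles (radians) between every pair of lines that share a vertex."""
    incident = {i: [] for i in range(len(points))}
    for a, b in lines:
        incident[a].append(b)
        incident[b].append(a)
    def edge(v, w):
        return (points[w][0] - points[v][0], points[w][1] - points[v][1], z[w] - z[v])
    def norm(u):
        return math.sqrt(sum(c * c for c in u))
    result = []
    for v, ends in incident.items():
        for a, b in itertools.combinations(ends, 2):
            u, w = edge(v, a), edge(v, b)
            cos = sum(ui * wi for ui, wi in zip(u, w)) / (norm(u) * norm(w))
            result.append(math.acos(max(-1.0, min(1.0, cos))))
    return result

def sda(points, lines, z):
    """Standard deviation of all angles (population form, an assumption)."""
    return statistics.pstdev(angles(points, lines, z))

def msda(points, lines, step_sizes=(1.0, 0.5, 0.1), max_steps=1000):
    """Greedy search over the 2n children of the current z-vector, per round."""
    z = [0.0] * len(points)  # start from the flat object
    for dz in step_sizes:
        current = sda(points, lines, z)
        for _ in range(max_steps):  # safeguard added here, not in the description
            children = []
            for i in range(len(z)):
                for delta in (dz, -dz):
                    child = list(z)
                    child[i] += delta
                    children.append(child)
            best = min(children, key=lambda c: sda(points, lines, c))
            best_value = sda(points, lines, best)
            if best_value >= current:
                break  # no child improves the SDA: end of this round
            z, current = best, best_value
    return z

# Hypothetical input: a rhombus, which is the projection of a tilted square.
points = [(-2.0, 0.0), (0.0, 1.0), (2.0, 0.0), (0.0, -1.0)]
lines = [(0, 1), (1, 2), (2, 3), (3, 0)]
z = msda(points, lines)
```

Because a vertex with k incident lines contributes k(k-1)/2 angles, five lines at a vertex give the ten angles mentioned earlier. On the rhombus the search lowers the SDA below its flat value by lifting vertices out of the image plane.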

Figure 2 shows a line drawing, its internal representation
as described above, and the reconstructions using
Marill’s  algorithm  and  the  algorithm  we  describe  in  Sec¬ 
tion  3. 

In  the  top  left  window  of  the  figure  is  the  input  line 
drawing  (with  the  vertices  numbered  for  reference  by  the 
written  representations  below).  The  four  windows  on  the 
top  right  show  two  views  of  Marill’s  reconstruction  and 
two  views  of  our  reconstruction.  In  the  middle  of  the 
figure  is  a  table  showing  the  internal  representation  of 
the  input  line  drawing.  In  the  first  row  are  the  (x,y) 
coordinates  of  the  vertices,  in  the  order  shown  on  the 
line drawing.^3 In the second row are the integer pairs
representing  the  lines  in  the  drawing.  In  the  third  row 
are  the  sequences  of  vertices  corresponding  to  the  pla¬ 
nar  faces  derived  according  to  the  rules  of  Appendix  A 
(see  Section  3).  The  reconstructions  will  be  discussed  in 
detail  in  Section  3.1. 

^3 For simplicity, the vertices are written using only two digits
of precision in the table. However, we used the full 32-bit precision
of the projection in the internal representation used by the
algorithms.

2.1  Marill’s  Examples

Marill described the application of his algorithm to Examples
A through I of Figure 1. We categorize these
examples along the following dimensions (based on the
appearance of the input drawing and on the characteristics
of the recovered 3-D object):

a) • Three Dimensional [A B D E F G H I]

• Flat [C]

b) • Blocks World (planar faced solids with occluded
edges not rendered) [B H I]

• Origami (planar faced, possibly hollow) [C F]

• Wire Frame of Blocks World Object (all edges of
a blocks world object are given, and additional lines
between vertices of a planar face may be added) [A
D G]

• Restricted Wire Frame (every closed circuit of
lines, without interior lines in the given input representation,
corresponds to a planar face) [E]

• Non-Planar Wire Frame (none of the above)

c) • Symmetric [A B C E G H]

• Asymmetric [D F I]

d) • All Angles (approximately) Equal [A B E F H]

• A Few Distinct but Mostly Repeated Angles [C G I]

• Mostly Unequal Angles [D]

For the purposes of our discussion, we use Marill’s
categorization and augment it with our own subjective
evaluation where we disagree or need to add additional
attributes to those Marill provides. It is important to
remember that Marill always returns a wire frame as his
solution, regardless of the categorization of the object.
Thus, we would call the wire frame of a blocks world
object a correct solution if it was a geometrically correct
representation of the 3-D geometry of the edges of
the psychologically plausible blocks world object whose
orthographic projection corresponded to the input line
drawing, even though the wire frame does not provide
an explicit representation of the grouping of lines into
faces, etc.

Examples A, B, E, F, and H can all be visualized
as approximately equiangular three dimensional objects.
That is, each of the objects has an equiangular 3-D wire
frame as a psychologically plausible solution. Since these
equiangular solutions exactly satisfy Marill’s minimum
standard deviation of angles (MSDA) criterion, it is obvious
why Marill’s objective function should prefer what
we accept as the correct solutions in these cases. In the
other four cases, supposedly representative examples of
the ability of Marill’s algorithm to deal with complicated
structures having unequal angles, reasonably correct
solutions are also recovered, and it is this performance we
wish to understand.
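
As a concrete reading of the MSDA criterion, the following minimal sketch computes the standard deviation of the angles formed by all pairs of lines meeting at a common vertex of a 3-D wire frame. The function names, the use of radians, and the choice of the population (rather than sample) standard deviation are assumptions of this sketch, not details taken from Marill’s paper.

```python
import math
from itertools import combinations


def angles_at_vertices(points, lines):
    """Angles (radians) between every pair of lines that share a vertex."""
    incident = {}
    for a, b in lines:
        incident.setdefault(a, []).append(b)
        incident.setdefault(b, []).append(a)
    angles = []
    for v, nbrs in incident.items():
        for p, q in combinations(nbrs, 2):
            u = [points[p][k] - points[v][k] for k in range(3)]
            w = [points[q][k] - points[v][k] for k in range(3)]
            dot = sum(ui * wi for ui, wi in zip(u, w))
            nu = math.sqrt(sum(ui * ui for ui in u))
            nw = math.sqrt(sum(wi * wi for wi in w))
            # Clamp to guard against rounding slightly outside [-1, 1].
            angles.append(math.acos(max(-1.0, min(1.0, dot / (nu * nw)))))
    return angles


def sda(points, lines):
    """Population standard deviation of the vertex angles (an assumption)."""
    angles = angles_at_vertices(points, lines)
    mean = sum(angles) / len(angles)
    return math.sqrt(sum((a - mean) ** 2 for a in angles) / len(angles))


# A unit cube: all 24 vertex angles are 90 degrees, so the SDA is
# at or very near zero.
cube = [(x, y, z) for x in (0, 1) for y in (0, 1) for z in (0, 1)]
cube_lines = [(a, b) for a in range(8) for b in range(a + 1, 8)
              if sum(abs(p - q) for p, q in zip(cube[a], cube[b])) == 1]
print(sda(cube, cube_lines))
```

An equiangular wire frame such as the cube therefore sits at the minimum of this measure, which is why the equiangular examples need no further explanation.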

2.2  The  Performance  of  the  MSDA 
Principle 

Given its overall simplicity, it would be quite remarkable
if the MSDA principle generally converged to a
psychologically correct reconstruction. (A psychologically
correct reconstruction of a line drawing is the one that
virtually all people make.) Unfortunately, it is rather easy
to find examples where this is not the case, contrary to
Marill’s implied competence for the principle.

Examples J through N of Figure 1 are line drawings
for which Marill’s algorithm converged to solutions that
are clearly psychologically incorrect, even though these
drawings are not significantly more complicated or more
asymmetric than the examples that Marill used. (Figures
2, 3, 4, 5, and 6 illustrate both Marill’s reconstructions
and our reconstructions, as described in Section 3.) In
Examples J, K, and M it would appear that the fault
could lie with Marill’s overly simple gradient descent
algorithm because the SDA of the psychologically correct
answer is lower than the SDA for the solution Marill
actually obtains. Thus, one can argue that a more competent
global search strategy could have found the psychologically
correct answer using the same objective function.
However, Examples L and N are line drawings for
which the SDA of Marill’s solution is significantly lower
than that of the psychologically correct solution. Thus,
the MSDA principle is clearly not adequate to reliably
handle even simple line drawings.

Before  discussing  ways  of  augmenting  the  MSDA  prin¬ 
ciple  to  obtain  a  more  competent  algorithm,  we  attempt 
to  explain  the  performance  of  MSDA  for  line  drawings 
depicting  objects  that  are  not  equiangular. 

2.3  Evaluating  the  Performance  of  the 
MSDA  Principle 

It  is  not  immediately  obvious  why  the  MSDA  princi¬ 
ple  should  prefer  a  psychologically  plausible  answer  if 
the  object  depicted  in  the  line  drawing  contains  two  or 
more  significantly  different  angles  (e.g.,  C,  D,  G,  I,  and 
J). Marill offers no explanation for this phenomenon, and
thus  no  way  to  judge  the  conditions  under  which  his  al¬ 
gorithm  should  be  expected  to  succeed  or  fail.  In  this 
section  we  provide  a  partial  explanation  for  cases  (such 
as  C,  G,  J,  K,  and  L)  that  have  critically  important 
attributes — the  psychologically  correct  reconstruction  is 
a  3-D  planar-faced  object  whose  faces  are  either  equian¬ 
gular  or  form  “complete-star”  configurations  (see  Ap¬ 
pendix  B). 

To establish the role played by the above geometric
attributes, we define the planar orthographic extension
of a simple closed 2-D circuit in a line drawing to be any
orthographic extension for which the corresponding 3-D
contour is planar. If a line drawing contains more than
one simple closed 2-D circuit, then a planar orthographic
extension of the entire line drawing exists if we can cover
the line drawing with a set of simple closed 2-D circuits
such that (a) every angle in the drawing is included in at
least one circuit, and (b) each circuit projects to a 3-D
planar contour.⁵
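
To make the definition concrete, the sketch below lifts a 2-D circuit to 3-D by assigning a z value to each vertex (an orthographic extension) and tests whether the resulting contour is planar. The helper names and the tolerance are hypothetical choices made for illustration only.

```python
def _sub(a, b):
    return tuple(ai - bi for ai, bi in zip(a, b))


def _dot(a, b):
    return sum(ai * bi for ai, bi in zip(a, b))


def _cross(a, b):
    return (a[1] * b[2] - a[2] * b[1],
            a[2] * b[0] - a[0] * b[2],
            a[0] * b[1] - a[1] * b[0])


def orthographic_extension(circuit_xy, z):
    """Attach one z value to each (x, y) vertex; the x, y projection is unchanged."""
    return [(x, y, zi) for (x, y), zi in zip(circuit_xy, z)]


def is_planar(contour, tol=1e-9):
    """True if every vertex of the closed 3-D contour lies in a single plane."""
    p0 = contour[0]
    normal = None
    # Find the first non-degenerate normal spanned by p0 and two other vertices.
    for i in range(1, len(contour) - 1):
        n = _cross(_sub(contour[i], p0), _sub(contour[i + 1], p0))
        if _dot(n, n) > tol:
            normal = n
            break
    if normal is None:
        return True  # all vertices are collinear with p0
    return all(abs(_dot(normal, _sub(p, p0))) <= tol for p in contour)


square = [(0, 0), (1, 0), (1, 1), (0, 1)]
tilted = orthographic_extension(square, [0, 1, 2, 1])  # z = x + y, a plane
bent = orthographic_extension(square, [0, 1, 0, 1])    # alternating heights
print(is_planar(tilted), is_planar(bent))  # prints: True False
```

Both extensions project back to the same square; only the first is a planar orthographic extension in the sense defined above.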

In Appendices B, C, and D, we provide a number of
theorems that are pertinent to understanding the effectiveness
of the MSDA principle applied to planar orthographic
extensions. The main theorem, in Appendix D, asserts
that solutions with certain symmetries correspond
to the global minimum of the SDA over all planar orthographic
extensions (the specific symmetry condition we
examine is that all faces must either be equiangular or
form complete-star configurations).

Consequently, if there were some way to consider as
possible solutions only the planar orthographic extensions
of a line drawing (such as the psychologically plausible
solutions for Examples A, B, C, G, J, K, and L),
these solutions will be global minima of the SDA because
of the angular symmetry they exhibit. We show
in Example L that Marill’s algorithm is not constrained
to search only for planar solutions; while it will also find
solutions with non-planar facets that have lower SDAs
than the planar solutions, there is still the possibility
that MSDA shows at least a weak inherent preference
for planarity. While we cannot completely rule out this
possibility, it appears that the geometric constraints inherent
in the specific examples Marill selected, rather
than MSDA itself, are largely responsible for finding planar
faceted solutions. Specifically, triangles in the line
drawing will always produce planar facets in the orthographic
extension, and as we prove in Appendix B, a
closed four sided polygonal space curve with 90 degree
angles at each vertex will always be a planar configuration.
Since in Marill’s examples listed above, all the
facets satisfy these two geometric conditions, we see why
both the desired planarity and symmetry are present in
the computed solutions.⁶

Marill  offers  only  two  examples  (D  and  I)  which  are 
not  clear  instances  of  the  above  analysis  (all  angles  equal, 
or  symmetric  planar  facets).  His  solution  for  Example 
I  is  at  least  questionable  if  not  incorrect.  However,  this 
solution has almost all of its angles equal to 90 degrees
and  so  it  needs  no  further  explanation  if  we  accept  it  as 
correct. 

Marill’s solution to the asymmetric drawing of Example D
looks very reasonable; it has all its angles fairly
well distributed between 40 and 70 degrees and it is not
possible to find a more symmetric (equiangular) orthographic
extension for this line drawing. However, because
the input line drawing is a completely connected
set of triangular faces, every possible orthographic extension
will have planar facets and an acceptable topology;
it is hard to produce an unacceptable reconstruction for
this drawing.

⁵We note that while there generally can be many different ways
of covering a line drawing, those of blocks-world objects with hidden
lines removed will be covered uniquely if we demand that the
interior of the 2-D circuits be free of any lines. We also note that
it is not always possible to cover a line drawing with simple closed
circuits corresponding to the specified planar facets of a given
orthographic extension (see Example N). It may also be the case
that a given covering has no non-trivial orthographic extension
with planar facets as specified, as in Example O.

⁶There is one facet, in Example H, that is an exception to this
statement. However, there are enough other geometric constraints
in this one case to enforce planarity.

In  summary,  there  is  an  understandable  reason  why 
Marill’s  MSDA  principle  will  sometimes  tend  to  select 
planar  symmetric  3-D  wire  frames  when  a  purely  equian¬ 
gular  solution  is  not  possible.  But  we  also  see  that 
MSDA  will  make  unacceptable  errors,  even  in  simple 
cases, because it is not constrained to prefer solutions
with  planar  facets  unless  the  geometry  of  the  line  draw¬ 
ing  itself  forces  planarity. 

3 An Augmented MSDA Algorithm

What’s missing in Marill’s MSDA principle is a means for
enforcing the planarity of specified facets. In extending
the MSDA principle, we introduce an additional component
to the objective function, called DP, that increases
in value as the facets deviate from planarity. The new
objective function, $E(\lambda)$, is the sum of the previously
defined SDA term and DP:

$$E(\lambda) = \lambda\,\mathrm{SDA} + (1 - \lambda)\,\mathrm{DP}.$$

Thus, minimizing $E(\lambda)$ favors planar facets, but strict
planarity is not necessarily enforced.

To define the new component of the objective function,
DP, we first define a facet $f_i$ as a sequence of connected
lines $l_{i,j}$ in the line drawing (as mentioned previously,
determining which sequences should be considered planar
facets is still not a completely solved problem; see
Appendix A for the set of rules we used for the examples
in this paper). Each facet adds a term to DP that
is a measure of the non-planarity of that facet; it is the
following local non-planarity measure summed over the
lines in the facet:

$$1 - \frac{(l_{i,j-1} \times l_{i,j}) \cdot (l_{i,j} \times l_{i,j+1})}{\| l_{i,j-1} \times l_{i,j} \| \; \| l_{i,j} \times l_{i,j+1} \|}.$$
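
The following minimal sketch illustrates one reading of the DP term: for each line of a facet, one minus the cosine of the angle between the normals defined by successive pairs of lines, summed around the facet. The indexing convention, the treatment of collinear neighbouring lines, and all function names are assumptions of this sketch. Note that under this reading a planar but non-convex facet, whose successive normals flip sign, would also register as non-planar.

```python
import math


def _sub(a, b):
    return (a[0] - b[0], a[1] - b[1], a[2] - b[2])


def _dot(a, b):
    return a[0] * b[0] + a[1] * b[1] + a[2] * b[2]


def _cross(a, b):
    return (a[1] * b[2] - a[2] * b[1],
            a[2] * b[0] - a[0] * b[2],
            a[0] * b[1] - a[1] * b[0])


def facet_dp(points, facet):
    """Non-planarity of one closed facet given as a cyclic list of vertex indices."""
    n = len(facet)
    # Line j runs from facet vertex j to facet vertex j + 1 (cyclically).
    lines = [_sub(points[facet[(j + 1) % n]], points[facet[j]]) for j in range(n)]
    total = 0.0
    for j in range(n):
        n1 = _cross(lines[j - 1], lines[j])
        n2 = _cross(lines[j], lines[(j + 1) % n])
        norm = math.sqrt(_dot(n1, n1) * _dot(n2, n2))
        if norm > 0.0:  # collinear neighbours are skipped (assumption)
            total += 1.0 - _dot(n1, n2) / norm
    return total


def objective(sda_value, dp_value, lam):
    """Weighted combination of the SDA and DP terms for a given lambda."""
    return lam * sda_value + (1.0 - lam) * dp_value


square = [(0, 0, 0), (1, 0, 0), (1, 1, 0), (0, 1, 0)]
bent = [(0, 0, 0), (1, 0, 1), (1, 1, 0), (0, 1, 1)]
print(facet_dp(square, [0, 1, 2, 3]), facet_dp(bent, [0, 1, 2, 3]))
```

The flat square scores zero, while the bent quadrilateral with the same projection is penalized.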

Ideally, we would like to find the orthographic extension
of the line drawing with the lowest SDA that has
exactly planar facets. To achieve this, we use a so-called
continuation method (Leclerc 1989), which is a sequence
of gradient-descent steps applied to $E(\lambda)$. The sequence
begins with the initial condition that Marill suggests
(z = 0 for all points) and with $\lambda = 0.5$. Then, $\lambda$ is
decreased by a given amount and the gradient-descent
algorithm is applied anew, starting at the solution found
for the previous value of $\lambda$. This is repeated until $\lambda$ is
sufficiently close to zero so that no additional changes
occur with further reductions in $\lambda$.
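
The continuation loop can be sketched as follows. This is an illustration under stated assumptions rather than the authors’ implementation: the descent is a simple coordinate-wise probe of each z value by plus or minus a step (standing in for the gradient-descent procedure), the step sizes and the halving schedule for lambda are the ones quoted in Section 3.1, the sweep cap is a safeguard added here, and all function names are hypothetical.

```python
import math
from itertools import combinations


def _sub(a, b):
    return (a[0] - b[0], a[1] - b[1], a[2] - b[2])


def _dot(a, b):
    return a[0] * b[0] + a[1] * b[1] + a[2] * b[2]


def _cross(a, b):
    return (a[1] * b[2] - a[2] * b[1],
            a[2] * b[0] - a[0] * b[2],
            a[0] * b[1] - a[1] * b[0])


def sda(pts, lines):
    """Population standard deviation of angles between lines sharing a vertex."""
    nbrs = {}
    for a, b in lines:
        nbrs.setdefault(a, []).append(b)
        nbrs.setdefault(b, []).append(a)
    angles = []
    for v, ns in nbrs.items():
        for p, q in combinations(ns, 2):
            u, w = _sub(pts[p], pts[v]), _sub(pts[q], pts[v])
            c = _dot(u, w) / math.sqrt(_dot(u, u) * _dot(w, w))
            angles.append(math.acos(max(-1.0, min(1.0, c))))
    mean = sum(angles) / len(angles)
    return math.sqrt(sum((a - mean) ** 2 for a in angles) / len(angles))


def dp(pts, facets):
    """Sum over facets of 1 - cos(angle between successive facet normals)."""
    total = 0.0
    for f in facets:
        n = len(f)
        ls = [_sub(pts[f[(j + 1) % n]], pts[f[j]]) for j in range(n)]
        for j in range(n):
            n1 = _cross(ls[j - 1], ls[j])
            n2 = _cross(ls[j], ls[(j + 1) % n])
            norm = math.sqrt(_dot(n1, n1) * _dot(n2, n2))
            if norm > 0.0:  # collinear neighbours are skipped (assumption)
                total += 1.0 - _dot(n1, n2) / norm
    return total


def energy(z, xy, lines, facets, lam):
    pts = [(x, y, zi) for (x, y), zi in zip(xy, z)]
    return lam * sda(pts, lines) + (1.0 - lam) * dp(pts, facets)


STEPS = (0.125, 0.0625, 0.03125, 0.015, 0.007)


def descend(z, xy, lines, facets, lam, steps=STEPS, max_sweeps=100):
    """Coordinate-wise descent: keep a +/- step on one z only if the energy drops."""
    z = list(z)
    best = energy(z, xy, lines, facets, lam)
    for dz in steps:
        for _ in range(max_sweeps):
            improved = False
            for i in range(len(z)):
                for delta in (dz, -dz):
                    old = z[i]
                    z[i] = old + delta
                    e = energy(z, xy, lines, facets, lam)
                    if e < best:
                        best, improved = e, True
                    else:
                        z[i] = old  # reject the move
            if not improved:
                break
    return z


def continuation(xy, lines, facets, lam=0.5, halvings=4):
    """Start flat (z = 0), then re-run the descent while halving lambda."""
    z = [0.0] * len(xy)
    for _ in range(halvings + 1):
        z = descend(z, xy, lines, facets, lam)
        lam /= 2.0
    return z


xy = [(0.0, 0.0), (2.0, 0.0), (3.0, 1.0), (1.0, 1.0)]
lines = [(0, 1), (1, 2), (2, 3), (3, 0)]
print(continuation(xy, lines, facets=[[0, 1, 2, 3]]))
```

Each call to `descend` starts from the solution found for the previous, larger value of lambda, which is the essential feature of the method.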

Why not simply start with $\lambda$ close to zero in the first
place? The reason is that when $\lambda$ is sufficiently close to
zero, the local minima of $E(\lambda)$ are determined only by
the planarity component. Thus, simply starting with $\lambda$
close to zero would not allow us to find solutions with
low SDAs (in fact, when $\lambda = 0$, the original line drawing,
which is planar, is a local minimum of $E(\lambda)$). Although
we cannot affect the shape of $E(\lambda)$ when $\lambda$ is small, we
can choose the starting point for the gradient-descent
algorithm. Thus, the purpose of the continuation method
is to choose a sequence of starting points that are first
strongly influenced by the SDA term, but which eventually
become dominated by the DP term. The method
is not guaranteed to find a global minimum of the objective
function, but has yielded excellent answers for all
the examples discussed in this paper.

3.1  Results 

Figures 2-6 illustrate the results of our augmented
MSDA algorithm, and allow one to compare them with
both  Marill’s  reconstructions  and  the  original  3-D  ob¬ 
jects  that  were  used  to  generate  the  line  drawings.  The 
reconstructions  are  illustrated  both  graphically  (as  two 
views  in  the  upper  third  of  each  figure)  and  in  tabular 
form  in  the  lower  third.  The  first  column  of  the  table 
lists  the  z  coordinates  of  each  object,  the  second  col¬ 
umn  is  the  range  of  lengths  of  the  lines  of  each  object, 
the  third  column  is  the  mean  and  range  of  the  angles 
formed  by  all  line  pairs  meeting  at  a  common  vertex, 
the  fourth  column  is  the  standard  deviation  of  angles 
(SDA)  of  each  object,  and  the  fifth  column  is  the  devia¬ 
tion  from  planarity  (DP)  of  each  object.  To  simplify  the 
comparison of the results, the recovered z coordinates
have been normalized so that the first point always has
z = 0, and the second coordinate is always positive (this
normalization procedure has no effect on the objective
function). 
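
The normalization can be sketched as below; the function name is hypothetical. A shift along z and a reflection z to -z leave every line length and every angle unchanged, which is why neither SDA nor DP is affected.

```python
def normalize_z(z):
    """Shift so the first z is 0, then flip the sign if the second z is negative."""
    shifted = [zi - z[0] for zi in z]
    if len(z) > 1 and shifted[1] < 0:
        # Reflect about the plane z = z[0]; the first point stays at 0.
        shifted = [z[0] - zi for zi in z]
    return shifted


print(normalize_z([1.0, 0.5, 2.0]))  # prints: [0.0, 0.5, -1.0]
```

If the second point happens to share the first point’s depth, its normalized value is zero rather than positive; this sketch leaves such a case unflipped.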

We also applied our algorithm to Examples A through
I from Marill’s paper. Since his algorithm produced
approximately planar-faceted solutions by itself in all cases
but Example I, it isn’t surprising that our algorithm
produced solutions almost identical to his. The greatest
deviation from his result was for Example I, because
Marill’s algorithm recovered a significantly non-planar face
for the leftmost face of the line drawing.

In all of the examples, the $\Delta z$s used by Marill’s
algorithm (both as a stand-alone algorithm and within the
continuation method) were 0.125, 0.0625, 0.03125, 0.015,
and 0.007. We used a smaller initial $\Delta z$ than Marill
suggests because the larger one often forced the algorithm
out of the valley of attraction of the current local minimum.
Decreasing $\Delta z$ by a factor of two generally
allowed the algorithm to run in the fewest number
of iterations. Using a smaller final $\Delta z$ allowed the
algorithm to produce significantly more accurate solutions.
In the continuation method, $\lambda$ was started at 0.5, and
was decreased by a factor of two a total of four times.

Example J (Figure 2) illustrates Marill’s reconstruction
for a line drawing of a regular hexagonal prism.
This reconstruction not only appears psychologically
implausible from these two views, but, as we discuss in the
following section, the reconstructed object does not appear
rigid when rotated in real-time. It would appear
that at least part of the reason for this result is that the
recovered facets are clearly non-planar, as shown by the
value of DP in the table. Our reconstruction is almost
identical to the original hexagonal prism.

In Example K we see that the MSDA principle itself
is ambiguous for simple line drawings. Marill’s
reconstruction takes the line drawing of a planar hexagonal
plate (SDA=0.0) and reconstructs a non-planar object,
also with SDA=0.0. By enforcing planarity, however,
our reconstruction is quite close to the original hexagonal
plate.

In Examples L and N we see further evidence that the
MSDA principle by itself is inadequate for even simple
line drawings. In both examples, Marill’s reconstruction
has a significantly lower SDA than the original object,
and we consider both of these reconstructions to
be psychologically implausible. Our reconstruction of
Example L is quite close to the original object, modulo
an additive constant and flip of the z-coordinates of the
second object (which is invisible to the objective function).
Example N is a fairly ambiguous figure, and our
reconstruction favored a “hinge” with all angles close to
ninety degrees (the original object had a “hinge-angle”
of forty-five degrees). Because of the ambiguity of the
figure, there exists a family of reconstructions that we
consider psychologically plausible, including ours.

Example M shows the reconstruction of a figure for
which some of the planar facets are not equiangular.
Again, because some of the facets had more than four
sides, Marill’s algorithm failed to recover a psychologically
plausible object. Our reconstruction is reasonably
good, but it did adjust the right-angles in the large face
by as much as thirteen degrees in order to make the angles
in that face closer to being equal. Nonetheless, we
consider the reconstruction to be psychologically plausible.

4 Implications for Human Vision

Line drawings provide an effective means of communication
about the geometry of 3-D objects. It is a matter
of some debate as to whether the interpretation of line
drawings is a learned skill, or whether line drawings are
isomorphic to some intermediate construction of the human
visual system (HVS) in its normal processing of
imagery, but in either case an understanding of how humans
interpret line drawings is extremely important in
enabling man-machine communication with respect to
images, diagrams, and spatial constructs. In this section
we address two related questions arising out of the
investigation described earlier in this paper: (a) under what
conditions is a line drawing actually given some intended
3-D interpretation, and (b) under what conditions does
a moving rigid (wire frame) object actually appear rigid.

Some, but not all, line drawings are perceived by human
subjects as being three dimensional. What attributes
of the drawing promote such an interpretation,
and what are the constraints on the nature of the resulting
3-D construction? Partially because human introspection
is involved, this is a very difficult question
to answer. For example, if the drawing is recognized as
a known or previously encountered 3-D object, it might
be visualized this way even though it violates conditions
necessary for an unfamiliar object to be perceived as being
three dimensional. Gestalt psychologists have suggested
that if the drawing offers a simpler construct when
seen as three dimensional than when seen as being flat,
it will be perceived as being three dimensional; however,
an effective computational procedure to evaluate
“simpler” has yet to be provided (and there is also the
problem of producing the corresponding 3-D construct).


It appears to be much more productive to show a human
subject a candidate 3-D reconstruction and ask if
it corresponds to some given line drawing than it is to
tabulate introspective judgments about whether objects
appear to be 2-D or 3-D. The former approach, in fact, is
how Marill presents his results to the reader. Obviously,
he can’t show an actual 3-D reconstruction, but only a
projection. If he showed the reconstructed object projected
without some spatial relocation, then all we have
is the original line drawing back again, and no determination
can be made; Marill shows two projections of
his reconstructed objects, rotated by a few degrees, for
evaluation by the reader. Now we know that every orthographic
extension is a geometrically feasible reconstruction,
so on what basis does the human judge acceptability
(i.e., what we have called a psychologically feasible
reconstruction)? It is easy to hypothesize a whole list of
conditions that should be met, mostly different instantiations
of the idea that regularities (such as parallel lines
or equal angles and lengths) observed in the line drawing
are not accidental, and should be preserved in the reconstructed
object; orthographic projective invariants, such
as parallelism, should then also be preserved in the
reprojections of the spatially relocated object. One could
write computational procedures to search for such invariants,
but this approach seems incompatible with the
universality of the human evaluation process (e.g., none
of the invariants we happened to think of may be present
in the instances we are considering). A more powerful
idea is to require that the computational procedure that
produced the original reconstruction give the same result
when applied to any of its reprojections, i.e., a consistency
criterion.⁷ This is exactly the condition that obtains
when we observe a moving or rotating object to be
rigid; when we see a (continuous) sequence of projections
that we perceive as being isomorphic to the same geometric
reconstruction, we perceive the object as being
rigid.
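
The consistency criterion can be sketched as a simple test harness. Here `reconstruct` is a hypothetical callable standing for any recovery algorithm that maps the 2-D vertices of a line drawing to z values; the rotation axis, the comparison by pairwise vertex distances (which is insensitive to rigid motions and reflections), and the tolerance are assumptions of this sketch.

```python
import math
from itertools import combinations


def rotate_about_y(points, theta):
    """Rigidly rotate 3-D points by theta radians about the y axis."""
    c, s = math.cos(theta), math.sin(theta)
    return [(c * x + s * z, y, -s * x + c * z) for x, y, z in points]


def pairwise_distances(points):
    """Distances between all vertex pairs, in a fixed index order."""
    return [math.dist(p, q) for p, q in combinations(points, 2)]


def is_consistent(points, reconstruct, theta, tol=1e-6):
    """Reproject a rotated reconstruction and check that re-running the
    recovery procedure yields the same shape."""
    rotated = rotate_about_y(points, theta)
    xy = [(x, y) for x, y, _ in rotated]      # orthographic projection
    z_new = reconstruct(xy)                   # hypothetical recovery algorithm
    rebuilt = [(x, y, z) for (x, y), z in zip(xy, z_new)]
    d_old = pairwise_distances(points)
    d_new = pairwise_distances(rebuilt)
    return all(abs(a - b) <= tol for a, b in zip(d_old, d_new))


tetra = [(0, 0, 0), (1, 0, 0), (0, 1, 0), (0, 0, 1)]
theta = 0.3
# An idealized recovery procedure that returns the true depths passes the test;
# one that always returns a flat object does not.
oracle = lambda xy: [p[2] for p in rotate_about_y(tetra, theta)]
flat = lambda xy: [0.0] * len(xy)
print(is_consistent(tetra, oracle, theta), is_consistent(tetra, flat, theta))
```

Running the check over a sequence of small rotation angles would mimic the rotating display used to judge apparent rigidity.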

If we apply the above ideas to an evaluation of Marill’s
results, we find his algorithm often fails to recover
even a geometrically similar object from two different
orthographic projections (see Figure 7). Further, when
we use the computer to create a rotating display of some
of the reconstructions obtained with the use of Marill’s
algorithm, we see what appears to be the movement of
a non-rigid object (see Figure 8). Those reconstructions
that have pronounced non-planar facets appear to be
non-rigid, while objects with only planar facets appear
rigid under rotation. Thus augmenting the MSDA objective
function with a planarity term is a necessary step in
improving performance in the recovery task. We showed
in an earlier section that a “continuation method” based
optimization procedure can be easily formulated to bias
the search for a MSDA solution which also emphasizes
the formation and retention of planar facets; this algorithm
gives psychologically plausible answers for all of
the examples we have tested.

5  Discussion 

Traditional blocks world problems are mathematical in
nature; they deal with issues of existence and consistency
based strictly on geometric considerations; they make no
reference to what people actually see. The problem defined
by Marill is psychological: since every line drawing
has an infinite number of mathematically valid orthographic
extensions and no invalid ones, on what basis
does the HVS select a particular extension as being
psychologically acceptable? Marill proposed an intriguingly
simple criterion for duplicating human preference, but
we have shown that, while it often produces an acceptable
answer, it is unreliable even in very simple situations.

Marill’s work has similarities to the Huffman-Clowes-Waltz
approach, which focused on how polyhedral vertices
can appear in a line drawing, and hence, the constraints
such vertices impose on the implied 3-D model.
Marill only considers the constraints implied by
line-intersections at specified vertices in the line drawing.
Mackworth, Kanade, and Sugihara found it necessary to
introduce constraints based on the explicit assignment
of vertices to planar faces. In this paper we show the
need for introducing a similar explicit requirement for
planarity (actually, in the context of optimizing an objective
function, our constraint is soft in that it can be
violated). However, in our case, the requirement for planarity
is justified on psychological grounds rather than
for achieving a geometrically more competent algorithm.

⁷The successive reconstructions are not independent; to the
extent that they allow a range of interpretations, the parameters
selected for one interpretation will influence the parameter
selections for successive interpretations.

The preference of the HVS to interpret a line drawing
as the most symmetric polyhedral (planar faced) object
consistent with the drawing is well established in the
psychological literature. Marill appeared to have discovered
a simple computational procedure for finding such
solutions for any given line drawing,⁸ but on closer
examination, it became apparent that his MSDA principle
does not enforce (or even prefer) planar solutions. Because
of this deficiency, MSDA is unreliable even in very
simple situations. We were able to prove (Appendix D)
that if a planarity preference is explicitly added to the
MSDA objective function, then indeed, the non-obvious
preference for symmetric solutions is also present. However,
we are now forced to address the problem of how to
provide the auxiliary information necessary to partition
the drawing into the coherent components corresponding
to the 3-D planar facets. It appears that the HVS
selects some subset of the contours in the line drawing
as corresponding to the planar facets in the 3-D model,
and if we do not supply this information to a recovery
algorithm (either explicitly or by providing a set of conditions
implying the same information), we will fail to
recover psychologically acceptable models.

Most of the work in the blocks world tradition employed
perfect labeled line drawings with the assignment
of vertices to faces given as part of the input specification.
If we follow the same approach (although we are
not concerned with having perfect line drawings since our
recovery method employs optimization which can tolerate
deviations from any of the constraints embodied in
the objective function), then we at least have provided a
tool for simplifying man-machine communication using
the language of line drawings. However, there is obvious
theoretical value in understanding the criterion for
human selection of the circuits in the line drawing that
correspond to planar facets in the 3-D model.⁹ In part,
this importance is related to the issue of how the HVS
recovers the shape of a moving object. Even though
there are a few well known exceptions, it is widely believed
that the HVS will assume an object to be rigid and
correctly recover its shape if this is indeed the case.¹⁰
However, the rigid wire frames with non-planar facets
provide a whole class of counter-examples to this belief:
they appear to be non-rigid when observed in motion
(even at very low speeds where maintaining correspondence
of vertices from one projection to the next is no
problem). The non-rigidity appears to result from the
HVS making incorrect decisions about how the drawing
can be partitioned into planar facets (see Appendix E).

⁸Marill, of course, only returns the wire frame. But in the case
of a Blocks World object, competent algorithms exist for finding
all the valid completions of the wire frame as a solid polyhedral
object (Strat 1984, Markowsky and Wesley 1981).

⁹As noted in Appendix A, we have made some initial progress
toward the solution of this problem and have developed an
algorithmic procedure that can successfully handle all of the examples
discussed in this paper; but we recognize that this is still far short
of a complete solution.

¹⁰For example, by using Ullman’s result that three distinct
orthographic projections of four non-coplanar points in a rigid
configuration are sufficient to uniquely determine the structure and
motion up to a reflection about the image plane.

6  Summary 

Marill’s recently published paper claimed that the simple
procedure he described could duplicate human judgement
in recovering the 3-D wire frame geometry of objects
depicted in line drawings. He provided some impressive
examples, but no theoretical justification to
back his claims. In this paper, we critically examined the
merits of Marill’s algorithm, provided at least a partial
explanation for its competence, identified weaknesses,
showed how it could be improved, and discussed the
implications of this work for clarifying some important
problems in human perception.

In  particular,  we  provided  a  number  of  theorems  which 
show  that  minimizing  the  standard  deviation  of  angles  is 
(potentially)  a  simple  and  effective  method  for  selecting 
symmetric  solutions  when  the  constraining  line  drawing 
(which  is  the  projection  of  a  wire  frame  that  may  be 
incomplete)  permits  such  interpretation.  On  the  other 
hand,  we  showed  that  Marill’s  algorithm  could  fail  in 
simple  cases,  that  he  employed  an  optimization  proce¬ 
dure  that  was  often  too  weak  to  find  the  correct  answer 
even  when  it  was  within  the  competence  of  the  objective 
function,  and  that  the  algorithm  would  often  produce 
wire frames with non-planar facets (something no human
would intuitively accept in perceiving a straight-line
drawing as a 3-D configuration).

We  argued  that  an  important  condition  in  testing  or 
evaluating  the  psychological  correctness  of  a  reconstruc¬ 
tion  is  that  its  reprojections  (after  spatial  relocation) 
result  in  the  same  object  being  produced  by  the  re¬ 
covery  algorithm.  For  the  human  visual  system,  this 
is equivalent to the condition that the recovered object
appear  rigid  when  observed  during  movement  or  rota¬ 
tion.  The  perception  of  rigidity  for  wire  frames  appears 
to  be  highly  correlated  with  the  presence  or  absence  of 
strongly  non-planar  facets.  By  modifying  Marill’s  objec¬ 
tive  function  to  explicitly  favor  planar-faceted  solutions, 
and  by  using  a  more  competent  optimization  technique, 
we  were  able  to  demonstrate  significantly  improved  per¬ 
formance  in  all  of  the  examples  Marill  provided  as  well 
as  those  additional  ones  we  constructed  ourselves. 


References 

Clowes, M. B. (1971). On seeing things. Artificial
Intelligence, 2(1), 79-116.

Draper, S. W. (1981). The use of gradient and dual
space in line-drawing interpretation. Artificial
Intelligence, 17, 461-508.

Huffman, D. A. (1971). Impossible objects as nonsense
sentences. In Meltzer and Michie, editors,
Machine Intelligence 6, pages 295-323, Edinburgh
Univ. Press.

Kanade, T. (1980). A theory of origami world.
Artificial Intelligence, 13(1), 279-311.

Leclerc, Y. G. (1989). Constructing simple stable
descriptions for image partitioning. International
Journal of Computer Vision, 3(1), 73-102.

Mackworth, A. K. (1973). Interpreting pictures of
polyhedral scenes. Artificial Intelligence, 4(2), 121-137.

Marill, T. (1991). Emulating the human interpretation
of line-drawings as three-dimensional objects.
IJCV, 6(2), 147-161.

Markowsky, M. A. and Wesley, G. (1981). Fleshing out
projections. IBM J. R&D, 25(6), 934-954.

Strat, T. M. (1984). Spatial reasoning from line
drawings of polyhedra. In DARPA Image Understanding
Workshop, pages 230-235.

Sugihara, K. (1982). Mathematical structures of line
drawings of polyhedrons: toward man-machine
communication by means of line drawings. IEEE PAMI,
4(5), 458-469.

Sugihara, K. (1984). A necessary and sufficient
condition for a picture to represent a polyhedral scene.
IEEE PAMI, 6(5), 578-586.

Waltz, D. A. (1972). Generating semantic descriptions
from line drawings of scenes with shadows. Technical
Report AI-TR-271, MIT.

Whitely, W. (1986). Two algorithms for polyhedral
pictures. In Proc. Second Annual Symp. on
Computational Geometry, pages 142-149.




Appendices 


A  Psychological  Assumptions 

The  following  are  some  of  the  basic  assumptions  that 
we  believe  are  typically  made  by  people  in  the  recon¬ 
structions  of  wire  frames  from  line  drawings,  and  some 
constraints  relevant  to  partitioning  a  line  drawing  into 
planar facets. They are known to have rare exceptions.

1)  Three  dimensional  wire  frames,  derived  from  line 
drawings,  have  implied  planar  faces  inside  subsets  of 
their closed circuits; they can also have struts, such as
legs  or  bracing  wires,  in  or  on  a  planar  face.  (Strongly 
non-planar  facets  produce  psychologically  unacceptable 
solutions.) 

2)  Symmetric  reconstructions  are  preferred  over  non- 
symmetric  ones. 

3)  Parallel  lines  in  a  line  drawing  are  parallel  in  space. 
Lines  connecting  vertices  falling  on  two  parallel  lines  are 
in  a  common  plane  with  the  two  parallel  lines. 

4)  Many-sided  convex  closed  contours  without  internal 
circuits (in a 2-D line drawing) are likely to correspond
to  the  contours  of  planar  faces  in  the  corresponding  3-D 
orthographic  extension  (see  B4). 

5)  A  closed  simple  contour  in  a  line  drawing,  with¬ 
out  internal  lines,  corresponds  to  a  planar  face  in  the 
corresponding  3-D  reconstruction. 

An  algorithmic  procedure  for  identifying  3-D  planar 
facets  in  the  corresponding  2-D  line  drawing  was  con¬ 
structed  by  composing  the  requirements  of  items  3,  4, 
and  5  into  a  single  algorithm.  This  procedure  is  suffi¬ 
cient  to  deal  with  all  of  the  examples  we  discuss  in  this 
paper,  but  is  not  general  enough  to  handle  other  cases 
we  can  think  of. 

B  Projective  Invariants 

The  following  are  some  important  projective  invariants 
for  planar  geometric  structures. 

1)  The  sum  of  the  interior  angles  (measured  between  0 
and  360  degrees)  of  a  closed  planar  contour  with  n  sides 
equals  (n-2)180  degrees.  Thus,  since  a  polygon  of  n  sides 
projects  to  a  polygon  of  n  sides  under  both  orthographic 
and  central  projection,  the  mean  value  of  the  interior 
angles  of  a  given  closed  planar  contour  [(n-2)180/n]  is 
invariant  under  both  orthographic  and  central  projec¬ 
tion. 

We  note  that  Marill  measures  angles  only  in  the  inter¬ 
val  between  0  and  180  degrees.  To  the  extent  that  we  are 
primarily  concerned  with  equiangular  closed  contours  in 
the  application  of  the  above  theorem  in  explaining  and 
using  his  results,  this  discrepancy  is  irrelevant  since  all 
the  interior  angles  of  such  contours  are  less  than  180 
degrees. 
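
The mean-angle invariant of item 1 can be checked numerically. The following is a minimal sketch, not part of the original procedure: the helper names (`interior_angles`, `tilt_and_project`) are hypothetical, and the orthographic projection is modeled by rotating a planar polygon about the X axis and dropping the Z coordinate.

```python
import math

def interior_angles(polygon):
    """Interior angles in degrees (0..360) of a simple 2-D polygon.

    Vertices are assumed to be listed counter-clockwise.
    """
    n = len(polygon)
    angles = []
    for i in range(n):
        px, py = polygon[i - 1]
        cx, cy = polygon[i]
        nx, ny = polygon[(i + 1) % n]
        to_prev = math.atan2(py - cy, px - cx)
        to_next = math.atan2(ny - cy, nx - cx)
        angles.append(math.degrees(to_prev - to_next) % 360.0)
    return angles

def tilt_and_project(polygon, tilt_deg):
    """Rotate a planar polygon about the X axis, then drop Z (orthographic view)."""
    c = math.cos(math.radians(tilt_deg))
    return [(x, y * c) for x, y in polygon]

def mean(values):
    return sum(values) / len(values)

# Regular pentagon, counter-clockwise.
pentagon = [(math.cos(2 * math.pi * k / 5), math.sin(2 * math.pi * k / 5))
            for k in range(5)]
flat = interior_angles(pentagon)
tilted = interior_angles(tilt_and_project(pentagon, 60.0))
print(mean(flat), mean(tilted))   # both equal (5 - 2) * 180 / 5 up to rounding
print(min(tilted), max(tilted))   # the individual angles are no longer equal
```

The individual angles change under projection, but their mean does not, which is why only the spread of the angles can distinguish one planar extension from another.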


2) Consider an angle (two line segments sharing a
common  endpoint)  in  3-D  space  and  its  orthographic 
projection.  We  will  call  the  plane  containing  the  angle 
the  source  plane,  and  the  plane  containing  its  projec¬ 
tion  the  projection  plane.  If  the  angle  is  translated  in 
the  source  plane,  its  projection  is  also  translated,  but 
does  not  change  in  magnitude  from  its  original  projected 
value.  Now  consider  a  set  of  n  angles  lying  on  a  com¬ 
mon  source  plane,  such  that  the  sum  of  these  angles  is 
360  degrees.  If  it  is  also  the  case  that  the  angles  can  be 
translated  so  that  when  all  their  vertices  coincide,  they 
exactly  span  an  angle  of  360  degrees,  then  the  mean 
value of the set of angles (360/n) is unaltered under
orthographic projection. We will call such a collection of
angles a “complete-star.”** We note that if an
essentially infinite number of copies of an angle of d degrees
(where  360/d  =  k  and  k  is  an  integer)  is  uniformly  dis¬ 
tributed  in  orientation  over  a  plane,  then  the  mean  value 
of  the  angles  under  any  orthographic  projection  of  the 
plane  is  the  constant  value  d. 

3)  We  note  that  if  the  angle  between  two  line  segments 
is less than 180 degrees, the angle can be closed to form
a  triangle,  and  since  triangles  are  preserved  under  both 
orthographic  and  central  projection,  an  angle  of  less  than 
180  degrees  will  never  transform  under  such  projections 
into one of more than 180 degrees. We will call a closed
planar  contour  convex  if  the  region  it  bounds  is  convex. 
Since  a  convex  contour  has  all  internal  angles  of  less  than 
180  degrees,  a  convex  planar  contour  remains  convex  un¬ 
der  both  orthographic  and  central  projection. 

4) We note that the orthographic projection of an
arbitrary non-planar polygonal space curve, with four or
more sides, is either a non-simple or a concave curve
with a probability (P) that increases with the number
of sides:

P>l-..5""^  forn>4 

This expression is based on the following model: Consider
a process that generates a chain of 3-D random vectors
tors  by  generating  three  random  numbers  for  each  vector 
(in  spherical  coordinates,  an  angle  uniformly  distributed 
between  0  and  360  degrees,  a  second  angle  between  0  and 
180  degrees,  and  a  length  uniformly  distributed  between 
0  and  some  fixed  integer  L).  As  each  vector  is  generated 
we  extend  the  projection  of  the  developing  space  curve 
on  the  X-Y  image  plane.  The  process  stops  after  some 
fixed  number  of  steps  which  is  determined  by  choosing 
a  random  number  in  some  given  range;  the  curve  is  now 
closed  by  connecting  the  starting  point,  which  could  be 
the  origin  of  the  X-Y  plane,  to  the  last  point  gener¬ 
ated  and  this  determines  whether  the  inside  is  to  the 
left  or  right  as  we  follow  the  chain  of  edges  of  the  pro¬ 
jected polygon. We note that the only relevant factor in
whether the projected closed contour is convex or concave
is the cylindrical angle giving the rotation of each
of the random vectors relative to the X axis in the image
plane. For more than three sides, there is a 50% probability
at each vertex that the inside angle is greater than
180 degrees, which thus produces a concave polygon (the
last closing side can be ignored since it does not have the
same statistics as the other edges in our random model).
Other probabilistic models would give non-identical but
similar results. The > condition is based on additional
considerations, such as the projected curve intersecting
itself even though the input specification does not record
a vertex at the cross point.

** Example C, for instance, contains a complete-star consisting
of the eight 45 degree angles formed at the corner vertices by the
diagonals with the sides of the square. Example G contains this
same configuration in its central plane.
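
The random-chain model of item 4 can be explored with a small Monte Carlo experiment. This is a sketch under stated assumptions rather than the authors' procedure: the function names are hypothetical, the number of sides is fixed per run instead of being drawn at random, and a projected polygon counts as acceptable only if it turns the same way at every vertex and winds around exactly once.

```python
import math
import random

def is_simple_convex(points):
    """True if the closed 2-D polygon turns the same way at every vertex
    and winds around exactly once (total turning of 360 degrees)."""
    n = len(points)
    total_turn = 0.0
    signs = set()
    for i in range(n):
        ax, ay = points[i - 1]
        bx, by = points[i]
        cx, cy = points[(i + 1) % n]
        ux, uy = bx - ax, by - ay
        vx, vy = cx - bx, cy - by
        cross = ux * vy - uy * vx
        dot = ux * vx + uy * vy
        if cross == 0.0:
            return False          # degenerate (collinear or zero-length) edge
        signs.add(cross > 0.0)
        total_turn += math.atan2(cross, dot)
    return len(signs) == 1 and abs(abs(total_turn) - 2 * math.pi) < 1e-6

def random_projected_polygon(n_sides, rng, max_length=10.0):
    """Project a chain of n_sides - 1 random 3-D vectors onto the X-Y plane.

    The closing side back to the origin is implicit in the vertex list.
    """
    pts = [(0.0, 0.0)]
    for _ in range(n_sides - 1):
        azimuth = rng.uniform(0.0, 2 * math.pi)
        polar = rng.uniform(0.0, math.pi)
        length = rng.uniform(0.0, max_length)
        x, y = pts[-1]
        pts.append((x + length * math.sin(polar) * math.cos(azimuth),
                    y + length * math.sin(polar) * math.sin(azimuth)))
    return pts

def non_convex_fraction(n_sides, trials=2000, seed=0):
    """Fraction of random projections that are non-simple or concave."""
    rng = random.Random(seed)
    bad = sum(not is_simple_convex(random_projected_polygon(n_sides, rng))
              for _ in range(trials))
    return bad / trials

for n in (4, 5, 6, 8):
    print(n, non_convex_fraction(n))
```

Running the loop for increasing `n` gives an empirical estimate of P that can be compared against the bound stated in item 4.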

5) Closed 4-sided polygonal space curves with 90 degree
angles at each vertex are planar contours. To prove
this assertion, let the sequence of vertices be labeled a,
b, c, and d. Let the plane containing lines L_ab and L_bc
(and thus vertices a, b, and c) be called P_1. Since all
angles are 90 degrees, L_cd must lie in a plane (P_2)
normal to L_bc at c. Similarly, L_ad must lie in a plane (P_3)
normal to L_ab at a. Vertex d must then lie on the line
(L_d) of intersection of P_2 and P_3, which is normal to P_1.
We know one solution is to locate d at the point of
intersection (d*) of L_d and P_1 (where a, b, c, and d* form
a rectangle). This is the planar solution and we wish to
show that no other solution is possible. We note that a
second constraint on the location of d is that it must lie
on a sphere with diameter ac (i.e., all right angles, with
legs passing through points a and c, must be inscribed
angles of circles through a and c with diameter ac). We
know d* lies on the sphere and P_1 is a bisecting plane of
the sphere. Thus L_d is tangent to the sphere at d* and
d* is the only possible solution.


C  A  Partition  Theorem 

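The variance of a set S of n objects {a_i} is defined as:

$$V = \frac{1}{n}\left[\sum_{i=1}^{n} a_i^2\right] - M^2 = \frac{1}{n}\sum_{i=1}^{n} (a_i - M)^2$$

where:

$$M = \frac{1}{n}\sum_{i=1}^{n} a_i$$

Let us now partition the {a_i} into k subsets, such that
subset S_j has n_j elements and mean M_j where:

$$M_j = \frac{1}{n_j}\sum_{a_i \in S_j} a_i$$

Let V_j be the variance of S_j about M_j and let A_j =
(M - M_j).

Theorem:

$$V = \frac{1}{n}\sum_{j=1}^{k} n_j\left[V_j + A_j^2\right]$$

Proof: The expression for V can be rewritten as:

$$V = \frac{1}{n}\left[\sum_{S_1}[a_i - (M_1 + A_1)]^2 + \sum_{S_2}[a_i - (M_2 + A_2)]^2 + \cdots + \sum_{S_k}[a_i - (M_k + A_k)]^2\right]$$

If we let:

$$V_j^* = \sum_{S_j}[a_i - (M_j + A_j)]^2$$

Then we have:

$$\frac{V_j^*}{n_j} = \sum_{S_j}\frac{a_i^2}{n_j} + M_j^2 + A_j^2 - 2A_j\sum_{S_j}\frac{a_i}{n_j} - 2M_j\sum_{S_j}\frac{a_i}{n_j} + 2M_jA_j$$

Given that $\sum_{S_j} a_i/n_j = M_j$, we note that the 4th and
6th terms cancel and the 2nd and 5th terms combine:

$$\frac{V_j^*}{n_j} = \left[\sum_{S_j}\frac{a_i^2}{n_j} - M_j^2\right] + A_j^2 = V_j + A_j^2$$

And:

$$V_j^* = n_j\left[V_j + A_j^2\right]$$

QED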
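
The partition theorem is easy to confirm numerically. The sketch below is illustrative only: the function names and the angle values are made up for the example. It compares the variance of a pooled set of angles with the weighted sum of within-subset variances and squared mean offsets.

```python
def mean(values):
    return sum(values) / len(values)

def variance(values):
    """Population variance about the mean."""
    m = mean(values)
    return sum((v - m) ** 2 for v in values) / len(values)

def partitioned_variance(subsets):
    """Variance of the pooled elements, computed subset by subset.

    Each subset contributes n_j * (V_j + A_j^2), where A_j is the offset
    between the overall mean and the subset mean.
    """
    n = sum(len(s) for s in subsets)
    overall_mean = sum(sum(s) for s in subsets) / n
    total = 0.0
    for s in subsets:
        offset = overall_mean - mean(s)
        total += len(s) * (variance(s) + offset ** 2)
    return total / n

# Hypothetical angle sets (degrees): a square, a triangle, a skewed quadrilateral.
contours = [[90.0, 90.0, 90.0, 90.0], [60.0, 60.0, 60.0], [45.0, 135.0, 45.0, 135.0]]
pooled = [a for contour in contours for a in contour]
print(variance(pooled), partitioned_variance(contours))
```

Both printed values agree up to floating-point rounding, for any way of splitting the pooled set into subsets.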


D A Symmetric Preference Theorem

Recall that:

1) In Appendix B we showed that the average angle
is the same for all planar orthographic extensions of a
given simple closed 2-D contour, and that the average
angle is likewise the same for all planar orthographic
extensions of a complete-star.

2)  In  Appendix  C  we  proved  a  theorem  that  allows  us 
to  compute  the  SDA  of  a  set  of  simple  closed  planar  con¬ 
tours  (and/or  complete-stars)  as  the  sum  of  two  compo¬ 
nents.  The  first  component  is  the  variance  of  the  angles 
in  a  contour  or  star  about  the  mean  angle  of  that  contour 
or  star,  summed  over  all  contours  and  stars.  The  second 
component  is  a  weighted  sum  of  the  squared  differences 
between  the  mean  angle  of  each  contour  and  star,  and 
the  average  of  all  the  angles  under  consideration. 




By  1),  the  second  component  of  the  variance  is  con¬ 
stant  over  all  planar  orthographic  extensions  because  (a) 
the  mean  of  each  contour  and  star  is  constant  over  all 
such  extensions,  and  (b)  the  mean  of  all  angles  can  be 
computed  as  the  weighted  sum  of  the  mean  of  each  con¬ 
tour  and  star. 

Consequently, if we restrict our attention to the planar
orthographic extensions of a line drawing, then by
2) above only the first component of the variance will
change over the extensions. Since the first component is
zero for an extension comprising only equiangular planar
contours and stars (such as the solutions for Examples
A, B, C, G, J, K, and L), and since it is positive
otherwise, such symmetric solutions correspond to the
global minimum of the SDA over all planar orthographic
extensions.


E Factors Affecting The Perception Of Non-Rigidity

If we rotate a randomly derived orthographic extension
of almost any of the line drawings used as examples
in this paper, the object appears non-rigid to most
observers (even though, of course, the wire frame is actually
a rigid object). While there are many possible explanations
for this phenomenon, our conjecture is that it is
primarily due to special position projections of the wire
frame (that occur at one or more poses in its rotation)
that lead the HVS to incorrectly assume that some
projective invariant (such as parallel lines, see Figure 8) is
being observed. This, in turn, causes incorrect expectations
about the presence and location of planar facets.

We informally looked at some other possible causative
factors, but did not observe consistent non-rigidity
phenomena. For example, we looked at objects, such as
Example N, that produce compelling 3-D interpretations
with Necker reversals, but for which the drawing is
incomplete: it does not show all the edges that should
be visible, e.g., where planar faces intersect. There was
the possibility that these missing edges in the 3-D model
(and thus missing lines in the drawing) could cause the
appearance of a non-planar faceted object to be
observed. But the hinge, and the few other objects we
looked at in this category, appeared rigid.

We also looked at non-planar orthographic extensions
of drawings that generally appeared flat, including
blocks world type drawings that do not have corresponding
polyhedral realizations (such as Example O). The results
here were ambiguous. The rotating objects generally
produced illusions of non-rigidity, but since these
objects did not always appear three-dimensional, the
illusions were generally very weak.

Some other casual experiments included cases where all
the lines connecting the vertices of the wire frames were
deleted; we then observed that some of the wire frames
that originally appeared non-rigid now appeared to be
rigid under rotation. And, as a general observation, we
have not encountered any examples in which the wire
frame of a blocks world object appears non-rigid when
in motion.
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Figure  1:  The  line  drawings  examined  in  this  paper.  Examples  A  through  I  are  taken  from  Marill’s  paper.  Examples 
J  through  N  are  line  drawings  introduced  in  this  paper  for  which  Marill’s  algorithm  failed  to  recover  a  psychologically 
plausible 3-D model. Example O is a line drawing for which a psychologically plausible 3-D model is not feasible.
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
| | Zs | Lengths | Angles (Mean / Range) | SDA | DP |
|---|---|---|---|---|---|
| Original Object | 0.00 0.10 0.87 1.53 1.43 0.66 -2.23 -2.12 -1.36 -0.69 -0.80 -1.57 | 1.0 to 4.0 | 100.0 / 90.0 to 120.0 | 0.060923 | 0.000000 |
| Marill’s Reconstruction | 0.00 0.42 -3.02 -2.37 -3.04 0.68 -0.51 0.16 -3.56 -2.87 -3.29 0.14 | 1.0 to 3.8 | 84.2 / 48.4 to 109.9 | 0.109002 | 0.035421 |
| Our Reconstruction | 0.00 0.11 0.92 1.58 1.47 0.68 -2.09 -1.98 -1.18 -0.51 -0.62 -1.43 | 1.0 to 3.9 | 100.0 / 89.0 to 121.0 | 0.061034 | 0.000000 |

Figure 2: Example J. This line drawing was created by orthographically projecting a specific 3-D wire frame object.
In  this  case,  the  object  was  a  regular  hexagonal  prism.  Although  arbitrary  line  drawings  can  be  used  as  input  to 
the  reconstruction  algorithms  described  in  this  paper  (with  greater  or  lesser  success  in  reconstruction),  all  of  the 
examples  introduced  in  this  paper  were  created  by  starting  with  specific  3-D  objects.  The  panels  in  the  upper  right 
show  two  views  of  the  object  reconstructed  by  Marill’s  algorithm.  The  first  view  is  of  the  object  rotated  about  the 
vertical  axis  by  30  degrees,  and  the  second  is  of  the  object  rotated  about  the  horizontal  axis  by  90  degrees.  The  two 
panels  in  the  lower  right  show  two  views  of  the  object  reconstructed  by  our  algorithm.  The  table  below  this  is  the 
internal  representation  of  the  line  drawing  used  by  the  reconstruction  algorithms.  Note  that  intersections  such  as 
those between lines (1 7) and (2 3) are not represented. Marill’s algorithm uses only the first two components of this
representation.  The  third  component  is  derived  from  the  line  drawing  using  heuristics  described  in  Appendix  A.  The 
table  at  the  bottom  shows  the  results  of  the  reconstructions  in  written  form. 
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Figure 4: Example L. Note that Marill’s unacceptable reconstruction has an SDA that is significantly lower than
that of the psychologically correct original object. Thus, the MSDA principle itself has failed in this instance.
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Truncated Box (Example M). Panels: Marill’s Reconstruction; Our Reconstruction.

Points 0.86 0.81) (0.54 0.81) (-0.18 0.1

Lines (0 1) (1 2) (2 3) (3 4) (4 0) (5 6) (6 7) (7 8) (8 9) (9 5) (0 5) (1 6) (2 7) (3 8) (4 9)

Faces (0 1 2 3 4) (5 6 7 8 9) (0 5 9 4) (4 3 8 9) (3 2 7 8) (2 1 6 7) (1 0 5 6) (4 0 5 9)

| | Zs | Lengths | Angles (Mean / Range) | SDA | DP |
|---|---|---|---|---|---|
| Original Object | 0.00 0.77 0.61 0.06 -0.32 0.28 1.04 0.88 0.34 -0.04 | 0.5 to 1.0 | 96.0 / 90.0 to 135.0 | 0.071281 | 0.000000 |
| Marill’s Reconstruction | 0.00 0.77 0.97 0.87 -0.16 0.28 0.96 0.58 0.95 -0.01 | 0.4 to 1.1 | 86.9 / 43.0 to 137.1 | 0.126276 | 0.094751 |
| Our Reconstruction | 0.00 0.57 0.53 0.07 -0.23 0.37 0.94 0.89 0.42 0.13 | 0.4 to 1.0 | 95.9 / 82.0 to 129.1 | 0.055758 | 0.000037 |

Figure 5: Example M. Note that our reconstruction has a slightly lower SDA than that of the original object,
indicating  the  preference  of  our  algorithm  for  equiangular  faces. 
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Hinge (Example N). Panels: Marill’s Reconstruction; Our Reconstruction.

Points (-0.58 0.24) (0.95 1.36) (1.50 1.04) (-0.02 -0.08) (0.30 2.89) (0.86 2.56)

Lines (0 1) (1 2) (2 3) (3 5) (5 4) (4 0)

Faces (0 1 2 3) (0 4 5 3)

| | Zs | Lengths | Angles (Mean / Range) | SDA | DP |
|---|---|---|---|---|---|
| Original Object | 0.00 0.64 -0.12 -0.77 -0.47 -1.24 | 1.0 to 2.8 | 75.0 / 45.0 to 90.0 | 0.137078 | 0.000000 |
| Marill’s Reconstruction | 0.00 1.16 -1.61 -0.85 -1.19 0.86 | 2.0 to 3.3 | 63.9 / 63.3 to 64.5 | 0.000087 | 0.071300 |
| Our Reconstruction | 0.00 2.04 1.80 -0.19 -2.01 -2.19 | 0.7 to 3.4 | 89.4 / 88.0 to 90.1 | (illegible) | (illegible) |


Figure  7:  Illustration  of  the  failure  of  Marill’s  algorithm  to  recover  geometrically  similar  3-D  models  from  two  different 
projections  of  the  same  3-D  object.  The  top  row  shows  the  input  line  drawing  of  the  3-D  object  as  seen  from  one 
view-point  (similar  to  Example  G),  and  two  views  of  Marill’s  reconstructed  object.  The  bottom  row  shows  the  input 
line  drawing  of  the  same  3-D  object  as  seen  from  a  different  view-point,  and  two  views  of  Marill’s  reconstructed 
object.  The  two  reconstructed  objects  not  only  appear  different,  but  are  in  fact  significantly  different  geometrically, 
as  we  verified  by  examining  their  internal  representation.  In  contrast,  applying  our  algorithm  to  both  of  these  input 
line  drawings,  as  well  as  ten  other  randomly  chosen  views  produced  reconstructions  with  an  angular  error  of  less 
than  thirteen  degrees  from  the  original  object. 
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Figure 8: The illusion of non-rigidity for a rotating wire frame with non-planar facets. The wire frame, Marill’s
reconstruction  of  Example  J,  is  rotated  about  a  vertical  axis  in  the  center  of  the  object.  The  rotation  angle  is  written 
in  the  lower  left-hand  corner  of  each  box. 
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Abstract 

At  issue  is  the  importance  of  fully  accounting  for  3D  per¬ 
spective  in  3D  model-based  robot  navigation.  A  prob¬ 
abilistic  combinatorial  optimization  algorithm  searches 
for  an  optimal  match  between  landmark  and  image  fea¬ 
tures  by  initiating  an  iterative  generate-and-test  pro¬ 
cedure  from  a  randomly  selected  set  of  correspondence 
mappings. The 2D-to-2D version of this algorithm
approximates full 3D perspective with a 2D affine
transform - rotation, translation and scale - applied to a
2D  projection  of  the  3D  landmark  model.  A  3D-to-2D 
version  recomputes  the  robot’s  3D  pose  relative  to  the 
model  and  reprojects  the  model  during  matching.  In 
tests,  the  3D-to-2D  version  reliably  recovers  the  robot’s 
true  position.  The  2D-to-2D  version  does  equally  well 
when  initial  errors  do  not  introduce  perspective  distor¬ 
tion,  and  does  so  in  roughly  one  fifth  the  time.  However, 
it fails on some cases where perspective effects are
pronounced.

1  Introduction 

The  problem  of  matching  3D  models  to  2D  image  fea¬ 
tures  arises  in  many  domains.  Here  we  consider  prob¬ 
lems  associated  with  robot  navigation.  A  robot  moving 
through  known  surroundings  tracks  its  progress  using  vi¬ 
sion,  and  moving  down  a  hallway  it  acquires  images  such 
as  those  shown  in  Figures  1  and  2.  It  must  test  and  up¬ 
date  its  position  estimate  based  upon  the  appearance  of 
known  landmarks.  To  illustrate,  Figures  3A  and  B  show 
a  robot  in  two  possible  positions.  If  the  robot  believes  it 
is  at  position  A,  when  it  is  really  at  position  B,  then  it 
should  correct  this  error  through  landmark  recognition. 

Recognition  involves  identifying  prominent  3D  fea¬ 
tures  in  an  image.  In  this  case,  the  landmark  is  a  simple, 
partial,  3D  wire  frame  model  of  some  distinctive  features 
in  a  hallway.  The  model  can  be  seen  in  Figure  3.  The 
image  features  are  2D  line  segments  extracted  from  the 
images  using  the  Burns  algorithm  [Bur86].  Perspective 
effects are pronounced in this domain, as Figure 4
illustrates. Figure 4 shows the landmark as it appears to the
robot from each of the nine poses shown in Figure 3C.
Pose refers to the robot’s 3D position and orientation
relative to the landmark.

*This work was supported by the Defense Advanced
Research Projects Agency (via TACOM) under contract
DAAE07-91-C-R035 and by the National Science Foundation
under grant CDA-8922572.


Figure 2: Image 2 taken at Position B
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Figure 3: A) Robot in position for Image 1, B) Robot
in position for Image 2, C) Nine positions from which to
solve for the true position. The 10 3D line segments that
make up the landmark are visible in all three examples.

At  issue  is  the  importance  of  accounting  for  3D  per¬ 
spective  while  matching  landmark  features  to  image  fea¬ 
tures.  In  side-by-side  tests  on  a  set  of  example  prob¬ 
lems,  the  performance  of  two  versions  of  an  otherwise 
identical matching algorithm will be compared. The
resultant match found by each system is used to estimate
the  robot’s  true  pose  using  an  algorithm  developed  by 
R.  Kumar  [Kum89,  Kum90]. 

The  two  versions  of  the  matching  algorithm  are 
called  the  2D-to-2D  system  and  the  3D-to-2D  system. 

In the 2D-to-2D system, a restricted 2D affine transformation
- rotation, translation, and scale in the image
plane  -  is  used  to  account  for  differences  between  the 
landmark’s  estimated  and  true  appearance.  This  system 
matches  2D  data  to  a  2D  projection  of  the  model  gen¬ 
erated  from  the  initial  pose  estimate.  We’ve  described 
this  basic  approach  previously  in  [Bev90,  Fen90]. 

In  the  3D-to-2D  system,  the  landmark  model  is  re¬ 
peatedly  reprojected  into  the  2D  image  plane  based  upon 
incrementally  updated  3D  pose  estimates.  This  system 
uses  Kumar’s  iterative  pose  algorithm  [Kum89,  Kum90] 
during  matching. 

The  discrete  search  space  of  possible  feature  map¬ 
pings  is  the  same  for  each  side-by-side  test.  The  size  of 
these  spaces  is  staggering.  To  see  this,  let  M  be  the  set 
of  3D  line  segments  in  the  landmark  model,  let  D  be  the 
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Figure  4:  Nine  views  of  landmarks  as  seen  from  nine  dif¬ 
ferent  positions  around  a  nominal  location.  The  relative 
orientation  of  the  baseboards  and  the  door  frame  change 
as  the  robot’s  position  estimate  shifts  laterally. 

set of 2D data line segments in the image, and let S be
the  set  of  model-data  pairs  which  are  candidate  matches. 

$$S \subseteq M \times D \qquad (1)$$

In  landmark  recognition,  the  initial  pose  estimate  usu¬ 
ally  constrains  the  possible  pairings  between  model  and 
data segments, and S is considerably smaller than the
complete  space  of  pairs,  M  x  D. 

It  is  not  prudent  to  assume  that  the  correct  corre¬ 
spondence  between  model  and  data  features  is  one-to- 
one.  The  process  of  extracting  image  features  can  lead 
to  fragmentation  and  accretion.  We  therefore  consider 
many-to-many  mappings,  and  hence  the  discrete  space 
of  possible  correspondences  C  is  the  powerset  of  S. 

$$C = 2^{S} \qquad (2)$$

Often  S  contains  50,  100  or  more  model-data  pairs. 
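
To give a feel for the size of this space, the short sketch below counts the subsets of a candidate set S of a given size. The function name is hypothetical and the sizes are the ones mentioned in the text.

```python
def correspondence_space_size(num_candidate_pairs):
    """Number of subsets of a candidate set S, i.e. the size of its powerset."""
    return 2 ** num_candidate_pairs

for pairs in (50, 100):
    size = correspondence_space_size(pairs)
    print(pairs, size, len(str(size)))  # size and its number of decimal digits
```

With 50 candidate pairs there are already more than 10^15 possible correspondences, which rules out exhaustive enumeration.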

The  matching  algorithm  which  searches  C  is  a  local 
search  algorithm  specifically  adapted  to  model  match¬ 
ing  [Bev89,  Bev90].  In  general,  a  local  search  algorithm 
moves  from  an  initial  solution,  via  transformations,  to 
one  that  is  locally  optimal  [Ker72,  Lin73,  Pap82].  It 
is  certainly  not  guaranteed  to  find  the  globally  optimal 
match.  However,  if  local  search  is  initiated  from  indepen¬ 
dently  chosen  random  starting  points,  the  probability  of 
missing  the  optimum  can  be  made  arbitrarily  small  by 
sufficiently  increasing  the  number  of  trials.  This  random 


sampling  strategy  is  employed  here  to  find  an  optimal 
match. 
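
The effect of repeated random starts can be quantified with a simple independence argument. In this sketch, the per-trial success probability `p_success` is a hypothetical input (the text does not give one): if a single trial reaches the optimum with probability p, then t independent trials all miss it with probability (1 - p)^t.

```python
import math

def miss_probability(p_success, trials):
    """Probability that every one of `trials` independent trials misses."""
    return (1.0 - p_success) ** trials

def trials_needed(p_success, max_miss):
    """Smallest number of trials whose miss probability is at most max_miss."""
    return math.ceil(math.log(max_miss) / math.log(1.0 - p_success))

# Hypothetical numbers: a 10% chance per trial, at most a 1% chance of failure.
t = trials_needed(0.10, 0.01)
print(t, miss_probability(0.10, t))
```

The required number of trials grows only logarithmically as the tolerated failure probability shrinks, which is what makes the sampling strategy practical.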

Optimality is defined in terms of a match error. The
optimal match, c*, minimizes this error function:

$$E_{match}(c^*) \le E_{match}(c) \quad \forall c \in C \qquad (3)$$

The match error, E_match, is a combination of two terms.

$$E_{match}(c) = E_{fit}(c) + E_{om}(c) \qquad (4)$$

The first, E_fit, is a residual squared error obtained by
first fitting the model to the corresponding data. The
second, E_om, penalizes matches which omit portions of
the model from the match. These two forms of error are
discussed below.

The  experiments  presented  here  test  both  the  im¬ 
portance  and  associated  cost  of  accounting  for  perspec¬ 
tive  during  matching.  In  experiment  1,  both  the  2D- 
to-2D  and  3D-to-2D  approaches  are  used  to  estimate 
the  true  position  of  the  robot,  position  A  shown  in  Fig¬ 
ure  3A,  from  each  of  the  nine  initial  pose  estimates  shown 
in  Figure  3C.  Experiment  2  tests  the  ability  of  each  sys¬ 
tem  to  recover  from  a  confusion  between  position  A  and 
position  B. 

2  Landmark  Matching  and  Local 
Search 

In  our  previous  work  on  landmark-based  navigation, 
[Bev90,  Fen90],  matching  was  completely  separated  from 
3D  pose  recovery  [Kum89].  The  basic  scenario  was: 

1.  Project  the  3D  landmark  model  features  into  the 
image plane using an initial estimate of the robot’s
pose.  Initial  estimates  are  generated  by  a  naviga¬ 
tion/planning  module  [Fen90]. 

2.  Use  a  rough  estimate  of  the  uncertainty  in  the  ini¬ 
tial  pose  to  determine  a  set  of  candidate  model-data 
pairs, S defined in equation 1.

3. Use the 2D-to-2D matching algorithm to find an
optimal correspondence, c*, between model and data
features  [Bev90]. 

4.  Use  the  corresponding  landmark  and  image  features 
in  c*  to  recover  the  3D  pose  of  the  robot  relative 
to  the  landmark.  This  is  done  with  the  iterative 
algorithm  developed  by  Kumar  [Kum89]. 

The  strength  of  this  approach  is  that  it  reduces  a  3D-to- 
2D  matching  problem  to  a  comparatively  simpler  2D-to- 
2D  matching  problem.  The  potential  weakness  is  that 
rotation,  translation  and  scaling  in  the  plane  may  be  in¬ 
sufficient  to  recover  from  an  initial  error  in  robot  position 
that  produces  perspective  distortion  in  the  projected  2D 
model.  In  hallways,  where  the  side  walls  come  toward 
the camera, perspective effects associated with lateral
position  error  can  be  dramatic.  Figure  4  illustrates  the 
effect.  Note  the  change  in  relative  angle  between  the 
baseboards  and  the  doorway. 


For  2D-to-2D  matching,  a  model  is  fit  to  corre¬ 
sponding  data  by  solving  for  the  rotation,  translation 
and  scale  which  minimizes  the  integrated,  squared,  per¬ 
pendicular  distance  between  2D  data  line  segments  and 
infinitely extended 2D model lines. The fit error is a normalized
function of the residual point-to-line squared error after
fitting. The best fit 2D pose has a closed form solution
involving a simple quadratic equation [Bev90].

The  omission  error  is  a  function  of  the  percentage 
of  2D  model  line  segments  not  covered  by  data  line  seg¬ 
ments.  Coverage  is  defined  in  terms  of  the  perpendicular 
projection  of  data  segments  onto  model  lines.  A  point  on 
a  model  line  segment  is  covered  when  a  point  on  a  data 
segment  projects  onto  it.  A  more  detailed  discussion  of 
the  omission  error  appears  in  [Bev90]. 

In  extending  this  approach  to  3D-to-2D  matching, 
the  2D-to-2D  closed  form  pose  computation  is  replaced 
by  an  iterative  3D-to-2D  pose  computation  developed  by 
R.  Kumar  [Kum89,  Kum90].  This  algorithm  solves  for 
the  3D  position  and  orientation  of  the  camera  relative 
to  a  model  by  minimizing  a  sum  of  squared  3D  point  to 
plane  distances.  The  points  are  the  endpoints  of  the  3D 
model  line  segments.  The  planes  are  defined  by  the  two 
endpoints  of  a  data  line  segment  and  the  focal  point  of 
the camera. Thus, for the 3D-to-2D system, the fit error is a
normalized function of the residual point-to-plane squared
error after the best fit 3D-to-2D pose has been determined.

The  omission  error  is  measured  in  a  manner  similar 
to  that  used  by  the  2D-to-2D  system.  However,  it  is 
based  upon  the  projection  of  the  updated  model  after 
the  best  fit  3D-to-2D  pose  is  generated.  The  omission  is 
measured  relative  to  the  new  2D  projections  of  the  3D 
model  segments. 

Two  additional  points  about  the  implementation 
of  the  3D-to-2D  pose  algorithm  are  worth  mentioning. 
First,  for  a  small  percentage  of  cases,  the  iterative  Quasi- 
Newton  method  fails  to  converge  to  the  optimal  3D 
pose.  To  overcome  this  weakness,  we  use  the  Levenberg- 
Marquardt  method  suggested  by  David  Lowe  [Low91]. 
For  a  clear  summary  of  this  method  see  [Pre88],  pages 
542 - 544. Second, to save computation, Kumar’s algorithm
has been reformulated in terms of state variables
associated with each model-data pair, s ∈ S. The sum of
these state variables for all s in a particular correspondence
c determines the pose. This allows the contribution
from a model-data pair s to be added to or subtracted
from the current sum when s is added to or deleted from the
current match. This saves a considerable amount of computation
during matching, since it removes the need to
loop over the complete set of pairs in c.
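
The bookkeeping described above can be sketched as follows. This is only an illustration in Python, not the authors' implementation: the contents of the per-pair state variables are not given in this section, so each pair is assumed to carry a fixed-length vector of placeholder numbers, and the names `RunningPoseState` and `toggle` are hypothetical.

```python
class RunningPoseState:
    """Keeps the sum of per-pair state vectors for the current match."""

    def __init__(self, contributions):
        # contributions: dict mapping a pair (model, data) to a tuple of floats.
        # The real state variables come from the pose algorithm; the numbers
        # used in this sketch are placeholders.
        self.contributions = contributions
        size = len(next(iter(contributions.values())))
        self.total = [0.0] * size
        self.members = set()

    def toggle(self, pair):
        """Add the pair's contribution if absent, subtract it if present."""
        sign = -1.0 if pair in self.members else 1.0
        for i, value in enumerate(self.contributions[pair]):
            self.total[i] += sign * value
        self.members ^= {pair}
        return self.total


state = RunningPoseState({("A", 13): (1.0, 2.0), ("B", 16): (3.0, 4.0)})
state.toggle(("A", 13))  # add (A, 13)
state.toggle(("B", 16))  # add (B, 16)
state.toggle(("A", 13))  # remove (A, 13) again
```

Each toggle touches one state vector only, so its cost does not grow with the number of pairs in c.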

3  Inertial-Descent  Local  Search 

The local search matching algorithm employed is a
variation of the steepest-descent strategy described
in [Bev90]. We call the algorithm inertial-descent; it
is best explained by example. Figures 5 and 6 illustrate
inertial-descent. The matching problem itself is drawn
from experiment 2, in which positions A and B are confused
(Figure 3). The model and data line segments are
shown on the right hand side of Figure 5. The model
line segments are shown projected into the image plane
as they would appear from position A. The data line
segments are derived from the image in Figure 2, acquired
at position B. Model line segments are labeled
with capital letters, data line segments by number.

Figure 5: Search space and examples of local search for experiment 2

Figure 6: Successive 3D-to-2D pose estimates computed during local search
(values printed beneath the panels: 0.66, 0.59, 0.49, 0.41)

Match error   Match Improvements in Descending Order
84.48         ((D, 4)(D, 2)(E, 5)(I, 1)(J, 0)(B, 15)(B, 12) . . .
10.21         ((C, 7)(J, 0)(I, 1)(E, 5)(B, 12)(B, 15)(A, 11))
 2.94         ((D, 2)(E, 5)(E, 8)(A, 11)(B, 12)(B, 15))
 0.77         ((D, 4)(A, 13)(A, 14)(E, 8)(A, 17)(B, 15)(B, 16))
 0.41         (empty)

Table 1: The improvement lists generated by the inertial-
descent algorithm each time it tests adding/removing a
single pair. The highlighted pairs, applied in sequence,
lead to improved matches.

Figure  5  also  shows  a  complete  trace  of  two  indepen¬ 
dent  runs  of  the  local  search  algorithm.  Each  successive 
row  indicates  a  successively  better  match.  A  filled-in 
square  indicates  that  the  associated  model-data  pair  is 
an element of c. Hence, for example, the first row of the
first table indicates a match, $c_{init}$, selected at random.

$$c_{init} = \{(A,13), (A,14), (A,17), (B,16), (C,7), (D,2), (D,4), (E,5), (E,8), (F,2), (I,1), (J,0)\} \qquad (5)$$

The  importance  of  choosing  initial  matches  at  random 
will be discussed shortly. The match error, $E_{match}$, for
each successive match is shown to the left. For the initial
match in example 1, $E_{match}(c_{init}) = 84.48$.

Each  correspondence  c  may  be  represented  by  a  bit 
string. To illustrate, $c_{init}$ from equation 5 may be
represented as:

$c_{init}$ = 00111 00000100 010 11 11 1 1 1    (6)

Spacing  is  maintained  to  clarify  the  relation  between  the 
columns  in  Figure  5  and  the  bits  in  equation  6. 

Toggling a bit in this string adds or removes a feature
pair from the current match. A steepest-descent
matching algorithm [Bev90] computes $E_{match}$ for all n
correspondences which differ from the current c by one
bit, i.e. it tests each single bit toggle. It then applies
the best single toggle to the current match, yielding a
new match c'. After each move, steepest-descent re-evaluates
all n possible toggles. It terminates when no toggle
yields improvement. In example 1 (Figure 5), the pair
yielding the greatest improvement from the initial match
$c_{init}$ is (D, 4).

Inertial-descent,  like  steepest-descent,  begins  by 
testing  all  n  single  bit  toggles.  Unlike  steepest-descent, 


rather  than  just  applying  the  single  toggle  yielding  the 
greatest  improvement,  inertial-descent  builds  a  sorted 
list  of  all  toggles  yielding  improvement.  It  then  works 
down  this  list  sequentially  toggling  successive  pairs  until 
either:  the  list  is  exhausted,  or  a  toggle  is  found  which 
no  longer  produces  an  improvement. 
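
The procedure just described can be sketched in Python as follows. This is a minimal sketch under stated assumptions, not the authors' code: a match is held as a frozenset of pairs rather than a bit string, `match_error` is any caller-supplied function standing in for the match error, and the toy error used in the usage lines (the number of pairs that differ from a made-up target match) is purely illustrative.

```python
def inertial_descent(pairs, match_error, initial_match):
    """Run one local search trial; return (match, error, evaluations)."""
    current = frozenset(initial_match)
    current_error = match_error(current)
    evaluations = 1
    while True:
        # Test all n single-pair toggles against the current match.
        improvements = []
        for pair in pairs:
            error = match_error(current ^ {pair})
            evaluations += 1
            if error < current_error:
                improvements.append((current_error - error, pair))
        if not improvements:
            return current, current_error, evaluations  # local optimum

        # Sort by improvement, largest first, then work down the list.
        # (The first entry is re-evaluated here only to keep the sketch short.)
        improvements.sort(key=lambda item: item[0], reverse=True)
        for _, pair in improvements:
            candidate = current ^ {pair}
            error = match_error(candidate)
            evaluations += 1
            if error >= current_error:
                break  # this toggle no longer helps: retest all toggles
            current, current_error = candidate, error


# Toy usage: the error is the number of pairs that differ from a target match.
pairs = [("A", 1), ("A", 2), ("B", 3), ("C", 4)]
target = frozenset({("A", 1), ("C", 4)})

def toy_error(match):
    return len(match ^ target)

best, best_error, evaluations = inertial_descent(
    pairs, toy_error, [("A", 2), ("B", 3)])
```

In this toy problem every toggle on the first list still helps when its turn comes, so all four are applied without retesting the alternatives in between.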

Table 1 illustrates the improvement lists generated
by  the  inertial-descent  algorithm  for  example  1  (Fig¬ 
ure  5).  The  highlighted  pairs  at  the  head  of  the  lists 
actually  lead  to  improved  matches.  For  the  first  three 
matches,  inertial-descent  didn’t  save  any  computation 
relative to steepest-descent. From the new match obtained
by toggling the first pair on the list, the second
pair  no  longer  improved  the  match,  and  therefore  all 
n  toggles  were  again  tested.  However,  for  the  fourth 
match,  the  list  allowed  the  algorithm  to  apply  four  tog¬ 
gles  in  succession  without  expending  effort  testing  al¬ 
ternatives.  In  general,  inertial-descent  tests  far  fewer 
matches  than  steepest-descent. 

As already mentioned, each time the local search algorithm
tests a new correspondence c, it must compute
$E_{match}(c)$. To emphasize that new 3D poses are generated
for each match tested by the 3D-to-2D system, Figure 6
shows the projection of the landmark from these
updated poses for the successively better matches found
in example 1, Figure 5.

The local search algorithm just illustrated may not
find the globally optimal match c*. However, by running
multiple trials from randomly selected initial matches,
and then taking the best match found in the series of
trials, a local search procedure with a relatively small
probability of finding the globally optimal match on a
single trial may be used to find the global optimum with
very high probability.

Formally, let $P_s$ be the probability of successfully
finding the global optimum on a single trial. The conjunctive
probability of failing to find the global optimum
in $t$ independent trials is $Q_f$:

$$Q_f = P_f^{\,t}, \qquad P_f = 1 - P_s \qquad (7)$$

Therefore, the number of trials required to find the global
optimum with probability $Q_s$, using a local search algorithm
with probability of success $P_s$, is given by the
following equation.

$$t = \lceil \log_{P_f} Q_f \rceil, \qquad Q_f = 1 - Q_s \qquad (8)$$

To illustrate, if $P_s = 0.10$ then 29 trials are required to
find the optimum with probability $Q_s = 0.95$. If $P_s = 0.5$
then only 5 trials are required to reach the same
confidence level.
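
Equation 8 is easy to check numerically. The short Python sketch below computes the required number of trials by a change of logarithm base; the function name `trials_required` is ours and does not appear in the paper.

```python
import math


def trials_required(p_success, confidence=0.95):
    """Smallest t with 1 - (1 - p_success)**t >= confidence (equation 8)."""
    p_fail = 1.0 - p_success    # P_f: chance a single trial misses the optimum
    q_fail = 1.0 - confidence   # Q_f: acceptable chance that all t trials miss
    return math.ceil(math.log(q_fail) / math.log(p_fail))


few_successes = trials_required(0.10)  # the first case worked in the text
half_successes = trials_required(0.5)  # the second case worked in the text
```

The two calls reproduce the 29 and 5 trials quoted above.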

Empirical trials on the landmark recognition problems
presented below are used to estimate the true probability
of success, $P_s$, and hence the number of trials, $t_s$,
required to obtain the optimal match with confidence
95% or better. In general, 100 trials of local search are
used to generate the estimates $\hat{P}_s$ and $\hat{t}_s$.
The estimate $\hat{P}_s$ is subject to one important qualification:
it may be based not upon the true globally
optimal match c*, but instead upon what we’ll call the
optimal match, the best match found in a series of empirical
trials. By visual inspection one can often conclude
that the optimal match and the globally optimal match
are the same. However, this is not always the case.

4  Experiment  1 

The  task  is  to  recover  the  true  position  of  the  robot  from 
each of the nine pose estimates shown in Figure 3C.*
The  true  position,  position  A,  is  41.3  feet  from  the  door¬ 
way  and  4  feet  from  each  of  the  two  side  walls.  The  nine 
test  positions  were  obtained  by  translating  the  true  po¬ 
sition  estimate  forward  and  backward  and  side-to-side. 
Estimates  1  —  3  are  5.3  feet  forward  of  position  A.  Esti¬ 
mates  4  —  6  are  1.3  feet  forward  of  A,  and  estimates  7  —  9 
are  3.7  feet  back  of  A.  Estimates  1,  4  and  7  are  2  feet  to 
the  left  of  A.  Estimates  3,  6  and  9  are  2  feet  to  the  right 
of A. The landmark model as it appears from these nine
positions  has  already  been  shown  in  Figure  4. 

In  addition  to  varying  the  initial  pose  estimates, 
this  experiment  considers  both  ‘directed’  and  ‘undi¬ 
rected’  model  segments.  A  directed  segment  specifies 
the  sign  of  the  intensity  gradient  across  the  edge.  By 
experimenting  with  search  spaces  generated  both  with 
and  without  directed  segments  we  show  that  using  di¬ 
rected  segments  saves  computation.  However,  at  least 
for  the  3D-to-2D  system,  the  quality  of  the  final  match 
and  the  associated  updated  pose  estimate  is  the  same  in 
each  case. 

The  final  pose  estimates  generated  by  the  3D-to-2D 
system  are  within  0.1  feet  of  the  true  pose  for  all  nine 
initial  estimates  and  for  both  the  directed  and  undirected 
search  spaces.  Essentially  the  same  match  is  found  in  all 
cases, which explains the consistently good final pose
estimate.  The  full  3D  perspective  approach  appears  to 
be  very  robust. 

Figure  8  shows  the  model  and  data  line  segments. 
The  model  is  projected  from  the  true  position.  The  op¬ 
timal  match,  c*,  found  by  the  3D-to-2D  system  for  the 
directed  search  space  is: 

$$c^* = \{(A,32), (A,33), (B,30), (C,15), (D,12), (E,14), (F,11), (G,10), (H,9), (I,7), (J,0)\} \qquad (9)$$

The  final  pose  estimates  for  the  2D-to-2D  system 
are presented in Figure 7. For cases 2, 5 and 8, for both
the  directed  and  undirected  search  spaces,  the  2D-to-2D 
system  recovers  the  robot’s  true  position  essentially  as 
well  as  the  3D-to-2D  system.  This  is  to  be  expected, 
since  forward  and  backward  error  primarily  changes  the 
expected  scale  of  the  landmark  model.  However,  for  the 
other  cases  the  recovered  pose  estimates  are  not  as  good. 

*In  this  experiment  we  only  consider  the  position  portion 
of  the  pose  estimate  associated  with  a  match. 
Figure 7: Bar chart of position estimates recovered with
2D-to-2D  matching.  Position  is  shown  along  each  of  the 
three  dimensions  for  the  best  matches.  Results  for  the 
directed  and  the  undirected  search  spaces  are  shown. 


For  the  directed  search  space  the  final  position  is  always 
better  than  the  initial  estimate.  However,  for  the  undi¬ 
rected  search  space  this  is  not  always  the  case.  In  par¬ 
ticular,  for  cases  3,  6  and  9  the  recovered  Y  position  is 
worse  than  the  initial  estimate. 

The first step in generating these results was to determine
the set of candidate pairs $S$. The selection of
candidate pairs depends upon the placement of the 2D
projection of a model line m in the image. For the case
of directed model segments, a pair $s = (m, d) \in M \times D$
is an element of $S_d$ if:

1.  d is within 30 degrees of m.

2.  d is within 128 pixels of m.

3.  d is at least 1/4 the length of m.

4.  d and m have the same sign of contrast.

For the undirected case, $S_u$, the sign of contrast test is
omitted. These bounds are picked based on experience
with the domain. In particular, 128 pixels is one quarter
the distance across the full 512 by 512 image and is
adequate to ensure the correct match is contained in the
resultant search space. The number of candidate pairs
in $S_d$ and $S_u$ for each of the nine pose estimates is
summarized in Table 2.
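
The four tests can be written down compactly. The Python sketch below is one possible reading, not the authors' implementation: the text does not say how the angle and distance between a data segment and a model line are measured, so the sketch assumes orientation is compared modulo 180 degrees and distance is taken between segment midpoints. The `Segment` class and `is_candidate` function are hypothetical names.

```python
import math
from dataclasses import dataclass


@dataclass(frozen=True)
class Segment:
    x1: float
    y1: float
    x2: float
    y2: float
    contrast: int = 1  # sign of the intensity gradient across the edge

    @property
    def length(self):
        return math.hypot(self.x2 - self.x1, self.y2 - self.y1)

    @property
    def angle(self):
        # Orientation in degrees, folded so opposite directions coincide.
        dy, dx = self.y2 - self.y1, self.x2 - self.x1
        return math.degrees(math.atan2(dy, dx)) % 180.0

    @property
    def midpoint(self):
        return ((self.x1 + self.x2) / 2.0, (self.y1 + self.y2) / 2.0)


def is_candidate(m, d, directed=True,
                 max_angle=30.0, max_dist=128.0, min_ratio=0.25):
    """Apply the angle, distance, length and (optionally) contrast tests."""
    diff = abs(m.angle - d.angle)
    diff = min(diff, 180.0 - diff)            # assumed: orientation mod 180
    (mx, my), (dx, dy) = m.midpoint, d.midpoint
    dist = math.hypot(mx - dx, my - dy)       # assumed: midpoint distance
    ok = (diff <= max_angle and dist <= max_dist
          and d.length >= min_ratio * m.length)
    if directed:
        ok = ok and m.contrast == d.contrast  # test 4, omitted when undirected
    return ok


model = Segment(0, 0, 100, 0, contrast=1)
data_same = Segment(10, 20, 60, 30, contrast=1)
data_flipped = Segment(10, 20, 60, 30, contrast=-1)
```

With these made-up segments, `data_flipped` is rejected in the directed case and accepted in the undirected one, which is why the undirected search spaces are larger.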

We  tested  two  alternative  ways  of  generating  the 
initial  random  correspondences  used  as  starting  points 
by  the  local  search  algorithm.  In  one  case,  an  initial  cor- 
Figure 8: Labeled landmark and data line segments for Image 1


Initial Pose Estimate     1     2     3     4     5     6     7     8     9
|S_d|                    41    36    42    45    37    42    53    43    54
|S_u|                    87    75    89    94    77    94   112    92   112

Table 2: The number of candidate pairs for each of the
nine initial pose estimates using directed, $S_d$, and
undirected, $S_u$, model segments.

respondence was selected uniformly from the complete
search space $C$. In the other, we biased the selection
to favor choices of $c_{init}$ with roughly 2 data segments
bound to each model segment.

This is done by defining a binding probability, $P_b(s)$,
such that a pair $s = (m, d)$ is included in an initial
correspondence $c_{init}$ with probability $P_b(s)$. Choosing
$P_b = 0.5$ for all pairs yields the uniform sampling
mentioned above. Defining $P_b(s)$ as follows biases selection.
Let $k_m$ be the number of pairs in $S$ which include
model feature $m$, and then define

$$P_b(s) = \min(0.5,\; 2/k_m) \qquad (10)$$

The maximum of 0.5 is desirable because it randomizes
cases where there are only 1, 2 or 3 candidate pairs for
a model segment; without the cap, $2/k_m$ would be at least
2/3 for these segments. For $k_m > 4$ the expected number of
data segments bound to $m$ is 2.
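
A small Python sketch of this biased sampling follows. It reads the "maximum of 0.5" as a cap on the binding probability, that is, the smaller of 0.5 and $2/k_m$; this reading is our interpretation of equation 10, and the function names and example pairs are hypothetical.

```python
import random
from collections import Counter


def binding_probabilities(pairs, cap=0.5, per_model=2.0):
    """Binding probability for each pair: per_model / k_m, capped at `cap`."""
    counts = Counter(model for model, _ in pairs)  # k_m for each model feature
    return {pair: min(cap, per_model / counts[pair[0]]) for pair in pairs}


def random_initial_match(pairs, probabilities, rng):
    """Include each pair independently with its binding probability."""
    return frozenset(p for p in pairs if rng.random() < probabilities[p])


# Made-up search space: model A has 8 candidate pairs, model B has 2.
pairs = [("A", d) for d in range(8)] + [("B", 20), ("B", 21)]
probabilities = binding_probabilities(pairs)
start = random_initial_match(pairs, probabilities, random.Random(0))
```

Here each pair of model A is bound with probability 0.25, so on average 2 of its 8 candidates enter the initial match, while the pairs of model B keep the uniform probability 0.5.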

Results comparing uniform random versus biased
random  selection  are  presented  in  Figure  9.  These  results 
are  for  the  directed  search  space.  Both  systems  were 
run  100  times  for  each  of  the  nine  position  estimates. 
Figure  9  shows  the  number  of  times  the  optimal  match 
was  found  in  each  case.  These  outcomes  suggest  the  true 
probability of success, $P_s$, is higher for biased random
selection,  and  that  biased  selection  is  superior  to  uniform 


selection.  The  biased  selection  defined  by  equation  10  is 
therefore  used  henceforth. 
Figure 9: Trials out of 100 yielding the optimal match for
uniform versus biased selection of initial feature bindings.
Figure 10: The number of trials required to find the
optimal  match  with  95%  confidence 

The true probability of success, $P_s$, is a parameter
of a binomial (success/failure) process. The maximum
likelihood estimate, $\hat{P}_s$, is just the ratio of the number
Figure 11: Timing information in seconds for a TI Explorer II
Lisp Machine. For both the directed and undirected
search spaces the 2D-to-2D system runs in roughly
1/5 the time required by the 3D-to-2D system.


of times the optimal match is found over the total number
of trials. Plugging $\hat{P}_s$ into equation 8 yields the
estimated number of trials, $\hat{t}_s$, required to find the optimal
match with confidence 95% or better. The graph in
Figure 10 compares $\hat{t}_s$ for the 3D-to-2D and 2D-to-2D
systems over all nine initial pose estimates. These results
are for the directed search space. Observe that $\hat{t}_s$
for the 2D-to-2D and 3D-to-2D systems are more similar
for cases 2, 5 and 8. Also note the 3D-to-2D system
generally requires fewer trials.

The  expected  amount  of  time  required  to  find  an 
optimal  match  is  the  expected  amount  of  time  to  run  a 
single  trial  times  the  expected  number  of  required  trials. 
Timing information for the 2D-to-2D and 3D-to-2D systems
is presented in Figure 11. All times are reported in
seconds and are for a TI Explorer II Lisp Machine. Two
numbers are given for each case: the first is the number of
seconds required to run $\hat{t}_s$ trials. The second number is a
more realistic estimate, based upon the time required to
run a conservative number of trials, $t_c$. A conservative
number  is  selected  which  essentially  guarantees  finding 
the  optimal  match  with  confidence  95%  across  every  one 
of  the  set  of  problems. 

For the directed search space and the 2D-to-2D system,
$t_c = 50$ is chosen. Since $\hat{P}_s$ is consistently higher
for the 3D-to-2D system, a value of $t_c = 25$ is chosen
for this system. For the undirected space the 3D-to-2D
system finds the optimal match less often, and $t_c = 50$
is necessary.

For  the  undirected  space  and  the  2D-to-2D  system 
the  definition  of  success  is  extended  to  include  the  best 


three matches found. This is necessary because for all
but cases 2, 5 and 8 the probability of finding the single
uniquely best match becomes very low. Making this
change, and running 300 trials, $\hat{P}_s > 0.02$. Under these
conditions, $\hat{t}_s = 150$. To be conservative, the worst pose
generated by the top three matches was reported in Figure 7.


5  Experiment  2 

Consider confusing position A with position B (Figure 3).
For this problem, Table 3 shows the true positions,
initial pose estimates, and recovered pose estimates.
The 3D-to-2D system recovers the true pose to
within 1 foot in both cases. However, in recovering position
A from an initial estimate of position B, Est-B-True-A,
the 3D-to-2D system found the best match only once
in 300 trials. The next best match was almost equally
good, and was found in five out of 300 trials. Success
in this case is redefined as finding one of the two best
matches, giving $\hat{P}_s = 0.02$. The recovered pose for the two best
3D-to-2D  matches  is  shown  for  this  case.  The  2D-to-2D 
system  did  less  well,  improving  the  initial  estimate  in 
each  case,  but  still  missing  the  true  position  by  several 
feet. 


Table 3: Pose results when positions A and B are confused.
The 3D-to-2D system recovers pose well, but in
the Est-B-True-A case two almost equally good matches
are found, noted as 3D and 3D'. The 2D-to-2D system
does less well.

The estimated probability of success, $\hat{P}_s$, and required
number of trials, $\hat{t}_s$, were determined for both systems
on both problems. Based upon $\hat{t}_s$, a conservative
number of trials $t_c$ was selected, and finally the expected
time required to run $t_c$ trials was determined. These results
are presented in Table 4. As before, times were measured
on a TI Explorer II Lisp Machine.


            Est-A-True-B            Est-B-True-A
            2D-to-2D    3D-to-2D    2D-to-2D    3D-to-2D
P_s         0.11        0.22        0.08        0.02
t_s         26          13          36          150
t_c         50          150         50          150
seconds     10          585         35          1,710

Table 4: The estimated number of required trials, $\hat{t}_s$, for
Experiment 2. Also the conservative number of trials, $t_c$,
and the time required to run this many trials.

6  Conclusion 

These  experiments  provide  insight  into  the  importance  of 
perspective.  We’ve  compared  2D-to-2D  matching  with 
3D-to-2D  matching  under  conditions  where  the  2D-to- 
2D approach might be expected to fail. The results suggest
that the 2D-to-2D approach is more useful and reliable
than one might at first expect. The results also
suggest  that  the  additional  cost  of  doing  full  3D-to-2D 
matching  is  not  prohibitive,  and  a  prudent  system  might 
choose  to  always  employ  3D-to-2D  matching. 

Hybrid  algorithms,  which  blend  2D-to-2D  and  3D- 
to-2D  matching,  present  intriguing  possibilities  for  the 
future.  It  is  worth  noting  that  the  2D-to-2D  system  usu¬ 
ally  improved  errorful  pose  estimates.  Hybrid  algorithms 
might  well  recover  3D-to-2D  matches  as  reliably  as  the 
full  3D-to-2D  system  used  here,  but  with  run  times  closer 
to  those  shown  for  the  2D-to-2D  system.  One  promis¬ 
ing  hybrid  might  use  2D-to-2D  matching  initially  and 
switch  to  3D-to-2D  matching  only  after  a  2D-to-2D  op¬ 
timal  match  has  been  found.  Another  variation  might 
use  the  comparatively  cheaper  2D-to-2D  test  to  deter¬ 
mine  a  next  candidate  move,  double  checking  the  move 
by  computing  a  new  3D  pose  and  reprojecting  the  model. 

References 

[Bur86]  J. B. Burns, A. R. Hanson, and E. M. Riseman.
Extracting straight lines. IEEE Trans.
on Pattern Analysis and Machine Intelligence,
PAMI-8(4):425 - 456, July 1986.

[Bev89]  J. Ross Beveridge, Rich Weiss, and Edward M.
Riseman. Optimization of 2-dimensional
model matching. In Proceedings: Image Understanding
Workshop, pages 815 - 830, Los
Altos, CA, June 1989. DARPA, Morgan Kaufmann
Publishers, Inc. (Also a Tech. Report).

[Bev90]  J. Ross Beveridge, Rich Weiss, and Edward M.
Riseman. Combinatorial optimization applied
to variable scale 2D model matching. In Proceedings
of the IEEE International Conference
on Pattern Recognition 1990, Atlantic City,
pages 18 - 23. IEEE, June 1990.

[Fen90]  Claude Fennema, Allen Hanson, Edward Riseman,
J. R. Beveridge, and R. Kumar. Model-directed
mobile robot navigation. IEEE Trans.
on Syst., Man, Cybern., 20(6):1352 - 1369,
November/December 1990.

[Kum89]  Rakesh  Kumar  and  Allen  Hanson.  Robust  es¬ 
timation  of  camera  location  and  orientation 
from  noisy  data  having  outliers.  In  Proc. 
of  IEEE  Workshop  on  Interpretation  of  3D 
Scenes,  pages  52  -  60,  Austin,  TX,  1989.  IEEE. 

[Kum90]  Rakesh Kumar and Allen Hanson. Analysis of
different  robust  methods  for  pose  refinement. 
In  Proc.  of  IEEE  Workshop  on  Robust  Methods 
in  Computer  Vision,  pages  161  -  182,  Seattle, 
WA,  1990.  IEEE. 

[Ker72]  B.  W.  Kernighan  and  S.  Lin.  An  efficient 
heuristic  procedure  for  partitioning  graphs. 
Bell  Systems  Tech.  Journal,  49:291  -  307, 1972. 

[Lin73]  S. Lin and B. Kernighan. An effective heuristic
algorithm  for  the  traveling  salesman  problem. 
Operations  Research,  21:498  -  516,  1973. 

[Low91]  David G. Lowe. Fitting parameterized three-
dimensional models to images. IEEE Trans.
on Pattern Analysis and Machine Intelligence,
13(5):441 - 450, May 1991.

[Pre88]  William  H.  Press,  Brian  P.  Flannery,  Saul  A. 

Teukolsky,  and  William  T.  Vetterling.  Numer¬ 
ical  Recipes  in  C.  Cambridge  University  Press, 
Cambridge,  1988. 

[Pap82]  Christos H. Papadimitriou and Kenneth Steiglitz.
Combinatorial Optimization: Algorithms
and Complexity, chapter Local Search, pages
454 - 480. Prentice-Hall, Englewood Cliffs,
NJ, 1982.


Recognition  of  3-D  Objects  from  2-D  Groupings  * 

Fridtjof  Stein  and  Gerard  Medioni 
Institute  for  Robotics  and  Intelligent  Systems 
Powell  Hall  204 

University  of  Southern  California 
Los  Angeles,  California  90089-0273 

Email:  stein@iris.usc.edu 


Abstract:  We  propose  an  approach  for  the  recognition  of 
three-dimensional objects in a two-dimensional scene. As
models,  we  use  a  set  of  non-registered  views  of  a  three- 
dimensional  object.  By  using  perceptual  organization  we  de¬ 
velop  a  hierarchy  of  features  based  on  proximity,  symmetry, 
parallelism, and closure. The detection of these features is
performed in an efficient way using proximity indexing. While
most  other  systems  use  spatial  relations  for  the  recognition 
process,  we  use  high  level  features  and  their  topology.  Using 
indexing,  we  retrieve  matching  hypotheses,  which  are  verified 
against  each  other  with  respect  to  topological  constraints. 
Groups  of  consistent  hypotheses  represent  detected  model 
instances  in  a  scene.  We  present  results  with  our  current 
system  and  discuss  further  extensions. 

1  Introduction 

Most  object  recognition  systems  today  address  the  prob¬ 
lem  of  finding  the  location  and  orientation  of  an  exactly 
known  rigid  object  in  a  scene.  The  tools  used  to  achieve 
this task are geometric constraints, and a lucid treatment
of this class of approaches can be found in Grimson’s
book [7]. The presence of a model is inferred by
the  verification  that  such  a  model  could  indeed  produce 
some  of  the  observed  data  under  an  appropriate  geomet¬ 
ric  transform.  It  is  clear  that,  with  this  approach,  one 
can  use  local,  low  level  primitives  such  as  edgels  or  their 
approximations, as produced by state of the art edge
finders.  Such  an  approach  is  therefore  very  appropri¬ 
ate  when  evolving  in  a  controlled  environment,  such  as 
a  factory,  where  the  number  of  possible  objects  is  small 
and  their  geometry  is  precisely  known,  but  cannot  be 
extended  to  more  general  scenarios  for  the  following  two 
reasons: 

•  It  is  quite  difficult  to  build  accurate  geometric  mod¬ 
els  (unless  they  are  designed  that  way). 

•  When  the  object  library  is  large,  it  becomes  neces¬ 
sary  to  develop  methods  for  indexing  into  the  library 
to  select  likely  objects. 

One  possible  way  to  circumvent  these  problems  is  to 
use  projective  invariants  [6],  but  these  must  be  hand 
selected. 

We propose instead that the solution to these problems
lies in groupings of the initial primitives. Such groupings
serve as an intermediate level representation of the
data,  in  a  hierarchical  fashion,  and  can  be  used  to  re¬ 
trieve  likely  candidate  objects  from  a  library.  Further¬ 
more,  these  can  be  extracted  from  multiple  views  of  an 
award No. IRI-9024369.


object,  rather  than  from  a  complete  model  (as  such  they 
are  “quasi-invariants”  [1,  14]).  This  idea  of  “perceptual 
grouping” is hardly new. It can be found in the psychology
literature of the 1920s, under the Gestalt name.
As an explanation of how the perception of individual
objects is formed, the Gestalt theory proposes an organization
of parts of an image into wholes, based on laws
of grouping. Elements in the image are grouped based
on proximity, similarity, closure, symmetry, and continuation.
These groups themselves can be used as elements
and grouped with the same laws. Unfortunately, the
implementation of such observations is difficult, as these
laws conflict, even for simple stimuli. In computer vision
these ideas can be found in [13] and [24].

The issues we have to tackle in order to generate
groupings relate to the choice of such groupings, their
discrimination power, their robustness to viewpoint and
noise, and their efficient computation.

Here,  we  address  these  issues  and  develop  a  feature 
hierarchy  which  can  be  used  for  object  recognition  of 
three-dimensional  objects  from  a  two-dimensional  scene. 
In  this  hierarchy,  we  propose  specific  groupings  based  on 
proximity,  parallelism,  symmetry,  and  closure.  The  de¬ 
tection  of  these  features  is  performed  in  an  efficient  way 
using  proximity  indexing.  First,  we  generate  features 
with  multiple  representations  to  overcome  the  unrelia¬ 
bility  of  local  algorithms  during  preprocessing,  and  to 
handle  noise  and  capture  different  levels  of  detail.  Later, 
we merge perceptually similar features at higher levels of
the  feature  hierarchy.  As  models,  we  use  a  set  of  non- 
registered  views  of  a  3-D  object.  While  most  other  sys¬ 
tems  use  spatial  correspondences  to  verify  matching  hy¬ 
potheses,  we  use  high  level  features  and  their  topological 
relationships  for  the  recognition  process.  These  features 
are  grouped  based  on  closure  and  proximity  to  generate 
so  called  high  level  groupings  which  are  stored  in  a  table. 
Using indexing, we retrieve matching hypotheses, which
are  verified  against  each  other  with  respect  to  topological 
constraints.  Groups  of  consistent  hypotheses  represent 
detected  model  instances  in  a  scene. 

Our  proposed  groupings  are  not  guaranteed  to  pro¬ 
duce  “natural”  primitives,  in  the  sense  of  the  axis  of 
a  generalized  cone  for  instance,  so  we  need  to  validate 
our  choice.  This  is  achieved  by  demonstrating  a  recogni¬ 
tion  system  which,  even  in  its  early  implementation,  can 
recognize  an  object,  known  from  a  significantly  different 
viewpoint,  and  using  topologic  constraints  as  opposed  to 
geometric  constraints. 

Our  paper  is  organized  as  follows:  we  start  with  a 
brief  review  of  previous  work  in  Section  2.  Section  3 
describes  the  feature  hierarchy,  discussing  the  detection 
algorithms  which  we  developed  to  derive  the  different 
features  from  the  intensity  edges  of  an  image.  We  then 
introduce Proximity Indexing in Section 4. In Section 5
we discuss the motivation for using high level groupings
and outline an object recognition system, which is in its
early stage, and we show some promising preliminary
results.

2  Previous  Work 

As  most  of  the  previous  work  is  concerned  with  pose 
estimation  rather  than  object  recognition,  we  simply  re¬ 
fer  the  reader  to  [7]  for  an  excellent  overview.  Other 
approaches which focus on groups of low level primitives
combined with an indexing scheme for the purpose
of two-dimensional object recognition can be found in
[2, 4, 12, 10]. The method described in [22] is concerned
with three-dimensional recognition. Groupings have been
used explicitly for recognition by the following
authors:

Lowe  [13]  gives  an  excellent  introduction  to  the  ideas 
behind  perceptual  grouping,  significance  of  features,  and 
an implementation of a system called SCERPO which incorporates
some of his ideas. SCERPO is a recognition
system which is able to estimate the pose of a
three-dimensional CAD model in a two-dimensional scene
based on line segments.

Mohan and Nevatia [15] perform segmentation of images
based on perceptual clues. They use the resulting
segmentation for the detection of buildings in aerial images
and for solving the stereo correspondence problem.
Their groupings are strongly natural in the sense that
the segmentation corresponds to a physical interpretation.
Rao and Nevatia [17] also place a special focus on natural
descriptions of objects.

Huttenlocher  and  Wayner  [9]  describe  an  algorithm  to 
find  the  largest  convex  groupings  of  line  segments.  The 
advantage  of  such  an  approach  is  that  the  number  of 
groups  generated  is  smaller  than  the  number  of  original 
primitives.  The  role  of  such  groupings  in  recognition  has 
yet  to  be  demonstrated,  however. 

A different approach to perceptual grouping is taken
by  Sha’ashua  and  Ullman  [20].  They  focus  on  the 
saliency  of  structure  in  an  image.  They  present  a 
saliency  measure  based  on  curvature  and  curvature  vari¬ 
ation,  and  they  show  examples  of  cluttered  scenes  where 
their  results  correspond  to  the  human  focusing  mecha¬ 
nism. 

3  Going  from  Edgels  to  Groupings 

We  now  explain  the  steps  involved  in  going  from  an  im¬ 
age  to  a  high  level  representation  of  it  in  terms  of  “per¬ 
ceptual”  groups.  This  chain  of  processing  is  sketched  in 
Figure  1.  We  start  with  an  image.  In  the  preprocessing 
stage  we  reduce  the  amount  of  data:  Starting  from  im¬ 
ages  we  compute  curves,  which  consist  of  linked  edgels. 

In the local grouping stage we generate many line segments
based on multiple linear approximations with different
fitting tolerances. For each approximation tolerance,
we perform a vertex collapse and compute super
segments and parallels with exhaustive grouping. By
creating a large set of features at this point we gain robustness
in our further groupings, and we significantly
reduce the unreliability of the preprocessing.

The  perceptual  grouping  stage  no  longer  distinguishes 
between  features  of  different  fitting  tolerances.  The  re¬ 
duction  of  data  is  based  on  two  strategies: 

1. merging perceptually similar features from the local
grouping stage, and

2.  grouping  features  into  higher  level  features  using 
geometric  relationships  such  as  symmetry,  closure, 
and  proximity. 


To  deal  with  all  the  results  from  the  previous  stage,  we 
use  a  coarse  approximation  in  the  high  level  grouping, 
namely  the  convex  hull.  With  the  convex  approxima¬ 
tions  we  build  a  topological  adjacency  graph  based  on 
common  segments  or  vertices.  High  level  features  consist 
of  groupings  of  closure  and  proximity  in  this  adjacency 
graph. 

3.1  Preprocessing 

We  first  apply  an  edge  detection  algorithm  on  the  im¬ 
age  (we  use  the  Canny  edge  detector  [3]).  The  resulting 
edgels  are  further  linked  into  curves.  Curves  are  then 
approximated  with  a  line  fitting  algorithm  to  compute 
the  polygonal  approximations.  Instead  of  just  using  one 
representation  we  approximate  each  curve  with  a  set  of 
line  fitting  tolerances  to  get  a  robust  representation.  For 
every  approximation,  we  then  collapse  vertices,  so  that 
vertices  which  are  close  together  (typically  three  to  five 
pixels)  are  collapsed  into  one  vertex. 
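
The preprocessing chain can be illustrated with a minimal sketch. It assumes the Ramer-Douglas-Peucker algorithm as the line fitting method (the text does not name one) and collapses only consecutive vertices of a single polyline; the function names `fit_polyline`, `collapse_vertices` and `multi_tolerance_approximations` are hypothetical.

```python
import math


def point_line_distance(p, a, b):
    """Perpendicular distance from point p to the line through a and b."""
    (px, py), (ax, ay), (bx, by) = p, a, b
    dx, dy = bx - ax, by - ay
    norm = math.hypot(dx, dy)
    if norm == 0.0:
        return math.hypot(px - ax, py - ay)
    return abs(dx * (ay - py) - dy * (ax - px)) / norm


def fit_polyline(curve, tolerance):
    """Ramer-Douglas-Peucker approximation (assumed fitting method)."""
    if len(curve) < 3:
        return list(curve)
    # Find the point farthest from the chord joining the two end points.
    index, worst = 0, 0.0
    for i in range(1, len(curve) - 1):
        d = point_line_distance(curve[i], curve[0], curve[-1])
        if d > worst:
            index, worst = i, d
    if worst <= tolerance:
        return [curve[0], curve[-1]]
    left = fit_polyline(curve[: index + 1], tolerance)
    right = fit_polyline(curve[index:], tolerance)
    return left[:-1] + right


def collapse_vertices(polyline, radius):
    """Merge runs of consecutive vertices closer than `radius` into their mean."""
    def mean(points):
        n = len(points)
        return (sum(x for x, _ in points) / n, sum(y for _, y in points) / n)

    collapsed, cluster = [], [polyline[0]]
    for p in polyline[1:]:
        last = cluster[-1]
        if math.hypot(p[0] - last[0], p[1] - last[1]) <= radius:
            cluster.append(p)
        else:
            collapsed.append(mean(cluster))
            cluster = [p]
    collapsed.append(mean(cluster))
    return collapsed


def multi_tolerance_approximations(curve, tolerances, collapse_radius):
    """One collapsed polygonal approximation per fitting tolerance."""
    return {tol: collapse_vertices(fit_polyline(curve, tol), collapse_radius)
            for tol in tolerances}
```

Keeping one approximation per tolerance, rather than choosing a single best one, mirrors the idea of carrying multiple representations forward to the grouping stages.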

3.2  Super  Segments 

Idea  Since  we  want  to  handle  occlusion,  we  do  not  ex¬ 
pect  to  obtain  complete  boundaries  in  our  images,  but 
only  portions  of  them.  On  the  other  hand,  individual 
segments  are  too  local  to  be  useful  as  matching  prim¬ 
itives.  Grouping  a  fixed  number  of  adjacent  segments 
provides  us  with  one  of  our  basic  features,  the  super  seg¬ 
ments. 

Implementation  The  computation  of  super  segments 
is the same as described in [21]. Connected linear segments
form chains of adjacent segments. The segment
chains provide the super segments by grouping a fixed
number  of  adjacent  segments.  (Because  of  the  branch¬ 
ing  of  the  polygonal  approximations  we  generate  super 
segments  exhaustively  by  generating  them  from  all  possi¬ 
ble  segment  combinations.)  We  typically  generate  super 
segments  of  cardinality  three  to  six. 
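
As a rough sketch of this step, the following enumerates super segments along a single unbranched chain of vertices. Branching of the polygonal approximations is ignored here, and the name `super_segments` and the vertex-list representation are assumptions made for illustration.

```python
def super_segments(chain, cardinalities=(3, 4, 5, 6)):
    """Enumerate super segments along one chain of vertices.

    `chain` is a list of (x, y) vertices; segment i joins chain[i] and
    chain[i + 1]. A super segment of cardinality k is a run of k adjacent
    segments, returned here as the tuple of its k + 1 vertices.
    """
    result = []
    n_segments = len(chain) - 1
    for k in cardinalities:
        for start in range(n_segments - k + 1):
            result.append(tuple(chain[start:start + k + 1]))
    return result
```

A chain with four segments, for example, yields two super segments of cardinality three and one of cardinality four.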

3.3  Parallels 

Idea A parallel consists of two linear segments $s_1$
and $s_2$. We require that the two segments overlap,
which means that the normal projection of $s_1$
on $s_2$, or the normal projection of $s_2$ on $s_1$, is not
empty. Furthermore we do not want the aspect ratio
of the parallel to reflect elongation, with
$ar(p) = (\mathrm{length}(s_1) + \mathrm{length}(s_2)) / \mathrm{distance}(s_1, s_2)$.
Typically we require $ar(p) > 0.5$.

Implementation  The  acquisition  of  the  parallels  is 
performed  in  two  steps  by  using  proximity  indexing  (for 
details,  see  Section  4). 

1. All segments are recorded using the quantized orientation
as the key. We use $\delta$ as quantization (typically
$\delta = 20^\circ$). Using proximity indexing we are
guaranteed to find parallels which are at most $\delta/2$
apart and we get some parallels with an enclosed
angle between $\delta/2$ and $\delta$.

2. For every linear segment, the possible candidate
parallels are retrieved and verified with respect to
aspect ratio and overlap. Segment pairs which meet
these constraints generate parallels.
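
The verification in step 2 can be sketched as follows. The text does not define distance(s1, s2), so this sketch assumes the distance from the midpoint of one segment to the line through the other; the direct orientation comparison stands in for the indexed candidate retrieval, and all function names are hypothetical.

```python
import math


def length(s):
    (x1, y1), (x2, y2) = s
    return math.hypot(x2 - x1, y2 - y1)


def orientation(s):
    """Undirected orientation of a segment in [0, 180) degrees."""
    (x1, y1), (x2, y2) = s
    return math.degrees(math.atan2(y2 - y1, x2 - x1)) % 180.0


def projection_overlaps(s, t):
    """True if the normal projection of s onto the line of t hits segment t."""
    (ax, ay), (bx, by) = t
    dx, dy = bx - ax, by - ay
    denom = dx * dx + dy * dy
    if denom == 0.0:
        return False  # degenerate target segment
    # Positions of the projected end points of s along t (0..1 spans t).
    params = [((px - ax) * dx + (py - ay) * dy) / denom for px, py in s]
    return max(params) > 0.0 and min(params) < 1.0


def segment_distance(s, t):
    """Assumed distance: midpoint of s to the infinite line through t."""
    mx, my = (s[0][0] + s[1][0]) / 2.0, (s[0][1] + s[1][1]) / 2.0
    (ax, ay), (bx, by) = t
    dx, dy = bx - ax, by - ay
    return abs(dx * (ay - my) - dy * (ax - mx)) / math.hypot(dx, dy)


def is_parallel(s, t, max_angle=10.0, min_aspect=0.5):
    """Check orientation, overlap and aspect ratio of a candidate pair."""
    diff = abs(orientation(s) - orientation(t))
    diff = min(diff, 180.0 - diff)
    if diff > max_angle:
        return False
    if not (projection_overlaps(s, t) or projection_overlaps(t, s)):
        return False
    d = segment_distance(s, t)
    if d == 0.0:
        return False  # collinear segments do not enclose a parallel
    return (length(s) + length(t)) / d > min_aspect
```

The default `max_angle` of 10 degrees corresponds to half of the typical 20 degree quantization, the range in which detection is guaranteed.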

3.4  Symmetries 

A  symmetry  is  defined  as  a  one-to-one  mapping  between 
the  points  of  two  curves,  with  the  symmetry  axis  de¬ 
fined  as  the  locus  of  the  mid-point  of  the  straight  lines 
joining  a  point  on  one  curve  to  its  image  in  the  other 
[15].  Ulupinar  and  Nevatia  have  proposed  two  specific 
symmetries,  namely  skewed  and  parallel  symmetries  [23]. 
Whereas  they  use  them  to  infer  surface  orientation,  we 
consider  them  as  a  general  feature  of  an  object. 

3.4.1  Parallel  Symmetries 

Idea Given two curves $X_i(s) = (x_i(s), y_i(s))$, for
$i = 1, 2$, parametrized by arc length $s$, and
$\theta_i(s) = \arctan((dy_i(s)/ds)/(dx_i(s)/ds))$. $X_1(s)$ and $X_2(s)$ are
said to be parallel symmetric [23] if there exists a
point-wise correspondence $f(s)$ between them such that
$\theta_1(s) = \theta_2(f(s))$ for all values of $s$ for which $X_1$ and $X_2$
are defined, and $f(s)$ is a continuous monotonic function.
We only consider the special case where $f(s)$ is a linear
function.

Implementation  Parallel  symmetries  are  retrieved  by 
finding  proximate  parallels.  We  do  not  use  the  super 
segment  approach,  because  we  would  depend  on  the  car¬ 
dinality  of  the  super  segments.  By  using  the  parallels 
as  the  building  blocks,  we  can  use  proximity  indexing 
to find parallels which share the same vertices. The acquisition
of the parallel symmetries is performed in two
steps.

1. We record every parallel twice: once with the
sorted list of the vertex coordinates of one side as
key, and once with the sorted list of the vertex
coordinates of the other side as key.

2.  For  every  parallel,  the  possible  neighbor  parallels 
are  retrieved.  The  groups  of  adjacent  parallels  gen¬ 
erate  parallel  symmetries. 


One  example  is  shown  in  Figure  2. 


Figure  2:  Example  of  a  Parallel  Symmetry 
3.4.2  Skewed  Symmetries 

Idea  In  a  skewed  symmetry,  the  point-wise  correspon¬ 
dence  is  such  that  the  axis  of  the  symmetry  is  straight, 
and  the  lines  of  symmetry  are  at  a  constant  angle  (not 
necessarily  orthogonal)  to  the  axis  of  symmetry.  Skew 
symmetry  was  first  proposed  by  Kanade  [11]  and  used  in 
the analysis of scenes of polyhedral objects. An example
is given in Figure 3(a).

Figure 3: Examples of (a) skew symmetry with curved
contours and (b) skew symmetry with straight contours.
The bold curves are axes of symmetry and the dotted
lines are lines of symmetry.

The  detection  of  skew  symmetry  for  curves  was  done 
by  Ponce  [16]  and  Saint-Marc  and  Medioni  [19],  but  these 
methods  are  quite  sensitive  to  noise. 

In our system, we are interested in symmetries between
line segments. Considering all possible symmetries
between line segments as proposed by [8] is expensive.

Our approach is based on finding skew symmetries between
super segments. An example of skew symmetry
for straight line segments is given in Figure 3(b). As we
show in the Appendix, we have $\Delta = |\alpha' - \beta'| < 2\theta$, where
$\theta$ is the skew angle. This allows us to define a local signature
for a super segment, namely the angles, with which
we can find possible symmetry candidates by indexing.
This avoids the expensive pairwise comparison of all super
segments. To test whether two super segments are
skew symmetric, we have to test the following:

1. The difference between the corresponding angles
must be smaller than $2\theta_{max}$.

2.  The  symmetry  axis  has  to  be  straight. 

In  using  proximity  indexing  we  get  an  efficient  algorithm. 
Implementation Skewed symmetries can be retrieved
by finding pairs of super segments with angles of opposite
sign. The detection of the skewed symmetries is
performed in three steps:

1. We record every super segment using the list of the
curvature angles as a key. To allow the guaranteed
detection of symmetries with up to skew angle $\theta_{max}$,
we choose $2\theta_{max}$ as the quantization interval for the
super segment angles.

2. For every super segment, we use the list of opposite
sign curvature angles to retrieve possible symmetry
candidates.

3. Super segment pairs which generate symmetry hypotheses
have to be checked to see whether the corresponding
symmetry axis is straight. We test this
by requiring the scatter matrix of the middle points
of all corresponding vertices to have a high eccentricity
(they all lie along a line).
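
The straightness test of step 3 can be sketched as below. The text does not give a formula for eccentricity, so this sketch assumes sqrt(1 - l_min / l_max) over the eigenvalues of the 2x2 scatter matrix of the midpoints; the threshold 0.99 and the function names are likewise hypothetical.

```python
import math


def axis_eccentricity(vertices_a, vertices_b):
    """Eccentricity of the scatter of midpoints of corresponding vertices."""
    mids = [((ax + bx) / 2.0, (ay + by) / 2.0)
            for (ax, ay), (bx, by) in zip(vertices_a, vertices_b)]
    n = len(mids)
    cx = sum(x for x, _ in mids) / n
    cy = sum(y for _, y in mids) / n
    sxx = sum((x - cx) ** 2 for x, _ in mids)
    syy = sum((y - cy) ** 2 for _, y in mids)
    sxy = sum((x - cx) * (y - cy) for x, y in mids)
    # Closed-form eigenvalues of the symmetric matrix [[sxx, sxy], [sxy, syy]].
    half_trace = (sxx + syy) / 2.0
    root = math.sqrt(((sxx - syy) / 2.0) ** 2 + sxy ** 2)
    major, minor = half_trace + root, half_trace - root
    if major == 0.0:
        return 0.0  # all midpoints coincide, no axis direction
    return math.sqrt(max(0.0, 1.0 - minor / major))


def has_straight_axis(vertices_a, vertices_b, min_eccentricity=0.99):
    """Accept a symmetry hypothesis only if the midpoints lie along a line."""
    return axis_eccentricity(vertices_a, vertices_b) >= min_eccentricity
```

Midpoints that fall exactly on a line give an eccentricity of 1, while midpoints scattered evenly in both directions give 0.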

We show an example for a set of detected symmetries in
Figure 4.


Figure 4: Detected Symmetries and the Corresponding
Axes in Mozart Bust


3.5  Closures 

Lowe [13] makes the following statement: There is a
tendency  for  curves  to  be  completed  so  that  they  form 
enclosed  regions.  Based  on  this  statement,  Mohan  and 
Nevatia  [15]  developed  the  idea  to  close  symmetries  at 
their  ends  to  obtain  so  called  ribbons,  which  form  en¬ 
closed  regions.  They  use  these  ribbons  to  segment  im¬ 
ages. 

We  want  to  use  closures  as  features,  which  do  not  nec¬ 
essarily  have  any  physical  interpretation  in  the  image. 
At  the  moment  we  compute  closures  from  U-Shapes, 
from  closed  curves  and  from  skewed  symmetries. 

3.5.1  Closure  from  U-Shape 

Idea  A  parallel  which  is  closed  at  one  side  by  a  linear 
segment  is  a  strong  indication  that  a  rectangular  struc¬ 
ture  is  at  hand  where  one  side  could  not  be  detected. 
We  therefore  assume  that  we  found  a  closed  contour. 

Implementation U-Shapes can be found by indexing
over the vertex pairs of parallels and trying to find a
segment which forms a U-Shape with the parallel.

1. We record every parallel twice: once using the quantized
vertices of one “side” of the parallel as key, and
once using the vertices of the other side as key.

2.  For  every  linear  segment  we  use  the  list  of  the  end¬ 
points  to  retrieve  possible  U-Shape  candidates. 

3. If the angle between the parallel and the segment is
90° ± 30° we generate a new U-Shape.

3.5.2  Closure  from  Curve 

The  obvious  form  of  a  closure  occurs  if  we  have  a  closed 
curve.  To  detect  a  closure  based  on  a  curve  we  allow  the 
gap  between  start  and  end  of  the  curve  to  be  5%  of  the 
arc  length  of  the  curve. 
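
A minimal sketch of this test, assuming the curve is given as an ordered list of (x, y) points and the arc length is measured along that polyline (the name `is_closed_curve` is hypothetical):

```python
import math


def is_closed_curve(curve, max_gap_fraction=0.05):
    """True if the gap between start and end is at most 5% of the arc length."""
    arc_length = sum(math.hypot(x2 - x1, y2 - y1)
                     for (x1, y1), (x2, y2) in zip(curve, curve[1:]))
    if arc_length == 0.0:
        return False
    gap = math.hypot(curve[-1][0] - curve[0][0], curve[-1][1] - curve[0][1])
    return gap <= max_gap_fraction * arc_length
```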

3.5.3  Closure  from  Skewed  Symmetry 

Idea  We  adopt  the  idea  that  a  segmentation  into  parts 
should  be  done  at  negative  minima  of  curvature  from 
Rom  and  Medioni  [18].  Such  “a  part”  is  used  in  our 
system  as  a  closure. 

Implementation For every skewed symmetry we traverse
the angles of one of its super segments (the
other super segment just has the skewed mirror angles).
Whenever we encounter a sign change of consecutive angles,
we “break” the symmetry at this point and define
the symmetry up to this vertex (together with the corresponding
vertex of the other super segment) as one part.
Applying  this  step  iteratively,  we  generate  alternating 
convex  and  concave  parts.  We  use  the  convex  parts  to 
create  closures.  An  example  can  be  seen  in  Figure  5. 
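
The traversal can be sketched as a split of the angle list into runs of equal sign. Which sign counts as convex depends on the traversal direction; this sketch assumes positive angles are convex, and the function names are hypothetical.

```python
def split_at_sign_changes(angles):
    """Split a list of signed angles into runs of equal sign.

    Returns (start, end, is_convex) triples with `end` exclusive.
    Positive angles are treated as convex (an assumption).
    """
    if not angles:
        return []
    parts, start = [], 0
    for i in range(1, len(angles)):
        if (angles[i] > 0) != (angles[i - 1] > 0):
            parts.append((start, i, angles[start] > 0))
            start = i
    parts.append((start, len(angles), angles[start] > 0))
    return parts


def convex_parts(angles):
    """Index ranges of the convex parts, which are used to create closures."""
    return [(s, e) for s, e, convex in split_at_sign_changes(angles) if convex]
```

Because each break happens at a sign change, the parts returned by `split_at_sign_changes` alternate between convex and concave.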

4  Proximity  Indexing 

We  encounter  two  questions  in  dealing  with  the  genera¬ 
tion  of  perceptual  groups: 

1. How do we cluster initial features based on their
characteristics and geometric relationships in an efficient
way, in order to get combinations which represent
higher level features?

2.  How  do  we  merge  perceptual  features  to  avoid  mul¬ 
tiple  features  with  the  same  perceptual  content? 

Both questions point towards a difference measure.
Given  a  feature,  we  define  its  characteristics  as  a  vector  of 
attribute  values  (e.g.  eccentricity,  orientation,  location). 
The  difference  between  two  features  is  then  defined  as 
the pairwise difference between the attribute values. If
all the differences are small, then the two features are
similar.

Figure 5: Detected Closures in Mozart Bust; (a) shows
the original image, (b) and (c) show five different closures
from skewed symmetries, and (d) shows a closure
from U-shapes, which is a strong hint of a rectangular
structure. We only display the convex hulls.

Given a set of features, how do we find the ones with
similar attribute values? Traditional search methods
which compare every possible pair are very time consuming.
Therefore recent vision systems [12, 21, 22] have
used indexing to find corresponding features with similar
characteristics. The big issue in indexing is the crucial
parameter for the quantization size and the question:
“What happens to values which are close, but fall in different
quantization intervals?” Two features match only
in the case when both keys are exactly the same. This
means that all the pairwise values have to fall into the
same quantization intervals. Suppose the range is quantized
into intervals of the same size $q$, each value $v$
is assigned a key based on which interval $v$ falls into,
and $v$ is corrupted by a random additive term bounded
by $\epsilon$. Then the probability that a corrupted vector of
$n$ values is assigned the same entry as the original one is [22]

$$p^n(k \mid v) = (1 - \epsilon/q)^n.$$


Looking at this equation, it is obvious that the probability
of matching long keys decreases rapidly.
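
To make the decay concrete, the following short sketch evaluates the expression, read here as (1 - epsilon/q)^n with n the number of values in the key. The function name and the sample numbers are illustrative assumptions, not values from the paper.

```python
def same_key_probability(epsilon, q, n):
    """Probability that all n noisy values stay in their quantization interval."""
    return (1.0 - epsilon / q) ** n


# Noise of 2 units on intervals of size 20, for keys of growing length.
for n in (1, 3, 6, 12):
    print(n, round(same_key_probability(2.0, 20.0, n), 3))
```

With these sample numbers a single value keeps its key with probability 0.9, but a key of six values survives only about 53% of the time.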

As  Flynn  and  Jain  [5]  point  out,  it  is  essential  to  have 
an  indexing  scheme  that  preserves  proximity  in  the  key 
values.  So  far,  two  strategies  based  on  indexing  have 
been  used  to  deal  with  this  problem:  large  bucket  size 
and  searching  of  neighboring  bins.  While  large  bucket 
size  is  based  on  the  hope  that  “less  values  will  fall  into 
the  incorrect  bin”,  the  search  of  neighboring  bins  has  an 
exponential  complexity  with  respect  to  the  number  of 
false  value  matches. 

We propose an alternative approach: Instead of using
all  features  with  all  quantizations  as  a  large  alphabet, 
we  break  the  features  apart  and  use  indexing  on  every 
value  separately. 


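Encoding: To assure proximity we store the feature
for every value $v$ twice:

1. under the key $[\lfloor v \rfloor_q, \lceil v \rceil_q]$

2. under the key $[\lfloor v \rfloor_{q'}, \lceil v \rceil_{q'}]$

with

$$\lfloor v \rfloor_{q'} = \begin{cases} \lfloor v \rfloor_q + q/2 & \text{if } v - \lfloor v \rfloor_q > q/2, \\ \lfloor v \rfloor_q - q/2 & \text{otherwise,} \end{cases}$$

$$\lceil v \rceil_{q'} = \begin{cases} \lceil v \rceil_q + q/2 & \text{if } v - \lfloor v \rfloor_q > q/2, \\ \lceil v \rceil_q - q/2 & \text{otherwise.} \end{cases}$$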

Retrieval: The retrieval of a feature $f$ is broken up
into the retrieval of every value. Every value is quantized
twice (see above) and the stored features for both
intervals are retrieved and combined into one set. For
all values we get such a feature set. The intersection
of all these sets results in the features which are close
to $f$ in all their values. Due to the interlaced quantization
we can guarantee to retrieve all features with
$|v_{stored} - v_f| \le q/2$, and we get some feature matches
with $q/2 < |v_{stored} - v_f| < q$. The intersection process
can be sped up by intersecting the sets in increasing
cardinality.
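
The encoding and retrieval steps can be sketched as below. This is an illustrative reading of the scheme, not the authors' implementation: it assumes the second quantization is the first one shifted by q/2, uses one hash table per attribute, and the class name `ProximityIndex` and its methods are hypothetical.

```python
import math
from collections import defaultdict


class ProximityIndex:
    """Per-value indexing with two interlaced quantizations of size q."""

    def __init__(self, n_values, q):
        self.q = q
        # One hash table per attribute position of the feature vector.
        self.tables = [defaultdict(set) for _ in range(n_values)]

    def _keys(self, v):
        """Keys of the aligned interval and of the half-shifted interval."""
        k = math.floor(v / self.q)
        # Shift up if v lies in the upper half of its aligned interval,
        # otherwise shift down (the lower shifted interval has index k).
        shifted = k + 1 if v - k * self.q > self.q / 2.0 else k
        return ("aligned", k), ("shifted", shifted)

    def store(self, feature_id, values):
        for table, v in zip(self.tables, values):
            for key in self._keys(v):
                table[key].add(feature_id)

    def retrieve(self, values):
        """Features that share an interval with the query in every value."""
        candidate_sets = []
        for table, v in zip(self.tables, values):
            found = set()
            for key in self._keys(v):
                found |= table.get(key, set())
            candidate_sets.append(found)
        # Intersect in increasing cardinality to shrink the result early.
        candidate_sets.sort(key=len)
        result = set(candidate_sets[0])
        for other in candidate_sets[1:]:
            result &= other
            if not result:
                break
        return result


index = ProximityIndex(n_values=1, q=20)
index.store("a", [19])
index.store("b", [21])
index.store("c", [45])
print(sorted(index.retrieve([20])))
```

In the usage example the stored values 19 and 21 lie on opposite sides of an aligned interval boundary, so a single quantization would separate them; the shifted interval brings both back for the query value 20.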


Complexity: What do we pay for using proximity indexing
compared to traditional indexing? Let n be the
number  of  features  to  store.  We  assume  that  every  fea¬ 
ture  has  as  a  key  an  attribute  vector  of  fixed  length. 
Every  value  is  quantized  in  a  fixed  number  of  intervals. 
Furthermore  we  assume  that  the  features  are  equally  dis¬ 
tributed  over  the  table  which  consists  of  r  records.  The 
cost  of  using  proximity  indexing  is  summarized  in  the 
following  table: 


                        search    indexing    proximity indexing
preserves proximity     yes       no          yes
encoding complexity     0         O(n)        O(n) (a), O(n log(f)) (b)
retrieval complexity    O(n)      O(n)        [illegible]

(a) entries are unsorted; (b) entries are sorted


All tables mentioned in this Section are implemented
as hash tables.


5  Recognition 

So  far  we  described  the  clustering  and  grouping  of  very 
basic  features  to  higher  level  groupings.  But  what  are 
the  advantages  of  going  higher  in  the  feature  hierarchy? 
As we mentioned above, we want to represent a 3-D
model with multiple 2-D views. But how many views
do we need? Making quantitative statements about the
number of views which we require for an object is not
possible. A qualitative statement can be illustrated with
an example. In Figure 6(a) we show a cluttered scene, in
which we want to detect the phone. Figure 6 also shows two
different views of the phone. In Figure 6(b) the view is
very similar to the view in the scene. Figure 6(c) shows a
side view of the phone. This corresponds to a rotation of
approximately 70° from the view in the scene. By using
different features for solving the correspondence problem
between the scene and the phone we can show the performance
of different features in the feature hierarchy.

Figure 6: Example Phone

In the following, we show two approaches:

1. We discuss an attempt to match two views with
primitives  based  on  local  groupings. 

2.  We  propose  an  object  recognition  system  which  is 
based  on  high  level  perceptual  groupings  (see  Fig¬ 
ure  1). 

We analyze both approaches and point out the advantages
of our proposed feature hierarchy.

Since different models and different views are treated
the same way by our object recognition system,
we will use both terms interchangeably in the following
sections.

5.1  Correspondence  based  on  Super  Segments 

The first example of finding the correspondence with
super segments is based on an object recognition system
which was designed for the recognition of multiple
flat objects [21]. Super segments are computed from the
linked image edges and provide the essential mechanism
for indexing and retrieval. We use the quantized angles
and the eccentricity as a key and record the super segment
in a table. The recognition proceeds by computing
the super segments of the scene. Each scene super segment
retrieves model hypotheses from the hash table.
Hypotheses are clustered if they are mutually consistent
in their geometric relationship, and represent the
instance of a model. This methodology allows us to recognize
2-D models from a 2-D scene in the presence of
noise, occlusion, scale, rotation, translation and weak
perspective.

Consistency between matching hypotheses requires stable
geometric relationships. It is obvious that these geometric
relations between parts of the image are only
valid in similar views. The projection of the view
onto the scene is shown in Figure 7.


Figure  7:  Recognition  based  on  Super  Segments 


Analysis: The matching presented here succeeds only
because the 2-D views of the models are quite similar. In
fact, the method fails to identify view #2 in the image.
We therefore need a large number of views (a few hundred
for a full model!) for each object. This would be
a very cumbersome and expensive (space and time) approach!
Furthermore, we encounter a rapid growth in the
number of hypotheses which have to be verified when we
increase the number of models in the database. Because
the time complexity of the verification grows with n,
the number of generated hypotheses, the
recognition process would become unacceptably slow.

5.2  Correspondence  based  on  High  Level 
Features 

Here instead we use high level features to find the correspondence.
Instead of using geometric relationships between
features we exploit topologic relationships, such as
adjacency, proximity, intersection, etc. These relations
between features are more stable with respect to the 3-D
viewpoint than properties based on 2-D geometry.

The high level features are generated by using features
from the perceptual grouping stage (see Figure 1). Each
such feature is approximated by its convex hull. These so
called convex approximations (CA) are grouped based on
closure and proximity; two features are defined as neighbors
if they share a common line segment (or common
vertex). To find “high level closures” we construct an
adjacency graph. In this adjacency graph we look for
cycles which represent a closure between CAs.
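
The adjacency graph and the cycle search can be sketched as follows. The sketch assumes that each CA is given as a list of hull vertices with an integer label, that two CAs are neighbors when they share an exact vertex, and that cycles are searched only up to a small length; the function names are hypothetical.

```python
from itertools import combinations


def build_adjacency(convex_approximations):
    """Map each CA label to the labels of CAs sharing a hull vertex with it."""
    adjacency = {label: set() for label in convex_approximations}
    for a, b in combinations(convex_approximations, 2):
        if set(convex_approximations[a]) & set(convex_approximations[b]):
            adjacency[a].add(b)
            adjacency[b].add(a)
    return adjacency


def find_cycles(adjacency, max_length=4):
    """Return the node sets of all simple cycles with 3..max_length CAs."""
    cycles = set()

    def extend(path):
        for nxt in adjacency[path[-1]]:
            if nxt == path[0] and len(path) >= 3:
                cycles.add(frozenset(path))
            elif nxt not in path and nxt > path[0] and len(path) < max_length:
                # Only grow through labels above the start label, so every
                # cycle is discovered from its smallest member.
                extend(path + [nxt])

    for start in sorted(adjacency):
        extend([start])
    return cycles


hulls = {
    0: [(0, 0), (2, 0), (1, 2)],
    1: [(2, 0), (4, 0), (3, 2)],
    3: [(1, 2), (3, 2), (2, 4)],
    2: [(0, 0), (-2, 0), (-1, -2)],
}
print(find_cycles(build_adjacency(hulls)))
```

In the usage example CAs 0, 1 and 3 touch pairwise and form one cycle, while CA 2 touches only CA 0 and therefore joins no cycle.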

These groupings can be encoded based on attributes
such as the connectivity of the underlying CAs, the two-dimensional
topologic relationships between them, and
the number of corners of the underlying CAs. Using the
code as a key, we record the grouping in a table. A scene
is processed in the same way, and by using the resulting
high level scene groupings we are able to retrieve corresponding
groupings from the table.

In the example in Figure 8 we show such a grouping
which consists of four CAs, namely a cycle of CAs (labeled
0,1,3) and another CA (label 2) which is adjacent
to CA 0 and CA 1. By allowing a certain tolerance in
the number of corners, we can match these two groupings,
despite the fact that CA 2 has four corners in the
model and five corners in the scene. The recognition of
the view #2 in Figure 6(c) in the scene is shown in Figure 8(b).
Instead of the projection of the phone itself
(it would look too warped) we show the corresponding
groupings.

Figure 8: Recognition based on High Level Groupings. Panel (b): Scene

By using groupings which include no geometric relationships
in the matching process, we can find correspondences
between very different views. This implies
that, for the purpose of object recognition, we need
far fewer two-dimensional views to represent a three-dimensional
model than in the approach discussed in
Section 5.1. Furthermore, we claim that such high level
features are highly discriminative. Firmly establishing
this is part of our ongoing research.

6  Conclusions 

In this paper, we have developed an approach to use
perceptual organization for the purpose of object recognition,
and shown some promising results. The major distinguishing
aspects of our system compared to previous
approaches are:

• For our object description we use multiple representations.

• We increase intermediate results to increase the overall
reliability of feature detection.

• We expand and contract the number of features.

• Proximity Indexing and the use of linear segments
lead to an efficient implementation.

• We implemented a first stage of using high level
grouping.

• No Early Commitment. Our perceptual grouping is
purely data driven. We do not try to resolve any
ambiguities and the groupings do not necessarily
lead to a single physical interpretation. By using
the feature groups for recognition purposes we make
effective use of the underlying organization.

• Our system deals with topological relations, not with
spatial correspondences.

• By using a set of different views to represent a model
we can deal with incomplete model descriptions.

Our future work aims at answering the following questions:

• What happens when the system does not find corresponding
high level groupings (e.g. due to heavy
occlusion)? We want to focus on this point by developing
a multilevel matching, which allows the system
to “fall back” on lower level features in order
to find correspondences.

• We want to extend the feature hierarchy by including
perceptual organization between high level features.
So far we use only proximity and closure to
generate high level groupings. We are investigating
the possibility of including symmetry and parallelism.

• There are several other features and grouping
strategies which we ignore so far: continuation, texture,
saliency ... Including these features would enrich
the descriptive and discriminative power of our
feature hierarchy.

Our preliminary work and the corresponding results in
this paper have demonstrated the viability of this approach.


Appendix

We claim that

$$\Delta = |\alpha' - \beta'| < 2\theta. \qquad (1)$$

For the sake of simplicity let $\alpha' = \alpha$ and $\beta' = \beta$, then
(1) becomes

$$\Delta = |\alpha - \beta| = |\alpha_1 + \alpha_2 - \beta_1 - \beta_2| = |(\alpha_1 - \beta_1) + (\alpha_2 - \beta_2)| < 2\theta. \qquad (2)$$
To prove (2), we show that

$$-2\theta < (\alpha_1 - \beta_1) < 0 \qquad (3)$$

and

$$0 < (\alpha_2 - \beta_2) < 2\theta. \qquad (4)$$

First let us focus on equation (3). We know from Figure 9(b) that

$$\alpha_1 + \pi/2 + \theta + \gamma_1 = \pi \qquad (5)$$

and

$$\beta_1 + \pi/2 - \theta + \gamma_2 = \pi. \qquad (6)$$

Subtracting (5) from (6) results in

$$(\beta_1 - \alpha_1) + (\gamma_2 - \gamma_1) = 2\theta. \qquad (7)$$

Furthermore we know that

$$\gamma_2 + \gamma_1 < \pi. \qquad (8)$$

By showing that $\gamma_2 > \gamma_1$ we can show that (3) is true.
The proof of $\gamma_2 > \gamma_1$ is divided into two cases:

1. $\gamma_2 > \pi/2$: Considering equation (8) we get $\gamma_2 > \gamma_1$.

2. $\gamma_2 < \pi/2$: Based on projective geometry we get
$$\gamma_2 > \gamma_1 \Leftrightarrow m_2 > m_1$$
with $m_2 = m_1 + c$. If we can show that $c > 0$, we
show that $\gamma_2 > \gamma_1$. We know that
$$c = b \tan\theta. \qquad (9)$$
Furthermore we get from the projective geometry a
relation between $a$ and $b$:
$$a/l = b/2l \Leftrightarrow b = 2a. \qquad (10)$$
Since $a = l \sin\theta$ we get from equation (9) and (10)
$c = 2l \sin\theta \tan\theta$ and therefore $c > 0$ for $0 < \theta < \pi/2$.

Therefore equation (3) is true. Equation (4) can be
shown in the same way.

Q.E.D.
Abstract

This  paper  addresses  the  problem  of  effectively  matching  a 
single  2D  image  of  a  potentially  cluttered  scene  to  a  library 
containing  multiple  polyhedral  objects.  In  our  approach  to 
recognition,  the  search  for  matches  to  3D  objects  from  a  mul¬ 
tiple  object  library  is  optimised  by  generating  descriptions  of 
the  projections  of  the  objects  from  expected  views,  organis¬ 
ing  all  the  descriptions  into  a  single  network  representation, 
and then, during the recognition phase, finding matches between
the resulting view description network and the input
image. Our design for the recognition phase process is presented
and demonstrated on images containing multiple objects
and outdoor scenes. The efficiency of the system and
the general approach are discussed.

1  Introduction 

Our research objective is a system capable of recognising
modelled polyhedra from 2D images of cluttered scenes.
The  system  is  presented  with  a  library  containing  multi¬ 
ple  objects  and  an  image  of  a  scene  that  may  contain 
any  combination  of  the  objects  in  arbitrary  positions 
with  respect  to  the  viewer.  The  goal  of  the  system  is 
to  find  the  correct  3D  matches  to  all  objects  in  the  scene 
for  which  there  is  sufficient  evidence  in  the  image,  where 
a  3D  match  is  an  assignment  of  model  features  to  im¬ 
age  features  and  the  estimated  coordinate  transforma¬ 
tion  between  model  and  camera  (pose)  that  best  aligns 
the  assigned  features.  The  recognition  system  must  be 
designed  to  find  the  correct  3D  matches  in  an  efficient 
manner.  A  3D  match  is  completed  and  verified  by  the 
potentially  costly  process  of  searching  for  image  line  seg¬ 
ments  that  comprise  sufficient  evidence  for  the  match.  In 
addition,  the  number  of  possible  3D  matches  can  be  very 
large;  for  the  images  and  objects  studied  in  this  paper, 
there  are  tens  of  billions.  Therefore,  it  is  essential  for  the 
efficiency  of  the  system  that  the  number  of  3D  matches 
selected  for  completion  be  minimised. 

In our approach, the search for matches to 3D objects
from a multiple object library is optimised by generating
descriptions of the projections of the objects from
expected views, organising all the descriptions into a single
network representation, and finding matches between
the resulting view description network and the input image.
The contributions of our research have been in the
automatic compilation of network descriptions of object
views [4, 6] and the analysis of the usefulness of 2D features
for 3D object discrimination under view variation
[5, 6]. In the latter, it was established that there does not
exist a feature (function of the projection) that is
view-invariant for arbitrary point sets. Since view invariants
are not guaranteed for any given library of 3D objects, it
is imperative that systems be developed that are capable
of utilising view-varying features as well [2, 6]. This
paper discusses effective strategies for matching view
description networks to images and further motivates the
overall approach.

Figure 1: Objects used to demonstrate the system.

*This research was supported by the Defense Advanced
Research Projects Agency (via TACOM) under contract
DAAE07-91-C-R035.

2  The  description  network  approach  to 
object  recognition 

Objects  in  a  library,  such  as  in  Fig.  1,  will  typically  have 
a  variety  of  similarities  and  differences  which  must  be 
considered  while  optimising  the  selection  of  the  correct 
matches.  In  general,  objects  can  be  differentiated  us¬ 
ing  combinations  of  two  types  of  features:  metric  and 
structural.  A  metric  feature  is  any  measurement  such 
as  length  and  angle  that  can  be  defined  for  a  set  of  pro¬ 
jected  elements  in  the  image,  which  in  our  task  are  2D 
line segments. Clearly, this type of feature can be useful
for discrimination; for example, the two triangular
prisms in Fig. 1 have different proportions. This
differentiating property can be measured in terms of the length
ratio of the bold line segments labelled 2 and 5, which is
capable of distinguishing the two prisms from almost all
of their views [5, 6].

Structural  features  represent  how  elements  that  make 
up  an  object,  such  as  straight  line  segments,  are  con¬ 
nected  or  otherwise  organised.  For  example,  in  Fig.  1, 
some objects have line segments assembled into triangles
and others into pentagons or parallelograms, and these
assemblages  themselves  are  connected  into  larger  struc¬ 
tures.  Clearly,  the  identification  or  detection  of  these 
structural  features  can  help  in  the  discrimination  of  the 
objects.  The  identification  of  structural  features  in  the 
image  is  important  in  another  way:  matched  structural 
descriptions  provide  a  context  in  which  metric  features 
can  be  measured  and  meaningfully  used  for  discrimina¬ 
tion.  For  example,  the  length  ratios  considered  useful  in 
Fig.  1  are  meaningless  if  we  do  not  know  which  pair  of 
image  segments  are  to  be  measured.  There  may  be  an 
enormous  number  of  ordered  pairs  of  line  segments  in  a 
cluttered  image,  and  each  pair  produces  its  own  length 
ratio.  The  image  segment  pair  whose  ratio  provides  the 
discriminating  length-to-width  proportion  must  be  iden¬ 
tified,  and  this  can  be  done  by  first  matching  a  structural 
description. 
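The growth in candidate measurements can be made concrete with a small sketch. It assumes segments are given as endpoint pairs (a representation chosen here for illustration, not taken from the paper) and simply enumerates the length ratio of every ordered pair.

```python
import math
from itertools import permutations

def length(segment):
    """Euclidean length of a segment given as ((x0, y0), (x1, y1))."""
    (x0, y0), (x1, y1) = segment
    return math.hypot(x1 - x0, y1 - y0)

def ordered_pair_ratios(segments):
    """Length ratio len(i) / len(j) for every ordered pair (i, j), i != j."""
    return {
        (i, j): length(segments[i]) / length(segments[j])
        for i, j in permutations(range(len(segments)), 2)
    }

# Three made-up image segments with lengths 5, 10 and 2.
segments = [((0, 0), (3, 4)), ((0, 0), (10, 0)), ((1, 1), (1, 3))]
ratios = ordered_pair_ratios(segments)
print(len(ratios))  # n * (n - 1) ordered pairs for n segments
```

With n segments there are n(n - 1) ordered pairs, so a cluttered image with a few hundred segments already yields tens of thousands of ratios, only one of which is the discriminating one.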

The  importance  of  structural  features  for  discrimina¬ 
tion  must  be  stressed,  for  their  use  has  a  fundamental 
effect  on  the  design  of  a  multi-object  recognition  system. 
Since  identifying  structural  features  in  the  image  means 
searching  for  matches  to  non-trivial  structural  descrip¬ 
tions,  an  important  part  of  the  object  discrimination 
process  is  an  optimised  search  for  these  matches.  Or¬ 
ganising  the  structural  descriptions  is  an  important  step 
towards  optimising  this  search,  an  approach  stressed  in 
this study and related work [1, 7].

In  our  design,  object  information  is  organised  into 
description  networks  where  parts  or  geometric  aspects 
shared  by  objects  are  explicitly  represented  as  nodes  in 
a  network,  with  direct  or  indirect  links  to  all  the  objects 
characterised  by  them.  Objects  are  then  recognised  via 
the  network  by  a  process  referred  to  here  as  recursive 
indexing.  In  this  strategy,  indexing  of  object  informa¬ 
tion  takes  place  in  stages;  each  indexing  step  identifies 
important  substructures  (parts)  in  the  image  which  are 
in  turn  used  as  structural  features  to  index  more  com¬ 
plex  descriptions,  until  descriptions  to  specific  objects 
are  indexed  and  successfully  matched. 
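A minimal sketch of the staged indexing idea follows. It works on model names only and ignores all geometry; the `COMBINE` table and the function name `recursive_index` are hypothetical, and only the part/whole relations named in the text (PARALLELOGRAM from U-SHAPE and CORNER, 2D-PRISM from TRIANGLE and PARALLELOGRAM, 2D-HOUSE from PENTAGON and PARALLELOGRAM) are used.

```python
# Hypothetical combination table: a pair of matched parts indexes a more
# complex description. Geometry and probabilities are deliberately omitted.
COMBINE = {
    frozenset(("U-SHAPE", "CORNER")): "PARALLELOGRAM",
    frozenset(("TRIANGLE", "PARALLELOGRAM")): "2D-PRISM",
    frozenset(("PENTAGON", "PARALLELOGRAM")): "2D-HOUSE",
}

def recursive_index(found_parts, combine):
    """Repeatedly use already-identified parts to index more complex ones."""
    found = set(found_parts)
    changed = True
    while changed:
        changed = False
        for parts, successor in combine.items():
            if parts <= found and successor not in found:
                found.add(successor)   # this part can now index further models
                changed = True
    return found

indexed = recursive_index({"U-SHAPE", "CORNER", "TRIANGLE"}, COMBINE)
print(sorted(indexed))
```

Each pass uses the parts found so far as composite features for the next indexing step, which is the property that distinguishes this scheme from single-level indexing.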

This  approach  to  recognition  is  in  contrast  to  two 
other  important  recognition  strategies  that  have  recently 
seen  increased  development:  the  single-level,  geomet¬ 
ric  hashing  methods  [13,  14],  and  the  use  of  interpre¬ 
tation  trees  [9,  10].  Recursive  indexing  could  be  a  sig¬ 
nificant  improvement  over  the  unstructured,  or  single- 
level,  hashing  methods.  In  single-level  systems,  simple 
image  feature  combinations  are  used  to  directly  index 
the  objects  in  the  library.  As  indicated  by  the  experi¬ 
mental  work  reported  in  [14],  the  object-specificity  of  the 
features  greatly  affects  the  number  of  objects  retrieved, 
and  the  simple  combinations  of  features  used  in  their  ex¬ 
periments  were  not  sufficient  to  avoid  a  saturation  effect 
(where  the  number  of  objects  retrieved  grows  at  least  lin¬ 
early  with  the  size  of  the  object  library).  The  specificity 
problem  may  become  more  manageable  using  a  recur¬ 
sive,  multilevel  procedure,  primarily  because  combina¬ 
tions  of  structural  parts  identified  in  earlier  steps  could 
provide  the  system  with  a  rich  set  of  composite  features 
for  indexing  at  the  next  step. 

It is also important to distinguish the recursive indexing
of descriptions in a network from the use of interpretation
trees. An interpretation tree search when clutter
is present in the image may be prohibitively inefficient
[9]. The recursive indexing strategy supported by networks,
however, has a fundamentally different behavior
that may mean much greater efficiency. Instead of
methodically searching a large portion of a tree for possible
interpretations, the evidence from matches to multiple
parts and the convergent structure of the description network
are used together to provide a potentially focused
search for the correct interpretation. The experiments
in cluttered images presented in this paper provide some
demonstration of this.

Figure 2: View description network for objects in Fig. 1.
(a) Each 2D model node in the network is represented in
the figure by a line-segment example that satisfies the 2D
model. Dashed lines connecting line segments indicate metric
features that are functions of the pair of specified line
segments. (b) The feature probability densities given the
different 2D models discriminated by them. Each density
function is stored in the appropriate 2D model node and is
inherited by all of its successors in the network.

3  View  description  networks 

View description is an important approach to recognizing
3D objects in 2D images. In this method, prior to
recognition, descriptions of the projections of each object
from distinct views are generated; then, during recognition,
these 2D expectations are matched to the image. A
view description may be valid for the object’s projection
over a range of views, as long as there is a description
associated with each view and the descriptions of each
object are distinct from those of other objects. This
method is useful since it does not depend on reliable 3D
scene reconstruction. It also facilitates the use of image
information sufficient for recognition [2, 5] and the efficient
matching of multiple 3D objects to cluttered 2D
images through the organization of the view descriptions
into networks. Similar approaches can be found in [7, 8],
and also in [10], though the system of Ikeuchi is not for
single intensity images.

In  our  system,  a  view  description  is  a  relational  graph 
that  represents  the  discriminating  features  for  some  ob¬ 
ject  and  range  of  view.  Each  distinct  element  in  the 
graph  represents  a  line  segment,  and  the  relations  (arcs) 
are  features  defined  over  the  associated  line  segments. 
The  relation  stores  the  feature  type,  such  as  line  segment 
length  ratio  s  or  relative  orientation  a,  and  the  proba¬ 
bility  densities  of  the  feature  given  the  associated  pair  of 
object  line  segments  and  a  uniform  sampling  across  the 
range  of  view. 

The  view  descriptions  are  organized  into  a  network, 
where  each  terminal  node  corresponds  to  an  object- 
specific  view  description.  The  view  descriptions  are  re¬ 
cursively  built  up  from  smaller,  simpler  relational  graphs 
associated  with  intermediate  nodes  in  the  network  via 
combination  and  specialization  links.  We  will  refer  to 
the  information  stored  in  any  node  as  a  2D  model.  Fig.  2 
shows  a  network  automatically  constructed  to  recognize 
the  objects  in  Fig.  1.  The  combination  link  specifies  how 
a  set  of  part  descriptions  are  combined  into  more  com¬ 
plex,  object-specific  ones;  currently,  combinations  are 
formed  of  pairs  of  parts.  For  example,  in  Fig.  2(a), 
the model 2D-PRISM is represented as a combination
of  2D  models  TRIANGLE  and  PARALLELOGRAM, 
which  are  isomorphic  to  two  of  its  subgraphs.  Special¬ 
ization  links  between  a  2D  model  node  and  its  network 
successors  specify  the  addition  of  new  relations  to  the 
2D  model;  in  other  words,  new  features  (see  Fig.  2a, 
dashed  labelled  arcs)  and  their  probability  density  func¬ 
tions  given  the  object  associated  with  each  successor 
node (see Fig. 2b). For example, the 2D-PRISM model is
associated with two objects (short-prism and tall-prism),
and  each  is  assigned  a  successor  containing  a  new  rela¬ 
tion:  the  feature  s  for  element  pair  (2,  5)  and  its  density 
function  for  the  relevant  object.  (New  relations  can  also 
be  added  during  combination,  see  Fig.  2b.)  Discrim¬ 
ination  of  the  objects  based  on  the  stored  probability 
information  is  discussed  below  in  Section  4.2. 
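One possible way to hold such a network in memory is sketched here. The class names `Relation` and `Model2D`, the `uniform` helper, and the numeric density ranges are all assumptions made for illustration; the paper stores densities obtained by sampling views, not uniform intervals, and the sketch ignores the renumbering of line elements between models.

```python
from dataclasses import dataclass, field
from typing import Callable, List, Tuple

@dataclass
class Relation:
    feature: str                       # e.g. "s" for line segment length ratio
    pair: Tuple[int, int]              # model line elements the feature spans
    density: Callable[[float], float]  # p(feature value | this model)

@dataclass
class Model2D:
    name: str
    predecessors: List["Model2D"] = field(default_factory=list)
    relations: List[Relation] = field(default_factory=list)

    def all_relations(self):
        """Own relations plus those inherited from every predecessor."""
        inherited = [r for p in self.predecessors for r in p.all_relations()]
        return inherited + self.relations

def uniform(lo, hi):
    """Placeholder density: uniform on [lo, hi], zero elsewhere."""
    return lambda x: 1.0 / (hi - lo) if lo <= x <= hi else 0.0

triangle = Model2D("TRIANGLE")
parallelogram = Model2D("PARALLELOGRAM")
# Combination link: 2D-PRISM is built from two part models.
prism = Model2D("2D-PRISM", predecessors=[triangle, parallelogram])
# Specialization links: each successor adds feature s on element pair (2, 5).
short_prism = Model2D("short-prism", [prism], [Relation("s", (2, 5), uniform(0.2, 0.8))])
tall_prism = Model2D("tall-prism", [prism], [Relation("s", (2, 5), uniform(1.5, 3.0))])
```

A measured ratio of 0.5 on the pair (2, 5) would then have non-zero density under `short_prism` and zero density under `tall_prism`, which is the kind of evidence the evaluation step uses.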

4  Matching  images  to  view  description 
networks 

Given a view description network, the goal of the recognition
phase of the system is to identify objects in images
by efficiently searching for matches between image line
segments and object view descriptions (2D model nodes)
stored in the network. It was argued above that recursive
indexing is an effective way to recognize objects using a
description network. This approach is best understood as
recursive 2D match extension followed by 3D match completion
and verification. Match extension is the creation
of a match to a 2D model (a 2D match) from matches to
its predecessors in the network, where a 2D match is the
assignment of the line elements in the 2D model to line
segments detected in the image. For example, given the
network in Figs. 2(a) and 3(a), matches to TRIANGLE
and PARALLELOGRAM (Fig. 3b) can be combined
into a match to 2D-PRISM (Fig. 3c).

Figure 3: An example of match extension. (a) Relevant
portion of Fig. 2. (b) Matches to TRIANGLE and
PARALLELOGRAM. (c) A match to their common successor
2D-PRISM, created by (d) composing each predecessor node
match with the model-to-model maps specified in the network
and then merging the two resulting maps.
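The compose-then-merge step described for Fig. 3 can be sketched as follows. A match is represented here as a dictionary from model line elements to image segment labels; the element numbering, the segment labels, and the function names are invented for the example.

```python
def compose(match, model_map):
    """Re-express a predecessor match in the successor's line elements."""
    return {model_map[elem]: seg for elem, seg in match.items()}

def merge(a, b):
    """Merge two maps; return None if they disagree or reuse a segment."""
    merged = dict(a)
    for elem, seg in b.items():
        if merged.setdefault(elem, seg) != seg:
            return None              # same model line, different image segment
    if len(set(merged.values())) != len(merged):
        return None                  # one image segment given to two model lines
    return merged

def extend(match1, map1, match2, map2):
    return merge(compose(match1, map1), compose(match2, map2))

# Hypothetical numbering: triangle lines 1-3, parallelogram lines 1-4,
# prism lines 1-6; the two parts share prism line 3.
tri_match = {1: "a", 2: "b", 3: "c"}
par_match = {1: "c", 2: "d", 3: "e", 4: "f"}
tri_to_prism = {1: 1, 2: 2, 3: 3}
par_to_prism = {1: 3, 2: 4, 3: 5, 4: 6}

prism_match = extend(tri_match, tri_to_prism, par_match, par_to_prism)
print(prism_match)
```

Returning `None` on an inconsistent merge is one simple way to reject pairs of part matches that cannot belong to the same 2D-PRISM instance.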

The  design  of  the  recognition  system  follows  naturally 
from  the  recursive  nature  of  matching  to  view  description 
networks.  The  process  is  initialized  by  detecting  line 
segments  in  the  image  and  generating  promising  matches 
to  the  initial,  simple  2D  model  nodes  in  the  network. 
The system then searches for correct matches to the more
complex 2D models, and eventually to the 3D models,
by iteratively executing the following three steps:

1.  Extend or verify the selected 2D matches, depending
on the type of match.

(a)  Matches  to  2D  model  nodes  in  the  terminal  por¬ 
tions  of  the  network  are  associated  with  3D  ob¬ 
jects;  when  selected,  the  system  attempts  to 
verify  them  by  computing  the  3D  match. 

(b)  Otherwise,  the  system  attempts  to  extend  the 
match  to  a  more  complex  2D  model  node 
match. 

2.  Evaluate and incorporate the resulting 3D or 2D
matches into the current state of the system.

(a)  All new 2D matches are added to the pool of
matches that may be extended or verified in
future cycles.

(b)  If a 3D match is verified, it is output from the
system, and competing 2D model matches are
eliminated. A match is competing if it assigns
the same image segments.

3.  Select the best 2D matches for extension and
verification in the next cycle.

Figure 4: An example of extension to an indirect successor
in the network of Fig. 2; see text. (a) Matches to C-SHAPE,
U-SHAPE and 2D-HOUSE. (b) A portion of the network,
with relevant nodes and links in bold.

In  the  experiments,  the  process  was  made  to  termi¬ 
nate  on  discovery  of  all  the  correct  3D  matches.  The 
organisation  of  the  system  into  the  above  three  steps 
naturally  structures  the  discussion  into  three  parts:  2D 
match  extension,  evaluation  and  selection.  Verification 
by  3D  match  completion  is  also  clearly  important,  but 
is  outside  the  scope  of  this  paper;  an  implementation  of 
the  3D  pose  algorithm  of  Kumar  [11,  12]  was  used  in  the 
demonstrations. 
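The three-step cycle above can be sketched as a small loop. This is a toy under stated assumptions, not the system described: matches are `(model name, set of image segments)` tuples, `verify` stands in for 3D pose computation, selected matches are removed from the pool (the actual system allows re-selection), and `COMBINE`, `TERMINAL`, `extend` and `recognize` are hypothetical names.

```python
COMBINE = {frozenset(("TRIANGLE", "PARALLELOGRAM")): "2D-PRISM"}
TERMINAL = {"2D-PRISM"}   # nodes associated with 3D objects

def extend(match, pool):
    """Step 1(b): combine `match` with every compatible partner in the pool."""
    name, segs = match
    out = []
    for other_name, other_segs in pool:
        successor = COMBINE.get(frozenset((name, other_name)))
        if successor is not None:
            out.append((successor, segs | other_segs))
    return out

def recognize(initial, verify, priority, batch=2, max_cycles=50):
    pool, verified = list(initial), []
    for _ in range(max_cycles):
        selected = sorted(pool, key=priority, reverse=True)[:batch]  # step 3
        if not selected:
            break
        new_matches = []
        for match in selected:
            if match not in pool:        # eliminated earlier in this cycle
                continue
            pool.remove(match)
            if match[0] in TERMINAL:     # step 1(a): attempt verification
                if verify(match):        # step 2(b): output, drop competitors
                    verified.append(match)
                    pool = [m for m in pool if not (m[1] & match[1])]
            else:
                new_matches += extend(match, pool)
        pool += [m for m in new_matches if m not in pool]  # step 2(a)
    return verified

initial = [("TRIANGLE", frozenset("abc")), ("PARALLELOGRAM", frozenset("cdef"))]
result = recognize(initial, verify=lambda m: True, priority=lambda m: len(m[1]))
print(result)
```

Here a competitor is taken to be any pooled match sharing an image segment with the verified one, which is one reading of "assigns the same image segments".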

4.1  Match  extension 

As  discussed  above,  the  matching  of  descriptions  orga¬ 
nised  into  networks  assumes  the  form  of  recursive  exten¬ 
sion.  Given  a  pair  of  existing  2D  matches,  their  exten¬ 
sion  has  two  steps:  retrieve  2D  models  to  which  they 
can  be  extended,  and  then,  compute  the  line  segment 
assignments  for  the  new,  extended  match  as  in  Fig.  3. 

Complications  can  occur  when  key  parts  of  an  object’s 
projection,  or  important  relationships  between  the  parts, 
are  poorly  represented  in  the  image.  The  network  con¬ 
tains  an  idealised  description  of  the  object  projections, 
with  ideal  parts  and  relations  between  these  parts.  If 
the  actual  representation  of  a  part  in  the  image  is  poor, 
then  a  simple,  step-by-step,  recursive  extension  could  be 
difficult  and  the  extension  process  must  be  made  more 
adaptable.  The  matching  of  a  house  image  in  Fig.  4(a) 
provides  an  example  of  this.  (See  Fig.  7  for  the  digital 
image.) Given the network in Figs. 2(a) and 4(b), a
2D-HOUSE model is made up of two parts, PENTAGON
and PARALLELOGRAM, which are in turn made up
of the simpler models, U-SHAPE, C-SHAPE and CORNER.
The extension of the U-SHAPE match in Fig. 4(a)
to  a  match  of  its  direct  successor,  PARALLELOGRAM, 
via  combination  with  CORNER  is  not  possible  since  the 
image  frame  excludes  the  part  of  the  projection  associ¬ 
ated  with  the  desired  CORNER  match.  Without  the 
evidence  associated  with  this  CORNER  match,  there  is 
no  way  to  know  whether  the  U-SHAPE  match  should  be 
interpreted  as  part  of  a  PARALLELOGRAM  or  PEN¬ 
TAGON  match,  both  being  successors  of  U-SHAPE  (see 
Fig.  4b).  However,  a  viable,  unambiguous  extension  to 
the  indirect  successor  2D-HOUSE  is  possible  by  combin¬ 
ing  the  U-SHAPE  match  with  the  available  C-SHAPE 
match  shown  in  Fig.  4(a).  The  importance  of  indi¬ 
rect  predecessor  extension  for  matching  can  be  appre¬ 
ciated  in  the  application  of  the  recognition  system  to 
cluttered  and  corrupted  images  as  demonstrated  later  in 
this  paper.  To  support  indirect  predecessor  extension, 
our  system  is  designed  to  index  2D  models  given  pairs 
of  matches  to  their  fragments,  that  is,  their  indirect  pre¬ 
decessors.  During  the  compilation  phase  of  the  system, 
2D  models  and  potentially  useful  pairs  of  their  indirect 
predecessors are hashed into an extension table for rapid
indexing  during  the  recognition  phase.  In  addition,  to 
facilitate  the  extension  operation,  the  compilation  phase 
process  also  pre-computes  and  stores  the  model  line  seg¬ 
ment  mappings  between  the  2D  models  and  useful  indi¬ 
rect  predecessors. 
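A compile-time extension table of this kind might be built as sketched below. The decomposition of PENTAGON into U-SHAPE and C-SHAPE is an assumption made only for the example (the text states just that PENTAGON is a successor of U-SHAPE), every pair of predecessors is hashed rather than only the "potentially useful" ones, and the names `PREDECESSORS`, `ancestors` and `build_extension_table` are hypothetical.

```python
from collections import defaultdict
from itertools import combinations

# Direct predecessors of each 2D model. The PENTAGON entry is assumed.
PREDECESSORS = {
    "PARALLELOGRAM": ["U-SHAPE", "CORNER"],
    "PENTAGON": ["U-SHAPE", "C-SHAPE"],
    "2D-HOUSE": ["PENTAGON", "PARALLELOGRAM"],
}

def ancestors(model, preds):
    """All direct and indirect predecessors of a model."""
    out = set()
    for p in preds.get(model, []):
        out |= {p} | ancestors(p, preds)
    return out

def build_extension_table(preds):
    """Hash every pair of (in)direct predecessors to the models they index."""
    table = defaultdict(set)
    for model in preds:
        for a, b in combinations(sorted(ancestors(model, preds)), 2):
            table[frozenset((a, b))].add(model)
    return table

table = build_extension_table(PREDECESSORS)
print(sorted(table[frozenset(("U-SHAPE", "C-SHAPE"))]))
```

At recognition time a pair of matches is looked up by the names of their models; the returned candidates would still have to pass the geometric merge before being accepted as an extension.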

4.2  The  evaluation  of  matches  to  2D  model 
nodes 

Once a 2D match has been generated, its priority for
extension or 3D verification needs to be determined for
effective control. An important factor in assessing this
priority is the probability that the match is correct, which is
estimated in two steps. First, all of the metric and structural
features represented in the 2D model are measured;
then, the values of the measured features and the conditional
density functions stored in the models are used to
estimate the posterior probability of the match.

The metric and structural features used in the match
evaluation are measured with respect to an approximate
reconstruction of the object’s projection consistent with
the match. Fig. 5 shows an example of this. The structural
features in the 2D model specify how the matched
line segments should be connected together. The
reconstruction enforces the specified segment connectivity,
while minimising the error between the detected fragments
and reconstructed lines [6]. Since the errors reflect
the degree of actual connectivity of the detected
line segments, they are used as measurements (indications)
of the structural features for match evaluation.
The metric features, such as line segment length ratio s
or relative orientation a, are measured with respect to
the reconstructed lines.

Figure 5: An example of line segment reconstruction during
2D model matching. Bold lines: line segments detected in
an image of a tall prism. Dashed lines: reconstruction of the
projection given a 2D-PRISM match.

A match is an assignment of 2D model lines to segments
in the image, and this assignment is one possible
interpretation of the image segments among many.
These alternative interpretations are classes, and the
problem of estimating the probability that a given match
is correct can be treated as a problem in Bayesian
classification:

$$P(\omega_1 \mid f) = \frac{P(f \mid \omega_1)\, P(\omega_1)}{\sum_j P(f \mid \omega_j)\, P(\omega_j)}$$

where $\omega_1$ is the class associated with the 2D model match
being evaluated, the $\omega_j$, $j > 1$, are alternative
interpretation classes, and $f$ is the vector of measured features.
This formulation is useful and straightforward; however,
it requires the determination of the alternative
interpretation classes $\omega_j$, $j > 1$, and, for all classes $\omega_j$, $j \geq 1$, the
class priors $P(\omega_j)$ and the conditional density functions
$P(f \mid \omega_j)$.

For a given match, there can be a very large number of
alternative interpretations of the same set of image line
segments; however, these interpretations can be usefully
organised into a small number of classes. For each
alternative class, the actual match being evaluated is
considered incorrect, and in our formulation, each alternative
interpretation class is associated with a different type
of matching mistake that could have produced the
incorrect match. The match being evaluated is generated
by extending a pair of other matches, say M1 and M2,
and this implies the following classes of alternative
interpretations (matching mistakes): (1) M1 and M2 are
both correct but the given extension is not (i.e., some
other extension is); (2) M1 and M2 are correct but they
cannot be combined into any valid extension; (3) M1 is
itself incorrect; (4) M2 is incorrect; and (5) both M1 and
M2 are incorrect. These distinct classes have different
priors and class conditional densities for the features.

The priors assigned to each class are derived from view
analysis and estimates of match error rates. Assuming
a uniform distribution for views, the prior probabilities
for the given 2D match and classes of type (1) are a
function of the visibility of the different 2D models; i.e.,
the fraction of views for which they are valid. The
alternative interpretation classes associated with the other
matching mistakes (2-5) are assigned priors that reflect
the expected rate at which these mistakes occur. These
error rates have not been rigorously estimated; however,
this does not seem to have caused serious problems in
the experiments.

Estimating the probability that a 2D match is correct
also requires the conditional density functions for
the features given each of the interpretation classes
defined above. For all the classes considered, the features
are assumed to be independent. As discussed in Section
3, the density functions for the metric features given a
correct match to the 2D model are represented with the
model (see Fig. 2b). The structural features are measured
in terms of the reconstruction fit errors. Given
that the 2D model match is correct, the error in the
fit is strictly a function of the inaccuracies of the
detected image segments, and the associated detection
error density function is assumed to be a Gaussian with
zero mean. When the match is incorrect, structural features
(fit errors) are largely due to the incorrect model
line assignments, and the associated assignment error
density function is a much wider Gaussian. For some of
the alternative interpretation classes defined above, not
all of the match is considered incorrect, and thus not all
of the associated reconstruction fit errors are modelled
as assignment errors. For example, in class (3), the part
of the match extended from M1 is considered incorrect
but not the part from M2. The probability densities of
the fit errors for these two different parts of the match
thus reflect assignment and detection errors respectively.
A more complete treatment can be found in [6].
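The posterior computation can be illustrated with a small sketch that uses only structural fit errors. Every number in it is made up: the detection and assignment standard deviations, the priors, and the reduction to four classes (match correct, M1 incorrect, M2 incorrect, both incorrect) are assumptions for the example; metric features and the two extension-error classes are left out.

```python
import math

def gaussian(x, sigma):
    """Zero-mean Gaussian density."""
    return math.exp(-0.5 * (x / sigma) ** 2) / (sigma * math.sqrt(2 * math.pi))

def fit_likelihood(errors, sigmas):
    """Features are assumed independent, so the densities multiply."""
    return math.prod(gaussian(e, s) for e, s in zip(errors, sigmas))

def posterior_correct(priors, likelihoods):
    """Bayes rule; index 0 is the class 'the match is correct'."""
    joint = [p * l for p, l in zip(priors, likelihoods)]
    return joint[0] / sum(joint)

DETECTION, ASSIGNMENT = 1.0, 10.0   # assumed sigmas (narrow vs. much wider)
PRIORS = [0.4, 0.2, 0.2, 0.2]       # assumed class priors
# Per class: sigma for the fit error of the M1 part and of the M2 part.
CLASS_SIGMAS = [
    (DETECTION, DETECTION),     # match correct
    (ASSIGNMENT, DETECTION),    # M1 incorrect
    (DETECTION, ASSIGNMENT),    # M2 incorrect
    (ASSIGNMENT, ASSIGNMENT),   # both incorrect
]

def evaluate(errors):
    likelihoods = [fit_likelihood(errors, s) for s in CLASS_SIGMAS]
    return posterior_correct(PRIORS, likelihoods)

good = evaluate([0.5, 0.5])   # small fit errors on both parts
poor = evaluate([8.0, 8.0])   # large fit errors on both parts
```

Small fit errors are far more likely under the narrow detection-error density, so `good` comes out close to 1, while large errors are explained almost entirely by the wide assignment-error density and `poor` is close to 0.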


4.3  Control:  Selecting  2D  matches  for 
extension  and  verification 

As  discussed  above,  the  overall  form  of  the  matching 
system  is  that  of  a  search  for  correct  3D  matches  by 
iterative  2D  match  extension  and  verification.  Therefore, 
the  system’s  behavior  is  controlled  by  the  selection  of 
matches  to  extend  or  verify  during  each  iteration.  While 
verification  is  the  transformation  of  a  single  2D  match 
into  its  3D  counterpart,  match  extension  involves  the 
combination  of  pairs  of  matches.  It  is  computationally 
prohibitive  to  select  them  by  evaluating  and  assigning  a 
priority  to  every  pair  of  existing  matches.  Thus,  for  each 
iteration,  the  selection  proceeds  in  two  steps:  (1)  select 
a  set  of  individual  matches,  each  with  a  high  likelihood 
of  leading  to  the  correct  interpretation,  and  (2)  for  each 
selected  match  that  is  to  be  extended,  retrieve  partner 
matches  for  combination. 

Given  the  above  process,  it  is  clear  that  not  all  of  the 
possible  combinations  of  a  match  are  attempted  when 
it  is  selected  for  extension.  Even  if  the  combinations 
tried  seem  the  most  promising,  the  correct  one  may  have 
been  excluded.  It  is  thus  desirable  to  be  able  to  re-select 
a  match  in  another  iteration,  and  retrieve  a  new  set  of 
partners  to  combine  with  it. 


Figure 6: Different pairs of CORNER matches demonstrating
three perceptual organisation factors important for
assessing the potential usefulness of a match pair as a
combination. For each example, line segments associated with one
match are shown in black, those with the other, in gray, and
those with both, in alternating black/gray. The factors
exemplified are (a) connectedness (segment sharing), (b) good
continuity (collinearity), and (c) proximity.

4.3.1  Selecting individual matches to extend or
verify

The  set  of  individual  matches  selected  during  each  it¬ 
eration  should  satisfy  some  combination  of  the  following 
two  criteria:  the  matches  must  be  promising  candidates 
for  extension  or  verification,  and  the  selected  set  must 
be  distributed  about  the  image  in  a  way  that  is  advan¬ 
tageous  for  recursive  matching  to  view  description  net¬ 
works. 

The  first  factor,  selecting  candidates  for  extension  or 
verification  with  high  likelihoods  of  leading  to  the  correct 
3D  interpretation,  is  clearly  a  function  of  the  probabil¬ 
ity  that  the  2D  match  itself  is  correct.  However,  for 
those  matches  selected  for  extension,  this  likelihood  is 
also  a  function  of  whether  or  not  the  match  can  be  suc¬ 
cessfully  combined  with  another  match.  From  the  above 
discussion,  it  is  clear  that  a  match  may  have  already 
been  selected  in  an  earlier  iteration,  and  thus  some  of 
the  most  promising  combinations  with  it  will  have  been 
attempted  already.  It  seems  reasonable  to  assume  that 
the  chance  that  the  correct  combination  has  yet  to  be 
generated  goes  down  each  time  a  match  is  re-selected  for 
extension. The priority assigned to a match is a product
of the probability that it is correct and the probability
that it can yet be successfully combined.
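As a sketch of this product, the second factor can be modelled as shrinking geometrically with the number of times the match has already been selected. The geometric form and the `decay` value are assumptions; the text states only that the chance goes down with each re-selection.

```python
def priority(p_correct, times_selected, decay=0.5):
    """P(match is correct) * P(correct combination has yet to be tried)."""
    return p_correct * decay ** times_selected

# (probability correct, times already selected) for three made-up matches.
matches = {"m1": (0.9, 3), "m2": (0.6, 0), "m3": (0.8, 1)}
ranked = sorted(matches, key=lambda k: priority(*matches[k]), reverse=True)
print(ranked)
```

Under this model a fresh match with a moderate probability can outrank a more probable match whose most promising combinations have already been attempted.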

The  second  factor,  matching  an  image  to  an  object 
represented  by  a  view  description  network,  is  best  sat¬ 
isfied  if  the  match  extension  activity  is  distributed  over 
the  projection  of  the  scene  object.  In  this  way,  matches 
to  different  parts  of  the  projection  will  have  a  greater 
chance  of  being  available  for  important  combinations  at 
roughly  the  same  time.  The  following  is  the  method  used 
for  distributing  the  selected  matches:  (a)  initialize  the 
matching  system  by  generating  primitive  2D  model  node 
matches  in  different  portions  of  the  image;  and  (b)  for 
subsequent  iterations,  select  the  highest  priority  match 
in  the  neighborhood  of  each  of  the  matches  last  extended. 
In  our  system,  the  neighborhood  of  a  match  includes  itself 
and  all  matches  related  to  it  through  extension. 

4.3.2  Selecting  match  combination  partners 

Once a match is selected for extension, the system
searches for other matches that provide promising
combinations. Each time a match is re-selected for extension,
a new set is sought for combination, in order of
most promising sets first. A pair of matches is a promising
combination if the resulting extension to a new 2D
match has a high probability of being correct. In a cluttered
image, it is easy to select a pair of matches whose
individual probabilities are high, but whose combination
has a low probability of being correct; thus, it is important
to be able to rank candidate pairs based on how they
combine. An important set of heuristics for ranking
combinations in order of their probability of being correct
can be found in psychological studies of perceptual
organization. Generally, image features exhibit compelling
perceptual organization if they appear to the viewer as
parts of the same object.

The  three  perceptual  organization  factors  exemplified 
in  Fig.  6,  connectedness,  good  continuity,  and  proximity, 
are  important  for  assessing  match  combinations  and  have 
been  incorporated  into  our  recognition  system.  Connect¬ 
edness  is  exhibited  when  two  matches  share  the  same  im¬ 
age  line  segment.  If  these  two  matches  are  correct,  there 
is  a  high  probability  that  they  are  of  the  same  object’s 
projection.  Two  matches  exhibit  good  continuity  if  a  pair 
of  image  line  segments,  one  from  each  match,  are  close 
and  approximately  collinear.  Given  good  continuity,  the 
visual  organization  is  very  compelling,  though  not  with 
the  strength  of  connectedness.  Finally,  two  matches  are 
considered  proximate  if  their  associated  image  line  seg¬ 
ments  are  close.  Proximate  matches  do  appear  more 
likely  to  be  parts  of  the  same  object  than  more  distant 
pairs,  but  this  factor  is  clearly  not  as  compelling  as  the 
others. 

The  combinations  of  a  given  match  are  generated  in 
order  of  the  compellingness  of  the  grouping  heuristic  in¬ 
volved:  all  combinations  exhibiting  connectedness  are  of 
highest  priority,  then  those  with  good  continuity,  and 
for  subsequent  passes,  the  rest  are  ordered  by  proxim¬ 
ity.  Given  this  scheme,  the  extensions  of  a  match  are 
attempted in roughly best-first order. In addition, the
retrieval of the desired match combinations, given each
of the factors, can be made efficient by suitable organization
of the image and match databases, as has been implemented
in our system.
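The three-tier ordering described above can be sketched as follows. This is only an illustrative sketch, not the system's implementation: the `Segment` and `Match` types, the thresholds `near` and `max_angle`, and the use of endpoint proximity plus a small direction difference as a crude stand-in for "close and approximately collinear" are all assumptions made for the example.

```python
import math
from dataclasses import dataclass

# Hypothetical data types; the paper does not specify its data structures.
@dataclass(frozen=True)
class Segment:
    x1: float
    y1: float
    x2: float
    y2: float

@dataclass(frozen=True)
class Match:
    name: str
    segments: tuple  # image line segments assigned to this match

def endpoint_gap(a, b):
    """Smallest distance between an endpoint of a and an endpoint of b."""
    pa = [(a.x1, a.y1), (a.x2, a.y2)]
    pb = [(b.x1, b.y1), (b.x2, b.y2)]
    return min(math.dist(p, q) for p in pa for q in pb)

def angle_between(a, b):
    """Unsigned angle between the two segment directions, in [0, pi/2]."""
    ta = math.atan2(a.y2 - a.y1, a.x2 - a.x1)
    tb = math.atan2(b.y2 - b.y1, b.x2 - b.x1)
    d = abs(ta - tb) % math.pi
    return min(d, math.pi - d)

def grouping_rank(m1, m2, near=5.0, max_angle=math.radians(10)):
    """Return (tier, gap); lower tuples are more compelling.

    tier 0: connectedness (a shared image segment)
    tier 1: good continuity (assumed test: close endpoints, similar direction)
    tier 2: proximity only, ordered by the smallest endpoint gap
    """
    if set(m1.segments) & set(m2.segments):
        return (0, 0.0)
    pairs = [(a, b) for a in m1.segments for b in m2.segments]
    gap = min(endpoint_gap(a, b) for a, b in pairs)
    if any(endpoint_gap(a, b) <= near and angle_between(a, b) <= max_angle
           for a, b in pairs):
        return (1, gap)
    return (2, gap)

def ordered_partners(match, candidates):
    """Candidate combination partners of `match`, most compelling first."""
    others = [c for c in candidates if c != match]
    return sorted(others, key=lambda c: grouping_rank(match, c))
```

Sorting on the `(tier, gap)` tuple gives the roughly best-first order: every connected partner precedes every continuity partner, and the remainder are ordered by distance.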

5  Matching  experiments 

Our  matching  system  has  been  applied  to  the  recognition 
of  objects  in  real,  digital  images.  For  each  image,  the 
same  initialization  procedure  was  followed.  First,  lines 
were  detected  [3]  and  filtered  by  length  (>  10  pixels)  and 
intensity  contrast  (>  5  gray  levels).  Next,  all  matches  to 
CORNER  with  low  gap  error  (<  12.5%  of  line  length) 
and  low  overshoot  of  the  intersection  (<  3.5%)  were 
found.  Finally,  redundant  CORNER  matches  due  to  re¬ 
dundantly  represented  object  edges  were  removed  by  the 
following  filter:  if  a  line  segment  is  matched  to  CORNER 
multiple  times,  and  the  other  segments  in  these  multiple 
CORNER  matches  are  next  to  and  parallel  with  each 
other,  then  select  only  the  best  one. 
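The first step of this initialization, filtering detected lines by length and contrast, can be sketched as below. The dictionary layout of a segment (`p1`, `p2`, `contrast`) is a hypothetical representation chosen for the example; only the two thresholds come from the text.

```python
import math

def filter_segments(segments, min_length=10.0, min_contrast=5.0):
    """Keep segments longer than min_length pixels whose intensity
    contrast exceeds min_contrast gray levels."""
    kept = []
    for seg in segments:
        (x1, y1), (x2, y2) = seg["p1"], seg["p2"]
        length = math.hypot(x2 - x1, y2 - y1)
        if length > min_length and seg["contrast"] > min_contrast:
            kept.append(seg)
    return kept
```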

The  first  three  images  in  Fig.  7  are  of  scenes  contain¬ 
ing  multiple  objects.  These  were  matched  to  a  network 
Fig. 7 Images of scenes containing multiple objects and an outdoor scene. Images: (a) separated-objects, (b) top-scene,
(c) side-scene and (d) outdoor-scene.


Fig. 8 The resulting correct 3D matches for the images. The line segment assignments of each match were generated by
the network matching process and are indicated by the thick black lines. The thick gray lines are projections of the object from
the 3D pose estimated by the algorithm of Kumar, given the segment assignments shown.


[Figure panels: three panels, each titled "3D match to SHORTPRISM"]

Fig. 9 The incorrect 3D matches generated by the system and rejected by the verifier. All of the incorrect 3D matches
are from the image side-scene; the matches are represented as in the previous figure.


Statistics                      Separate-objects  Top-scene         Side-scene        Outdoor-scene     Average
object library size             3                 3                 3                 4                 3.25
image segments                  63                77                142               41                80.8
3D matches possible             2.5 x 10^9        7.1 x 10^9        1.6 x 10^11       3.6 x 10^8        4.3 x 10^10
3D matches generated            3                 3                 9                 1                 4.0
3D matches per correct match    1                 1                 3                 1                 1.6
number of match iterations      4                 4                 5                 4                 4.25
ave height of network           4.7               4.7               4.7               4.8               4.73
iterations per network level    .86               .86               1.07              .83               .91
3D and 2D matches generated     230               264               379               137               252.5
ave # line segs per match       2.6               2.5               2.3               1.9               2.37

Table 1 Statistics for matching experiments. Each column reports the statistics for a different image, except for the last
column which is the average over all runs. The statistics are explained in the discussion section.
for the three objects in the scenes; it was identical to the
one shown in Fig. 2, without the house view descriptions.
The system was also applied to the last image in Fig. 7,
of an outdoor house scene. The system searched for
matches between this image and all four objects shown
in Fig. 7, using the view description network shown in
Fig. 2. All of the correct 3D matches were found for
each image; Fig. 8 shows the original hypothesized 3D
matches as thick black lines and the resulting projection
given the estimated 3D pose in thick gray lines. The
system also hypothesized and attempted to verify some
incorrect 3D matches, shown in Fig. 9. Note that there
were no incorrect 3D matches generated for the images
separated-objects, top-scene and outdoor-scene, and only
six for side-scene. The first four of the incorrect matches
were due to false line-segment junctions and accidental
parallels in projection (or shadows), which produced
erroneous small 2D matches of high probability. In spite
of such complications in the image, the matching system
as a whole seems effective.

Table  1  shows  some  useful  statistics  for  each  matching 
trial  and  the  average  across  all  of  them  (last  column). 
For  the  images  and  objects  studied  here,  a  confident  3D 
match  typically  requires  the  assignment  of  five  model 
line  segments,  making  the  possible  3D  matches  per  image 
number in the tens of billions. In spite of this, the number
of 3D matches actually hypothesized is very small,
averaging 1.6 per correct 3D match. Another illuminating
statistic is the number of iterations that the system
runs before finding all of the correct matches. Including
the initialization and final verification steps, the average
number of iterations is 4.25. In relation to the average
height  of  the  view  description  network,  this  is  a  good 
result.  The  average  height  of  the  network  roughly  rep¬ 
resents  the  number  of  steps  in  a  network-directed  con¬ 
struction  of  the  3D  match,  starting  from  an  unmatched 
set  of  image  line  segments.  For  the  network  used  in  this 
study,  the  average  height  is  4.7,  which  is  higher  than 
the  actual  average  number  of  iterations  used  by  the  sys¬ 
tem  to  generate  the  correct  3D  matches.  In  part,  this 
reflects  the  fact  that  the  system  sometimes  performed 
match  extensions  to  indirect  successors  in  the  network 
and  thus  avoided  some  of  the  construction  steps.  It  also 
reflects  the  utility  of  the  match  combination  approach 
in  general.  As  argued  in  Section  2,  the  evidence  from 
matches  to  multiple  parts  and  the  convergent  structure 
of  the  description  network  are  used  together  to  provide 
a  potentially  focused  search  for  the  correct  interpreta¬ 
tion.  The  results  presented  here  help  demonstrate  this 
potential. 

The total number of 3D and 2D matches generated
also seems reasonable. This is especially true when
one considers the number of line segments in the image
and the average size of the matches (number of assigned
model lines). In the trials presented here, matches
to the two simplest 2D models, LINE-SEGMENT and
CORNER, make up the majority of the total, and another
large portion of the total is made up of matches to
2D models that are almost as simple: the three-segment
matches to U-SHAPE, C-SHAPE and TRIANGLE. For
the images tested, the system consistently converges on
the correct interpretation without generating many 2D
or 3D matches of size greater than three segments, even
though there can be many initial matches to the smaller
2D models.


In our approach to recognition, the search for matches
to a multiple 3D object library is optimized by matching
view description networks that are automatically compiled
from the library. The contribution of the research
described in this paper is the development of effective
strategies for the network matching process. The experiments
show that a recognition system based on view description
networks is capable of finding the correct matches
to 3D objects in complex images with a potentially high
level of efficiency.
Abstract

Projective  invariants  are  shape  descriptors 
which  are  independent  of  the  point  of  view  from 
which  the  shape  is  seen,  and  therefore  they 
are  of  major  importance  in  object  recognition. 
They  make  it  possible  to  match  an  image  of  an 
object  to  one  stored  in  a  data  base  without  the 
need  to  search  for  the  correct  viewpoint.  In  this 
paper  we  obtain  an  invariant  representation 
(“signature”)  of  a  general  curve.  The  calcula¬ 
tion  is  local  and  does  not  suffer  from  the  occlu¬ 
sion  problem  of  global  descriptors.  The  affine 
transformation  subset  is  also  treated.  To  make 
the  method  robust,  we  have  developed  differ¬ 
entiation  techniques  which  give  much  more  re¬ 
liable  results  than  previous  ones.  These  dif¬ 
ferentiation  methods  are  useful  in  many  other 
applications  as  well. 

1  Introduction 

A  major  problem  in  object  recognition  is  the  fact  that 
the  same  shape  can  be  seen  from  different  points  of  view, 
resulting  in  different  images.  In  order  to  compare  a  given 
image  to  one  that  is  stored  in  a  library  of  images,  existing 
methods  had  to  search  in  a  multidimensional  parameter 
space  to  find  the  appropriate  “pose”,  or  point  of  view. 
Invariants  of  shapes,  being  independent  of  the  point  of 
view,  free  us  from  this  search  and  allow  direct  matching 
between  the  observed  and  the  stored  images. 

Projective  invariants  were  a  very  active  mathematical 
subject  in  the  latter  half  of  the  19th  century.  However, 
in  vision  only  one  projective  invariant,  the  cross  ratio  of 
four  points  on  a  line  [7],  was  used  until  recently. 

Projective  invariants  of  curves  and  surfaces  were  first 
introduced  in  vision  by  this  author  [13].  The  two  main 
kinds  of  invariants,  the  algebraic  and  differential  invari¬ 
ants,  were  reviewed  in  that  paper,  which  pointed  out 
their  usefulness  for  object  recognition.  Algebraic  invari¬ 
ants  were  then  successfully  applied  to  industrial  objects 
in  [8].  Recognition  of  occluded  surfaces  using  invariants 
was  treated  in  [3],  and  semi-differential  invariants  were 
used  in  [12]  and  [1].  Other  papers  are  listed  in  the  ref¬ 
erences. 

*This work was supported in part by ONR Grant N00014-91-J-1222.


Algebraic invariants are well suited to use with algebraic
shapes, i.e. shapes that can be expressed as a 2-D
polynomial f(x, y) = 0, e.g. conics. The polynomial
coefficients  are  obtained  by  fitting  the  appropriate  poly¬ 
nomial  to  the  whole  visible  shape  and  calculating  the 
invariants  from  the  polynomial  coefficients.  These  in¬ 
variants  have  several  problems:  first,  most  shapes  are 
not  algebraic,  making  it  hard  to  fit  simple  polynomials  to 
them.  Second,  the  algebraic  method  is  global,  requiring 
dealing  with  whole  shapes.  Like  any  global  descriptors, 
they  are  vulnerable  to  occlusion  problems. 

Differential invariants overcome these problems because
they are local, i.e. invariants are found for each
point on the curve. Furthermore, the curve can be quite
general.

In  the  Euclidean  case  it  is  common  to  plot  the  curva¬ 
ture  against  the  arclength,  both  Euclidean  invariants,  to 
obtain  a  “signature”,  or  an  invariant  curve.  Given  this 
signature,  we  can  reconstruct  the  original  curve  up  to 
a  Euclidean  transformation.  Similar  signatures  can  be 
obtained  for  both  affine  and  projective  transformations, 
which  are  our  interest  here.  This  is  because  of  a  general 
completeness  property. 
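As an illustration of the Euclidean case, the sketch below samples a (arclength, curvature) signature for an ellipse and for a rotated copy of it. The ellipse, the sampling, and the trapezoid-rule arclength are assumptions made for the example; the curvature formula is the standard one for a parametrized plane curve.

```python
import math

def curvature(dx, dy, ddx, ddy):
    """Curvature of a parametrized plane curve from its first two derivatives."""
    return (dx * ddy - dy * ddx) / (dx * dx + dy * dy) ** 1.5

def rotate(vx, vy, angle):
    c, s = math.cos(angle), math.sin(angle)
    return c * vx - s * vy, s * vx + c * vy

def ellipse_derivs(t, a=3.0, b=1.0):
    """First and second derivatives of (a cos t, b sin t)."""
    return (-a * math.sin(t), b * math.cos(t)), (-a * math.cos(t), -b * math.sin(t))

def signature(ts, angle=0.0):
    """(arclength, curvature) samples of the ellipse rotated by `angle`.

    Translation does not affect derivatives, so only rotation is modelled.
    """
    samples = []
    s = 0.0
    prev_t, prev_speed = ts[0], None
    for t in ts:
        (dx, dy), (ddx, ddy) = ellipse_derivs(t)
        dx, dy = rotate(dx, dy, angle)
        ddx, ddy = rotate(ddx, ddy, angle)
        speed = math.hypot(dx, dy)
        if prev_speed is not None:
            s += 0.5 * (speed + prev_speed) * (t - prev_t)  # trapezoid rule
        samples.append((s, curvature(dx, dy, ddx, ddy)))
        prev_t, prev_speed = t, speed
    return samples
```

The two signatures agree to rounding error, which is the sense in which the signature is independent of the Euclidean pose.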

The  completeness  property  of  differential  invariants 
can  be  described  as  follows.  Given  a  plane  curve  and  a 
transformation  group,  there  are  two  independent  invari¬ 
ants  of  the  transformations  at  each  point  of  the  curve. 
These invariant functions contain all the information
about  the  curve,  except  for  the  transformation  to  which 
they  are  invariant.  Accordingly,  given  two  invariants  for 
each  curve  point,  we  can  determine  the  original  curve  up 
to  a  transformation  belonging  to  the  group. 

More  accurately,  the  following  theorem  holds  [9, 
p.  144]: 

Theorem.  All  differential  invariants  of  a  (transitive) 
transformation  in  the  plane  are  functions  of  two  invari¬ 
ants  of  the  lowest  order  and  their  derivatives. 

Thus,  given  a  curve,  one  can  find  a  corresponding  in¬ 
variant  curve,  its  signature,  that  describes  it  uniquely, 
except  for  the  relevant  transformation. 

In  this  paper  we  present  a  method  of  object  recogni¬ 
tion  based  on  the  above  property.  The  transformations 
we  are  interested  in  are  the  affine  and  projective  ones, 
because  they  are  induced  by  changing  the  point  of  view. 
At each point of the given curve we calculate two invariants,
$I_1, I_2$. We plot these numbers as a point in
an “invariant plane” whose coordinates represent invariants.
In effect we plot one invariant against the other.
In this way the given curve maps into an invariant signature
curve in the invariant plane. The signature uniquely
identifies the curve regardless of the point of view.

It  is  useful  to  know  the  amount  of  information  that 
needs  to  be  obtained  from  the  image.  To  find  invariants, 
we  have  to  eliminate  the  information  in  the  image  which 
is  specific  to  the  coordinate  system.  For  example,  given 
a  pencil  that  can  move  or  rotate  on  a  table,  the  posi¬ 
tion  of  the  pencil  and  its  orientation  are  not  invariant 
but  its  length  is  a  Euclidean  invariant.  Given  the  coor¬ 
dinates,  say  of  the  ends  of  the  pencil,  we  can  eliminate 
the  position  and  orientation  and  calculate  the  distance. 
Thus  from  the  four  measured  coordinates  we  have  elim¬ 
inated  the  three  Euclidean  transformation  parameters 
and  found  one  invariant. 

Similar arguments apply for other transformations. In
the projective case, we want to eliminate eight parameters
of the transformation, so the number of coefficients
to be obtained from the image should exceed eight. It
is easy to apply this to algebraic shapes. For example,
one conic does not contain enough coefficients but two
conics do, with ten independent ones, and so do a cubic
(nine coefficients) and a quartic (14).
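The counts quoted above follow from a simple formula: a plane algebraic curve of degree d has (d+1)(d+2)/2 monomial coefficients, one of which is absorbed by the arbitrary overall scale of f. A short check, with function and constant names chosen only for this example:

```python
def independent_coefficients(degree):
    """Independent coefficients of a plane algebraic curve f(x, y) = 0."""
    monomials = (degree + 1) * (degree + 2) // 2  # x^i y^j with i + j <= degree
    return monomials - 1  # the overall scale of f is irrelevant

PROJECTIVE_PARAMETERS = 8  # parameters of a plane projective transformation

def has_projective_invariants(total_coefficients):
    """True when more coefficients are available than parameters to eliminate."""
    return total_coefficients > PROJECTIVE_PARAMETERS
```

For instance, `independent_coefficients(2)` gives 5 for a single conic (not enough), while a pair of conics gives 10, a cubic 9 and a quartic 14.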

In the differential case the same arguments apply.
However, in traditional methods of obtaining invariants
[16], a complication arises because the curve is
represented parametrically as x(t), y(t), with an arbitrary
curve parameter t. Wilczynski’s method requires
the eighth derivative (of both x(t), y(t)) to find two
invariants to both the projection and the change of
parametrization, for a total of 18 quantities. If we disregard
the parameter problem, we need only the fourth
derivatives of x, y, namely only ten numbers. Thus, if
we could overcome the parameter problem, the number
of data quantities needed would be no more than required
by an algebraic method.

The  parameter  problem  can  be  avoided  from  the  out¬ 
set  because  the  parameter  is  not  part  of  the  geometry  of 
the  curve;  the  coordinates  x,y  of  each  point  are  sufficient 
to  determine  the  curve.  The  parameter  is  an  arbitrary 
function  introduced  for  convenience,  so  it  is  clear  that 
one  can  improve  on  these  methods.  In  this  paper  we 
present  an  approach  which  avoids  the  parametrization 
problem  and  enables  us  to  find  two  local  invariants  with 
only the ten quantities that are needed from a purely
geometrical  point  of  view.  In  this  way  we  combine  the 
advantage  of  the  differential  method,  its  locality,  with 
that  of  the  algebraic  method,  which  does  not  need  a 
curve  parameter. 

Of  course  the  fourth  derivative,  or  similar  quantities 
needed  in  our  new  invariant  method,  are  still  high  by 
computer  vision  standards,  and  common  methods  com¬ 
pletely  fail  to  obtain  them.  We  have  found  a  smoothing 
and  differentiation  method  that  gives  good  results  for 
high  derivatives.  In  our  experiments  the  error  in  the 
derivatives and in the invariants was no more than the
error  in  the  data.  Obviously  the  method  has  applications 
beyond  invariants. 

In the following sections we review the previous methods
for deriving projective and affine differential invariants,
then we describe our new methods for finding invariants
and for obtaining derivatives, and finally experiments
are presented.

2  Wilczynski’s  method 

In  this  section  we  describe  the  method  developed  by 
Wilczynski  [1906]  who  obtained  closed  form  formulas 
for  a  complete  set  of  differential  projective  invariants  of 
curves. 

A projective transformation in the plane can be written
in Cartesian coordinates as

$$\begin{pmatrix} \tilde{x} \\ \tilde{y} \\ 1 \end{pmatrix} = \frac{1}{x T_{31} + y T_{32} + T_{33}} \; T \begin{pmatrix} x \\ y \\ 1 \end{pmatrix}$$

where T is a constant matrix. The factor multiplying T
in this equation contains the coordinates, so the transformation
is non-linear and also leads to infinities when
the denominator vanishes. To avoid dealing with these problems it is
common to generalize the treatment by working in homogeneous
coordinates $x = (x_1, x_2, x_3)^t$ and write the
transformation as

$$\tilde{x} = \lambda(x)\, T x$$

with $\lambda(x)$ being an arbitrary factor, that can be different
at each point x. This can be generalized to curves in
n-dimensional homogeneous space.

A curve in n-D can be written parametrically as x(t).
We want to find quantities at each point of the curve
which are independent of both the coordinate system
and the parameter t. To find invariants, one can proceed
in stages: (i) find invariants to the linear part T of the
transformation above, (ii) from these derive invariants
to multiplication by $\lambda$, (iii) from the latter construct
invariants to the curve parameter t. We will proceed to
describe each stage.

The  basic  method  of  obtaining  invariants  in  this  ap¬ 
proach  is  by  using  derivatives.  The  advantage  of  differ¬ 
entiation  is  that  it  eliminates  constants,  which  are  often 
associated  with  the  coordinate  system.  For  instance,  a 
straight line can be written as y = ax + b, with the constants
a, b both depending on the coordinates. However,
the  equation  y"  =  0  represents  straight  lines  (and  only 
them)  invariantly  of  any  coordinate  system.  The  more 
general  case  is  more  involved  but  the  same  principle  ap¬ 
plies. 

Invariance to T can be obtained by taking the derivatives
of the curve x(t) up to order n and forming the set of linear
algebraic equations

$$x^{(n)} + \binom{n}{1} p_1 x^{(n-1)} + \binom{n}{2} p_2 x^{(n-2)} + \cdots + p_n x = 0 \qquad (1)$$

for the unknowns $p_1, \ldots, p_n$, at each point t. For a plane
curve (n = 3) we have

$$x''' + 3p_1 x'' + 3p_2 x' + p_3 x = 0$$

with the three unknowns $p_1, p_2, p_3$. It is easy to see that
multiplying x by a constant matrix T has no influence
on the above equation since the matrix factors out. Thus
the $p_i$ are invariant to the linear transformation T.
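The invariance of the coefficients to T can be checked numerically. The sketch below solves the plane-curve system x''' + 3p1 x'' + 3p2 x' + p3 x = 0 for (p1, p2, p3) by Cramer's rule, once for a sample curve and once for the same curve multiplied by a constant matrix. The curve x(t) = (t, e^t, 1), the matrix `T`, and all function names are assumptions chosen for illustration.

```python
import math

def det3(m):
    (a, b, c), (d, e, f), (g, h, i) = m
    return a * (e * i - f * h) - b * (d * i - f * g) + c * (d * h - e * g)

def solve3(m, rhs):
    """Solve the 3x3 linear system m p = rhs by Cramer's rule."""
    d = det3(m)
    sol = []
    for k in range(3):
        mk = [[rhs[r] if c == k else m[r][c] for c in range(3)] for r in range(3)]
        sol.append(det3(mk) / d)
    return sol

def matvec(T, v):
    return [sum(T[r][c] * v[c] for c in range(3)) for r in range(3)]

def coefficients(x, x1, x2, x3):
    """Solve x''' + 3 p1 x'' + 3 p2 x' + p3 x = 0 for [p1, p2, p3].

    x, x1, x2, x3 are the homogeneous curve point and its first three
    derivatives at one parameter value.
    """
    m = [[3 * x2[r], 3 * x1[r], x[r]] for r in range(3)]
    return solve3(m, [-x3[r] for r in range(3)])

def curve_derivatives(t):
    """Sample curve x(t) = (t, e^t, 1) with its first three derivatives."""
    e = math.exp(t)
    return [t, e, 1.0], [1.0, e, 0.0], [0.0, e, 0.0], [0.0, e, 0.0]

# An arbitrary non-singular matrix standing in for the transformation T.
T = [[2.0, 0.5, 1.0], [-0.3, 1.5, 0.2], [0.1, 0.4, 1.0]]

p_original = coefficients(*curve_derivatives(0.5))
p_transformed = coefficients(*[matvec(T, v) for v in curve_derivatives(0.5)])
```

Both solves return the same coefficients up to rounding, because T multiplies every column of the system and the right-hand side alike.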
Next, we have to deal with the transformation of multiplying
by $\lambda$, namely $\tilde{x} = \lambda(x) x$. We find quantities $P_i$,
functions of the $p_i$, that are invariant to this transformation.
Since the $P_i$ are not invariant to transforming the parameter
t, they are called semi-invariants. For a planar curve we
have [16, p. 58]

$$P_2 = p_2 - p_1^2 - p_1' \qquad (2)$$

$$P_3 = p_3 - 3p_1 p_2 + 2p_1^3 - p_1''$$

The final stage is finding invariants to the parameter t.
The transformation in question is $t \to \tilde{t}(t)$, to which we
can find relative invariants of weight w, namely quantities
that transform as

$$\tilde{\Theta}_w = (\tilde{t}'(t))^{-w}\, \Theta_w$$

These invariants are [16, p. 59]

$$\Theta_3 = P_3 - \tfrac{3}{2} P_2'$$

$$\Theta_8 = 6\Theta_3 \Theta_3'' - 7(\Theta_3')^2 - 27 P_2 \Theta_3^2$$

They are relative invariants of weights 3 and 8, respectively.

The  question  arises  of  how  many  independent  invari¬ 
ants  exist  for  a  curve  and  to  what  extent  one  can  re¬ 
construct  the  curve  given  the  invariants.  We  have  the 
completeness  theorem: 

Theorem. The invariants $\Theta_3, \Theta_8$ completely determine
a plane curve up to a projective transformation.

This is a special case of the more general property
mentioned in the introduction. From these, other invariants
can easily be derived. In particular

$$\Theta_{12} = 3\Theta_3 \Theta_8' - 8\Theta_8 \Theta_3'$$

Although the invariant set $\Theta_3, \Theta_8$ is complete, it cannot
be used as is to identify the curve because they are
relative invariants, i.e. the transformation contains the
unknown $(\tilde{t}'(t))^{-w}$. But we can derive from them absolute
invariants (w = 0). We can choose

$$I_1 = \frac{\Theta_8^3}{\Theta_3^8}, \qquad I_2 = \frac{\Theta_{12}}{\Theta_3^4}$$

We can now define an “invariant plane”, with coordinates
$I_1, I_2$. For each point on the given curve we can
calculate the two invariants and draw a point $(I_1, I_2)$ in
the invariant plane. In this way we obtain an invariant
curve, or signature, of the original curve.

Calculating the above invariants requires the eighth
derivative with respect to the parameter, and this poses a
hard problem from a practical point of view. The invariants
$\Theta_3, \Theta_8$ were implemented numerically, using simple
finite difference methods, in [2]. It was concluded that
in this simple implementation the above invariants are
quite unreliable and hard to use in practice.

The  method  of  invariants  can  be  made  reliable  in  sev¬ 
eral  ways,  of  which  two  are  treated  here: 

i)  Calculate  derivatives  using  more  sophisticated  nu¬ 
merical  analysis  methods.  We  have  succeeded  in  reliably 
obtaining  a  fourth  derivative  from  numerical  data. 


ii)  Develop  methods  that  need  fewer  quantities  such 
as  derivatives  that  need  to  be  obtained  from  the  image. 
We  have  developed  a  method  that  does  not  require  a 
parametrization  of  the  curve  and  thus  obviates  the  need 
for  the  high  derivatives  needed  in  the  above  invariants. 

3  Our  New  Invariants 

In this section we derive semi-invariants requiring fewer
derivatives than Wilczynski’s. We later derive an entirely
different method based on a canonical coordinate
system.

Modified Semi-invariants

The semi-invariants written above can be modified to reduce
the number of derivatives needed from five to four.
From eq. (2) we see that the semi-invariant $P_2$ contains
only the first derivative $p_1'$. Since the $p_i$ themselves depend
on the third derivatives of the curve, $P_2$ depends
on the fourth derivatives. However, $P_3$ contains $p_1''$ and
so depends on the fifth derivative of the curve. This can be
eliminated by subtracting $P_2'$:

$$P_3^* = P_3 - P_2' = p_3 - p_2' + 2p_1 p_1' - 3p_1 p_2 + 2p_1^3$$

Our curve can now be described by $P_2, P_3^*$, which are
invariant to the projection and to change of the arbitrary
multiplying factor $\lambda(t)$, and only involve four derivatives.
They are not invariant to the change of parameter t.
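The cancellation of the fifth-derivative term can be checked directly. Assuming the semi-invariants of eq. (2) read $P_2 = p_2 - p_1^2 - p_1'$ and $P_3 = p_3 - 3p_1p_2 + 2p_1^3 - p_1''$, differentiating $P_2$ and subtracting gives:

```latex
\begin{align*}
P_2' &= p_2' - 2p_1 p_1' - p_1'' \\
P_3 - P_2' &= \left(p_3 - 3p_1 p_2 + 2p_1^3 - p_1''\right)
            - \left(p_2' - 2p_1 p_1' - p_1''\right) \\
           &= p_3 - p_2' + 2p_1 p_1' - 3p_1 p_2 + 2p_1^3
\end{align*}
```

The two $p_1''$ terms cancel, so only first derivatives of the $p_i$ remain, i.e. fourth derivatives of the curve.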

Canonical  Invariants 

We  now  develop  a  new  way  to  obtain  invariants,  which  is 
different from Wilczynski’s and offers more insight into
the  geometrical  nature  of  the  invariants.  So  far  only 
semi-invariants  are  obtained  by  this  method  as  they  de¬ 
pend  on  the  parameter.  In  the  next  section  we  derive  full 
invariants  by  a  different  method  that  does  not  involve  a 
parameter. 

The basic idea is to use a projective transformation
in order to go over from a given coordinate system to
a “canonical” one, in which some of the derivatives are
predetermined. Other derivatives will then become invariants.
The concept can be illustrated by examples of
simpler transformations. If a 1-D function x(t) is subject
to a scale transformation in x, we can always fix the
scale along x by defining a canonical coordinate $\bar{x}$ such
that $\bar{x}'(0) = 1$. The transformation to the canonical
system is a simple normalization $\bar{x} = x/x'(0)$. By fixing
the scale, other quantities that depend on it such as
the second derivative $\bar{x}''(0)$ are also fixed and are now
scale invariant. In another example, given a curve in the
Euclidean plane, we can move the coordinate system so
that the x axis is tangent to the curve at a certain point,
$y' = 0$. The second derivative in this system is now
fixed (as curvature) and is invariant since we can obtain
the same canonical system regardless of which system we
started with. We see that by determining some of the
derivatives of the system, the others are also determined
and become invariant. We generalize this process for the
projective case.
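The scale example can be written out in one line. Suppose the observed function is $a\,x(t)$ for an unknown scale factor $a \neq 0$ (the symbol $a$ is introduced here only for illustration). Normalizing by the first derivative at the origin removes $a$ from every higher derivative:

```latex
\begin{align*}
\bar{x}(t) &= \frac{a\,x(t)}{a\,x'(0)} = \frac{x(t)}{x'(0)}, &
\bar{x}'(0) &= 1, &
\bar{x}''(0) &= \frac{x''(0)}{x'(0)}
\end{align*}
```

One condition, $\bar{x}'(0) = 1$, uses up the single parameter of the scale group, and the next derivative becomes the invariant.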

We define our canonical system as follows, with non-homogeneous
coordinates:

Table 1. Canonical Coordinate System

        f     f'    f''   f'''   f''''
  x     0     1     0     1      0
  y     0     0     1     I_1    I_2

where x, y are the canonical coordinates, f in the first
line is either x or y and the primes denote derivatives
with respect to t. The entries $I_1, I_2$ mean that there are no
conditions on these derivatives; they are our desired
invariants, as will be shown below. There are eight conditions
in all, and they can be satisfied by employing the
eight parameters of a projective transformation to arrive
at this canonical coordinate system. The following
theorem holds:


Theorem. Given a curve in the canonical system of Table 1,
its parameterization around the origin is determined
uniquely up to and including the fourth derivative.
Conversely, given a curve and its first four derivatives
with respect to some parameter t, one can always
find a coordinate system, unique for each point, satisfying
these canonical conditions.

The proof of the converse is constructive, and gives
the appropriate transformation from the given coordinate
system to the canonical one. The details are given
elsewhere [14].

These invariants have been calculated explicitly and
tested numerically. It was confirmed that they are invariant
to the projectivity, but not to a change of the
parameter. Full invariance requires either higher derivatives,
as in Wilczynski’s method, or getting rid of the
parameter altogether, as described later.

4  Affine  Differential  Invariants

Differential invariants can be obtained using Cartan’s
method of moving frames. Unlike other methods such as
Lie group prolongations, it does not require solving differential
equations. We will briefly describe the method
and apply it to the affine case, following [9].

The basic idea is to find an invariant of a “moving
frame”. This is a local coordinate system that moves
along the curve by the action of a transformation. We
will first illustrate it in the Euclidean case.

Given a curve parametrized by its arclength s, we can
define a local orthogonal frame at each point, made up
of the unit tangent t and the normal n. This is our
Euclidean moving frame. We also have a fixed global
frame, with unit vectors $e_1, e_2$ along the coordinate axes.

At each point s the local frame can be obtained from
the fixed one by a rotation matrix A(s) (translation is
ignored):

$$\{t, n\} = A(s)\{e_1, e_2\}$$

The variation of the local frame along s can be written

$$\frac{d}{ds}\{t, n\} = A'(s)\{e_1, e_2\} = A'A^{-1}\{t, n\}$$

We define the Cartan matrix C(A)

$$C(A) = A'A^{-1} \qquad (3)$$

which is a general definition for any frame transformation
A. We now have

$$\{t, n\}' = C(A)\{t, n\} \qquad (4)$$

When we transform the curve both sides of this equation
are multiplied by a constant rotation matrix B. This
leaves the equation unchanged and thus C(A) is invariant.

In our case A is a simple rotation and it is easy to
calculate its Cartan matrix:

$$C(A) = \begin{pmatrix} 0 & \kappa(s) \\ -\kappa(s) & 0 \end{pmatrix}$$

with $\kappa(s)$ being the curvature at s, a well known Euclidean
invariant. Eq. (4) can be written explicitly as

$$\frac{dt}{ds} = \kappa(s)\,n, \qquad \frac{dn}{ds} = -\kappa(s)\,t$$

which are the well known Frenet equations of plane differential
geometry.

Cartan’s method generalizes the construction and the
invariance of the Cartan matrix C(A). We can represent
a curve by a moving frame generated by the matrix A

$$\{a_1, a_2\} = A(t)\{e_1, e_2\}$$

with $a_1, a_2$ being local vectors making up the moving
frame. Transforming the curve is equivalent to multiplying
all the local frames by the same transformation
matrix B, i.e. the frame matrix in the new coordinate
system is $\tilde{A}(t) = A(t)B$. We now have the theorem:

Theorem. Let the (differentiable) frame matrix A(t)
belong to a Lie group G. Let B be a constant transformation
matrix also belonging to G, which does not
change the parameter t. Then C(A) = C(AB).

In other words, the Cartan matrix C(A) is invariant
under the transformation B.

Proof: From the definition (3) of the Cartan matrix
we have immediately

$$C(AB) = C(A) + A\,C(B)\,A^{-1} = C(A)$$

The last equality is due to the fact that B is constant so
its Cartan matrix vanishes.
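The intermediate step of the proof is the product rule applied to the definition $C(A) = A'A^{-1}$, with primes denoting derivatives with respect to t:

```latex
\begin{align*}
C(AB) &= (AB)'(AB)^{-1} \\
      &= (A'B + AB')\,B^{-1}A^{-1} \\
      &= A'A^{-1} + A\,(B'B^{-1})\,A^{-1} \\
      &= C(A) + A\,C(B)\,A^{-1}
\end{align*}
```

For constant B we have $B' = 0$, so $C(B) = 0$ and the second term drops out.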

This theorem assumes that the parameter t is not
changed by the constant transformation B. Thus, for the
method to work, we have to find an invariant parameter.
This is easy to do for a unimodular affine transformation
(i.e. one with a unit determinant). In this case we define
the moving frame as the two local vectors $c\,x_t, c\,x_{tt}$ with
the normalizing factor $c = |x_t, x_{tt}|^{-1/2}$. Thus the frame
transformation matrix is

$$A(t) = c\,\{x_t, x_{tt}\}$$

The Cartan matrix is

$$C(A) = \begin{pmatrix} -\dfrac{1}{2}\dfrac{|x_t, x_{ttt}|}{|x_t, x_{tt}|} & 1 \\[2ex] -\dfrac{|x_{tt}, x_{ttt}|}{|x_t, x_{tt}|} & \dfrac{1}{2}\dfrac{|x_t, x_{ttt}|}{|x_t, x_{tt}|} \end{pmatrix}$$

As it stands this matrix is not invariant because the
parameter $t$ is not invariant. However, we can change the
parameter so that the diagonal elements vanish. Since
this vanishing is an invariant condition, the new parameter
$\tau$ is invariant, and thus the rest of the elements of $C(A)$
also become invariant. The diagonal elements will vanish
if

$$|x_\tau, x_{\tau\tau\tau}| = 0$$

or

$$|x_\tau, x_{\tau\tau}| = 1$$

Transforming from the old parameter to the invariant
one we have

$$|x_t, x_{tt}| = |x_\tau, x_{\tau\tau}|\,\tau_t^3 = \tau_t^3$$

from which the invariant parameter, the affine arclength,
is obtained:

$$\tau = \int_{t_0}^{t} \operatorname{abs}|x_t, x_{tt}|^{1/3}\,dt$$

With this invariant parameter, there is only one significant
element left in the Cartan matrix, namely the affine
curvature

$$\kappa_a(\tau) = |x_{\tau\tau}, x_{\tau\tau\tau}|$$

Expressing this with the original parameter we see that
the fourth derivative is needed:

$$\kappa_a(t) = -\frac{5}{9}|x_t, x_{tt}|^{-8/3}|x_t, x_{ttt}|^2 + \frac{4}{3}|x_t, x_{tt}|^{-5/3}|x_{tt}, x_{ttt}| + \frac{1}{3}|x_t, x_{tt}|^{-5/3}|x_t, x_{tttt}|$$

Following the general theorem of differential invariants
cited earlier, the invariants $\tau, \kappa_a$ are a complete set that
determines the curve up to a unimodular affine
transformation.
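As a sanity check of the affine curvature expressed in the original parameter, the sketch below evaluates it on an ellipse, for which the affine curvature is known to be the constant $(ab)^{-2/3}$, and on a unimodular affine image of that ellipse. The helper names, the ellipse and the matrix `B` are illustrative assumptions; derivatives are given analytically, and the code assumes $|x_t, x_{tt}| > 0$ so that the fractional powers are real.

```python
import numpy as np

def det2(a, b):
    """Determinant |a, b| of two plane vectors."""
    return a[0] * b[1] - a[1] * b[0]

def affine_curvature(x1, x2, x3, x4):
    """Affine curvature from the first four derivatives of x(t).

    Assumes |x_t, x_tt| > 0 (a counterclockwise convex arc).
    """
    d = det2(x1, x2)
    return (-5.0 / 9.0 * d ** (-8.0 / 3.0) * det2(x1, x3) ** 2
            + 4.0 / 3.0 * d ** (-5.0 / 3.0) * det2(x2, x3)
            + 1.0 / 3.0 * d ** (-5.0 / 3.0) * det2(x1, x4))

def ellipse_derivatives(t, a, b):
    """First four derivatives of x(t) = (a cos t, b sin t)."""
    x1 = np.array([-a * np.sin(t), b * np.cos(t)])
    x2 = np.array([-a * np.cos(t), -b * np.sin(t)])
    x3 = np.array([a * np.sin(t), -b * np.cos(t)])
    x4 = np.array([a * np.cos(t), b * np.sin(t)])
    return x1, x2, x3, x4

a, b = 2.0, 1.0
derivs = ellipse_derivatives(0.3, a, b)
kappa = affine_curvature(*derivs)

# Unimodular affine map (det = 1); derivatives transform linearly.
B = np.array([[2.0, 1.0],
              [1.0, 1.0]])
kappa_mapped = affine_curvature(*[B @ v for v in derivs])

print(kappa, kappa_mapped, (a * b) ** (-2.0 / 3.0))
```

All three printed values coincide, since every determinant in the formula is multiplied by $\det B = 1$.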

Regarding a general affine transformation, these
invariants still leave two undetermined coefficients: the
starting point $t_0$ and the determinant of the affine
transformation. If we plot $\kappa_a^{-1/2}$ with respect to $\tau$ we will
obtain a curve which is affine invariant except for its
starting point and its scale, which would affect both axes
in the same way.

It is interesting to observe that the affine invariants
$\tau, \kappa_a(\tau)$ are expressed as simple determinants. Thus
their invariance can be verified directly, without relying
on the Cartan method [3]. The same is true for
the Euclidean curvature and for the algebraic projective
invariants (but not for the differential projective ones).
This is not surprising since the transformation of
determinants involves only the Jacobian. However, the
advantage of Cartan's method is that it can be used for
higher dimensions and other transformation groups.

Since  the  method  requires  differentiation  with  respect 
to  a  parameter,  it  is  not  optimal  in  the  sense  that  high 
derivatives  are  required  to  eliminate  the  parameter.  The 
parameterless  method  described  later  should  reduce  the 
number  of  quantities  that  are  needed  to  be  extracted 
from  the  image. 


5  Differential  Invariants  without
Derivatives

We  describe  here  a  method  that  combines  the  locality 
of  the  differential  invariants  with  the  advantages  of  the 
algebraic  method,  such  as  avoiding  the  need  for  curve 
parametrization. 

A problem with the differential method, as mentioned
before, is that the arbitrary parameter is a source of
ambiguity whose elimination requires higher derivatives
than the projectivity itself needs. While the projectivity
has only eight parameters that have to be eliminated,
requiring the fourth derivatives, the unknown parameter
forces us to obtain more information from the image, or
higher derivatives, to eliminate it. On the other hand,
the parameter is not in fact part of the geometry of the
curve. The coordinates $x, y$ of each point are sufficient to
characterize the curve and the parameter is introduced
artificially for convenience. The information contained
in a derivative is partly wasted from a purely geometrical
point of view because it informs us about the relation of
$x, y$ to the parameter rather than to each other, and to
obtain the full geometrical information we are forced to
go to higher derivatives. It would thus be profitable to
find other quantities that are purely geometrical and so
can be fully utilized for deriving geometrical invariants.

A way to approach the problem is to deal with an
implicit representation of the curve, i.e. one of the form
$f(x, y) = 0$, without an explicit parameter. The given
curve itself is quite arbitrary and it is hard to find an
$f$ that will represent it in this way. However, at each
point of it, one can find a simpler, osculating curve that
can be represented implicitly and whose invariants are
relatively easy to find.

An osculating curve is a generalization of the tangent.
A tangent is a line having at least two points in common
with the curve in an infinitesimal neighborhood, i.e. two
“points of contact”. This can be expressed as a condition
on the first derivative. Similarly, a higher order osculating
curve has more (independent) points of contact, and
the condition on the derivatives can be written as

$$\frac{d^k}{dt^k}\left(f^*(x, y) - f(x, y)\right) = 0, \quad (k = 0 \ldots n) \qquad (5)$$

with $f^*$ being the osculating curve, $f$ the given curve,
and $n$ the order of the osculation. Since the derivatives
vanish, this condition is invariant to the parameter $t$.
Since it has a geometric interpretation with points of
contact, the condition is also projectively invariant.

To find invariants at a point of a given curve, we can
find an algebraic (implicit) curve that osculates it at that
point. The coefficients of this osculating curve are
independent of any parametrization and we only need ten of
them to find two independent invariants. (This includes
the coordinates of the point itself which may already be
known.) Thus the reliability of the method should be
greater than the derivative based method which needs
18 quantities.

Obvious questions arise as to which algebraic curve we
should choose as an osculating curve and how to calculate
it. We solve this problem by a method similar to the
one used for obtaining robust derivatives, i.e. we choose
a curve that preserves the first moments of the original
data over some window.

Unlike other moment methods, ours is local, being
applied over a neighborhood around each point. This
method of local moment preservation is analyzed in detail
in [14] and summarized in the next section, for the
case of ordinary derivatives. We show both analytically
and numerically that it leads to much greater reliability
than other methods. We derive a quantitative relation
between the various parameters involved, i.e. the
window size, the order of the moments to be preserved
and the accuracy of the method. The analysis can be
easily carried over to the implicit case. The equality of
the first local moments also ensures that the curve $f^*$ is
indeed osculating to $f$.

Semi-differential  invariants

One  can  reduce  the  number  of  data  quantities  needed  at 
each  point  if  some  other  information  about  the  shape  is 
known,  such  as  the  appearance  of  straight  lines  or  known 
feature  points.  For  example,  a  silhouette  of  an  airplane 
can  contain  both  curved  parts  and  straight  lines.  Invari¬ 
ants  involving  both  derivatives  and  reference  points  were 
found  by  [1]  and  [12]. 

The “derivativeless” method described above is perfectly
suited for this situation, and again leads to a saving
in the number of data quantities needed from the
image and to increased reliability. If we have a known
“feature” point (not necessarily on the curve), then for
each point on the curve we can construct an osculating
curve that touches it at fewer contact points, but
will also pass through the known feature point. This
is analogous to reducing the order of derivatives in the
parametrized methods. Similarly, if we have a known
line, we can demand that the osculating curve be tangent
to this line. The invariants can be found as before.
Of course, the correspondence between the appropriate
parts of the shape has to be known.

6  Noise  Resistant  Derivatives 

As  we  saw,  the  success  of  the  differential  method  de¬ 
pends  critically  on  a  reliable  differentiation  method.  In 
this  section  we  develop  such  a  method;  our  results  have 
significant  uses  beside  invariants.  We  analyze  the  re¬ 
quirements  that  a  differentiation  method  has  to  meet, 
show  that  the  Gaussian  is  not  good  enough  for  our  pur¬ 
poses  and  present  the  method  that  was  successful  in  our 
experiments.  More  details  can  be  found  in  [14]  and  [10]. 

Two  basic,  and  conflicting,  requirements  are  involved: 
a)  Accuracy:  At  least  in  the  noiseless  case,  one  would 
like  to  obtain  the  correct  derivatives,  at  least  at  low  or¬ 
ders.  b)  Smoothing,  to  alleviate  the  effect  of  noise  and 
discretization.  Generally,  smoothing  reduces  the  accu¬ 
racy  even  in  the  analytic  case  as  the  more  rapid  changes 
in  the  function  are  smoothed  out.  The  goal  in  designing 
a  derivative  filter  is  then  to  strike  the  correct  balance 
between  accuracy  and  smoothing.  In  this  section  we  de¬ 
scribe  how  this  balance  can  be  achieved. 

As  an  illustration,  we  first  show  that  the  Gaussian 
gives  the  wrong  result  even  for  the  simplest  functions. 


Smoothing over $x^2$ gives

$$g(x, \sigma) \otimes x^2 = x^2 + \sigma^2$$

where $g(x, \sigma)$ is the Gaussian, and $\otimes$ means convolution.
We can see that an error is introduced that increases as
we increase the smoothing, i.e. higher $\sigma$. Similar results
are obtained for higher powers or for taking derivatives.
For effective smoothing $\sigma$ should not be too small, so this
error can be substantial. This is a systematic “oversmoothing
error”, as opposed to the random noise that we want to smooth.
As noted before, we do not expect a smoothing operator
to give accurate results but we do want to improve the
balance between accuracy and smoothing so that at least
the simple, smooth functions will remain accurate.
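The oversmoothing error for $x^2$ is easy to reproduce numerically. The sketch below is only an illustration: the function name, the integration grid and the values of `sigma` and `x` are assumptions, and the convolution integral is approximated by a simple Riemann sum over a window of ten standard deviations on each side.

```python
import numpy as np

def gaussian_smooth_at(f, x, sigma, half_width=10.0, n=20001):
    """Approximate (g * f)(x) for a Gaussian g of standard deviation sigma."""
    u = np.linspace(-half_width * sigma, half_width * sigma, n)
    g = np.exp(-u ** 2 / (2.0 * sigma ** 2)) / (sigma * np.sqrt(2.0 * np.pi))
    du = u[1] - u[0]
    # Convolution integral: sum over g(u) f(x - u) du.
    return np.sum(g * f(x - u)) * du

sigma = 0.5
x = 1.5
smoothed = gaussian_smooth_at(lambda v: v ** 2, x, sigma)
error = smoothed - x ** 2  # systematic error, close to sigma**2

print(error, sigma ** 2)
```

The error does not depend on `x` and grows quadratically with `sigma`, which is why stronger smoothing costs accuracy even for this simple function.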

We will now discuss this noise versus signal problem
in terms of the ratio between the smoothing parameter
$\sigma$ and a “natural” scale of the shape, $s_0$. For a smooth
shape $f(x)$, we estimate this $s_0$ as the scale over which
the relative change in the shape, $\Delta f/f$, is of the order
of magnitude of 1. We then rescale the $x$ axis so that
$x_1 = x/s_0$, and rewrite the shape as $f(x_1)$. For example,
if $f(x) = \sin(x/s_0)$ then $f(x_1) = \sin x_1$. This makes
the derivatives of $f$ with respect to $x_1$ of the order of
magnitude of 1. In this way we “normalize” the signal
in the $x$ direction and deal with its scale separately.

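A natural tool in dealing with the derivative in some
neighborhood of a smooth shape is the Taylor expansion

$$f(x) = f\!\left(\frac{x}{s_0}\right) = \sum_n \frac{f^{(n)}}{n!}\left(\frac{x}{s_0}\right)^n$$

with the derivatives $f^{(n)} = \frac{d^n f}{dx_1^n}\Big|_{x_1 = 0} \approx O(1)$. We will
look at the result of Gaussian filtering at $x = 0$. It is
easy to show that the error introduced by this smoothing
is

$$g \otimes f - f = \frac{1}{2} f''\left(\frac{\sigma}{s_0}\right)^2 + \frac{1}{8} f''''\left(\frac{\sigma}{s_0}\right)^4 + \ldots$$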

If we want accurate results, we have to keep this error
small. Looking at the leading term, we see that the
error is proportional to $\sigma^2/s_0^2$, so we have to keep $\sigma$
small. Unfortunately, this limits the ability of the filter
to smooth out the noise. A similar result is obtained for
derivatives.

A way to improve the situation is to eliminate the
leading terms in the expansions of the errors above. If
the first term is eliminated, for instance, then the error
will be reduced to $\approx (\sigma/s_0)^4$. This way we can obtain a
much better accuracy for the same smoothing parameter
$\sigma$ as before (for $\sigma < s_0$). Alternatively, we can increase
the smoothing without compromising accuracy.

Since the first error terms come from the first powers
in the Taylor expansion of $f$, we need a filter $F_l$ that will
preserve the first $l$ powers $x^n$:

$$F_l \otimes x^n = x^n, \quad n = 0 \ldots l$$


(For the Gaussian $l = 1$.) It can be easily shown that
the powers will be preserved if the first $l$ moments of the
filter vanish (and the 0th moment is normalized to 1).
Defining the “normalized moments” as

$$m_n = \int \left(\frac{x}{\sigma}\right)^n F_l(x)\,dx$$

where $\sigma$ is now a measure of the filter size, e.g. the
variance, we have the conditions

$$m_0 = 1; \quad m_n = 0 \ \text{for} \ n = 1 \ldots l$$

The error in the smoothing of $f^{(k)}$ at $x = 0$ is now simply
the $l + 1$ term:

$$\frac{m_{l+1}}{(l+1)!}\, f^{(l+k+1)} \left(\frac{\sigma}{s_0}\right)^{l+1}$$
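The moment conditions can be turned directly into a small linear system. The sketch below is an illustration under stated assumptions, not the filters of [10] or [14]: it takes the filter to be a Gaussian multiplied by a polynomial in plain monomials (rather than Hermite or Krawtchouk polynomials), works on an effectively infinite interval, and uses hypothetical function names. It solves for the coefficients so that the 0th moment is 1 and moments 1 to `order` vanish, then verifies that $x^2$ is preserved.

```python
import numpy as np

def gaussian_moment(k, sigma):
    """k-th moment of a zero-mean Gaussian: sigma^k (k-1)!! for even k, else 0."""
    if k % 2:
        return 0.0
    return sigma ** k * float(np.prod(np.arange(k - 1, 0, -2)))

def zero_moment_filter(order, sigma):
    """Coefficients a_i of F(u) = (sum_i a_i u^i) g(u).

    Enforces: 0th moment = 1 and moments 1..order = 0.
    """
    n = order + 1
    # Row j expresses the j-th moment of F in terms of the a_i.
    m = np.array([[gaussian_moment(i + j, sigma) for i in range(n)]
                  for j in range(n)])
    rhs = np.zeros(n)
    rhs[0] = 1.0
    return np.linalg.solve(m, rhs)

def smooth_at(f, x, coeffs, sigma, half_width=10.0, n=20001):
    """Approximate (F * f)(x) by a Riemann sum."""
    u = np.linspace(-half_width * sigma, half_width * sigma, n)
    g = np.exp(-u ** 2 / (2.0 * sigma ** 2)) / (sigma * np.sqrt(2.0 * np.pi))
    poly = np.polynomial.polynomial.polyval(u, coeffs)
    return np.sum(poly * g * f(x - u)) * (u[1] - u[0])

sigma, x = 0.5, 1.5
square = lambda v: v ** 2

plain = smooth_at(square, x, zero_moment_filter(0, sigma), sigma)      # Gaussian
corrected = smooth_at(square, x, zero_moment_filter(2, sigma), sigma)  # order 2

print(plain - x ** 2, corrected - x ** 2)
```

With these values the plain Gaussian shows the $\sigma^2$ error while the order-2 filter reproduces $x^2$ up to numerical precision, at the same $\sigma$.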


Thus, increasing the order $l$ of the filter eliminates the
powers $(\sigma/s_0)^n$ up to $n = l$ and improves the accuracy to
the above tolerance.

Accuracy Criterion

For good accuracy, the above error has to be small. This
leads to an “accuracy criterion” for choosing appropriate
parameters. Since $f^{(n)} \approx O(1)$, we obtain the condition

$$\frac{m_{l+1}}{(l+1)!}\left(\frac{\sigma}{s_0}\right)^{l+1} \ll 1 \qquad (6)$$

This criterion can be used to estimate the parameters
in several ways. Given the meaningful scale of change
of the signal, $s_0$, and the smoothing parameter $\sigma$, we
can calculate the order $l$ of the filter needed to lower
the oversmoothing error to some acceptable level. $\sigma$ can
be determined from the estimated noise. Conversely, for
a given order $l$, we can calculate the largest smoothing
parameter $\sigma$ that will still maintain a desired accuracy.
Generally speaking, at low orders we need $\sigma < s_0$, but
at high orders we can afford to have the smoothing $\sigma$
bigger than $s_0$ because the factor $m_{l+1}/(l+1)!$ in eq.
(6) is small.

One way of obtaining such zero-moment filters is by
multiplying a Gaussian with appropriate polynomials.
The filter of order $l$, eliminating errors up to order $l$, is (on
an infinite interval)

$$F_l = \sum_{i=0}^{l} \left(a_i P_i(x)\right) g(x)$$

where $P_i(x)$ are Hermite polynomials which are orthogonal
with respect to the Gaussian weight function. The
coefficients $a_i$ are chosen so that the first $l$ moments of
the filter vanish, in accordance with our conditions.

Oversmoothing is only one problem with the Gaussian.
The truncation of the infinite filter is also a serious
problem, because the derivative becomes meaningless
at the ends of the window and contributes a large
error [14]. Discrete versions of this method on a finite
window are described in detail in [10]. Closed form filters
are derived there that yield differentiation filters of
the desired orders. We use here one of these methods,
based on the Krawtchouk polynomials. They are defined
by the condition of orthogonality with respect to
the binomial weight function.

In summary, we have been able to obtain good estimation
of high derivatives by increasing the filter size $\sigma$
and its order $l$ to meet the above accuracy criterion.

7  Experiments

We describe here experiments done on synthetic shapes.
Despite the limitations of such experiments, we were able
to obtain a good idea of the reliability of the high derivatives
by perturbing the data and looking at the change
in the invariants. The results, with the differentiation
method that we employed, were very good.

Figure  1:  Peanut:  Projection  1

Figure  2:  Peanut:  Projection  2


Projective  Invariants 

Fig.  1  shows  a  “peanut”  shape,  created  as 


x(t)  =  2 cos t

y(t)  =  sin t  +  ^  sin 3t


and Fig. 2 is its projection with tilt and slant of about
45°. Both projections were discretized and the derivatives
were calculated from the discrete images.


Figure  3:  Peanut:  Invariant  curve


To suppress noise in the derivatives, we would like the
smoothing factor of the filter $\sigma$ to be big. But, according
to the analysis of the last section, $\sigma$ cannot be much
bigger than the scale of variation of the shape, which
here is $s_0 = 1/3$. We chose the order of the filter as
$l = 10$, and found that very good accuracy is obtained
with $\sigma = 0.4$, in line with our accuracy criterion, eq.
(6). With these parameters, we obtained very reliable
and noise resistant invariants. Perturbing the data by
10% yielded perturbations in the invariants of less than
that amount, without sacrificing the distinctiveness of
the curve. A smaller $\sigma$ would require a smaller window
but would result in more sensitivity to noise. This shows
that there is no serious problem in using high derivatives
as long as the differentiation method and its parameters
are chosen properly.

In Fig. 3 we have plotted normalized versions of our
modified semi-invariants, $I_1, I_2$, which were similar for
both projections, for a quarter of the images.

h  = 

$$I_i' = \frac{I_i}{\sqrt{1 + I_i^2}}$$

The  first  line  above  makes  the  invariants  of  equal  weight. 
The  second  one  is  a  normalization  done  to  counter  the 
effect  of  the  singularity  that  we  observe.  This  singu¬ 
larity  comes  from  the  inflection  point  of  the  shape,  i.e. 
where the curvature has a zero crossing. At this point
both invariants tend to infinity and are equal in absolute
magnitude, but jump between $\pm\infty$. The normalization
reduces the infinities to $\pm 1$.

Inflection points are themselves projective invariants
and can be found in other ways as well. Thus their location
on the curve is fixed and should not pose a problem
for our method. The power of the differential invariants
is in providing information about the behavior of
the curve between the inflection points. We can see that
the location of the inflection points is determined quite
precisely in this method, an observation that has significance
for the general problem of accurate edge detection.

Figure  4:  Spiral:  Projection  1

For the Krawtchouk filter that we have used, we have
$\sigma \approx w/l$, with $w$ being the width of the window. Thus
we needed a window of width $l\sigma = 4$, which is more
than half the shape (but we only used 11 points in each
window). We believe that an optimized differentiation
method such as proposed in [14] can provide the same
smoothing with a smaller window.

The same method was applied to logarithmic spirals.
Figs. 4 and 5 show the two projections. An interesting
result, which can be proven analytically, is that a logarithmic
spiral has constant projective invariants along
its arclength. This may be interesting in connection with
recognizing biological forms because according to some
theories [5] many such forms can be segmented into spiral
parts.

Affine  Invariants 

The affine invariants developed before are invariant to
reparametrization of the curve as well as the projection,
so we could test this property under conditions of
discretization. The derivatives were calculated in the
same way as before, and again the fourth derivative
was needed. Fig. 6 shows an affine projection of our
peanut with a unimodular 2x2 matrix. Fig. 7 shows
the invariant curves for both projections, with different
parametrizations. In both curves the affine invariant
$\kappa_a^{-1/2}$ is plotted against the affine arclength $\tau$. The
discrete points are also shown.

We can see that the curves are quite close. The biggest
difference is in the region where the affine curvature is at
its maximum, at which point it also changes sign. This
region surrounds a singularity so the results there are
somewhat less reliable but still very informative. The
point at which the affine curvature touches zero without
changing sign is the inflection point, i.e. the Euclidean
curvature changes sign there.

We note that unlike the projective case, here we have
to use the same starting point $t_0$, otherwise the curves
will be shifted horizontally. Also, for a general affine
transformation (with arbitrary determinant) the scale of
the invariant curve will change. We chose the invariants
so that the scale change will be the same for both axes.

8  Conclusions 

We  have  developed  new  methods  of  obtaining  differential 
invariants  and  implemented  some  of  them  experimen¬ 
tally.  We  have  obtained  an  invariant  signature  that  can 
be  used  to  recognize  a  plane  curve  regardless  of  the  point 
of  view  from  which  the  curve  is  seen.  We  have  found  a 
new  method  of  smoothing  and  differentiation  and  showed 
that  it  is  robust  to  noise,  which  makes  the  signature 
reasonably  reliable.  Our  smoothing  method,  involving 
preservation  of  moments  in  local  neighborhoods,  has  sig¬ 
nificant  applications  in  other  fields  as  well. 

Figure  6:  Peanut:  Projection  3

Figure  7:  Peanut:  Affine  invariant  curves

References

[1] Barrett, E., Payton, P., Haag, N. and Brill, M.
[1991], “General methods for determining projective
invariants in imagery”, CVGIP:IU 53, 45-65.

[2] Brown, C.M. [1991] “Numerical evaluation of differential
and semi-differential invariants”, TR-393,
University of Rochester Computer Science Department,
September 1991.

[3] Bruckstein, A. and Netravali, A. [1990], “On differential
invariants of planar curves and recognizing
partially occluded planar objects”, AT&T TR, July
1990.

[4] Burns, J.B., Weiss, R., and Riseman, E.M. [1990],
“View variation of point set and line segment features”,
Proc. DARPA IU Workshop, 650-659.

[5] Cook, T. [1914] The Curves of Life, New York:
Dover.

[6] DARPA-ESPRIT Workshop on Invariance, Proceedings
[1991], Reykjavik, Iceland, March 1991.

[7] Duda, R.O. and Hart, P.E. [1973] Pattern Recognition
and Scene Analysis, New York: Wiley.

[8] Forsyth, D., Mundy, J.L., Zisserman, A., and
Brown, C.M. [1990], “Invariance - A new framework
for vision”, Proc. ICCV, 598-605.

[9] Guggenheimer, H. [1963] Differential Geometry,
New York: Dover.

[10] Meer, P. and Weiss, I. [1989], “Smoothed differentiation
filters for images”, Center for Automation
Research, University of Maryland, TR 424.

[11] Nielsen, L. and Sparr, G. [1991] “Projective area
invariants as an extension of the cross ratio”,
CVGIP:IU 54, 145-159.

[12] Van Gool, L., Wagemans, J., Vandeneede, J.,
Oosterlinck, A., [1990], “Similarity extraction and
modeling”, Proc. ICCV, 530-534.

[13] Weiss, I. [1988] “Projective invariants of shapes”,
Proc. DARPA IU Workshop, 1125-1134.

[14] Weiss, I. [1991] “High order differentiation filters
that work”, University of Maryland, Center for
Automation Research, TR CAR 545.

[15] Weiss, I., Meer, P. and Dunn, S.M. [1991],
“Robustness of algebraic invariants”, Proc. DARPA-ESPRIT
Workshop on Invariance, Reykjavik.

[16] Wilczynski, E.J. [1906], Projective Differential
Geometry of Curves and Ruled Surfaces, Leipzig:
Teubner.



Polynomial-Time  Object  Recognition  in  the 
Presence  of  Clutter,  Occlusion,  and  Uncertainty 

Todd  A.  Cass 

Artificial  Intelligence  Laboratory 
Massachusetts  Institute  of  Technology 


Abstract 

We consider the problem of object recognition
via local geometric feature matching in
the  presence  of  sensor  uncertainty,  occlusion, 
and  clutter.  We  present  a  general  formulation 
of  the  problem  and  a  polynomial-time  algo¬ 
rithm  which  guarantees  finding  all  feasible  in¬ 
terpretations  of  the  data,  modulo  uncertainty, 
in  terms  of  the  model.  This  formulation  ap¬ 
plies  to  problems  involving  both  2D  and  3D 
objects. 

We will describe the theory in general, and
analyze particular cases in detail including an
approach achieving efficient practical performance.
Algorithms, implementations, and experiments
are presented.

1  Introduction 

The  task  considered  here  is  the  recognition  by  com¬ 
puter  of  a  known  object  in  some  environment  via  the 
use  of  sensory  data,  such  as  a  visible  light  image,  de¬ 
rived  from  the  environment.  Object  models  and  sen¬ 
sory  data  are  represented  as  sets  of  local  geometric  fea¬ 
tures,  e.g.  points  and  lines.  The  problem  is  formu¬ 
lated  as  matching  model  features  and  data  features  to 
determine  the  pose  of  an  instance  of  the  model.  This 
problem  is  hard  because  there  are  spurious  and  miss¬ 
ing  features,  as  well  as  sensor  uncertainty.  This  paper 
presents  improvements  and  extensions  to  earlier  work[9, 
7]  describing  robust,  complete,  and  provably  correct 
methods  for  polynomial-time  object  recognition  in  the 
presence  of  clutter,  occlusion,  and  sensor  uncertainty. 

We  assume  the  uncertainty  in  the  sensor  measure¬ 
ments  of  the  data  features  is  bounded.  A  model  pose 
is  considered  feasible  for  a  given  model  and  data  fea¬ 
ture  match  if  at  that  pose  the  two  matched  features  are 
aligned  modulo  uncertainty,  that  is,  if  the  image  of  the 
model feature falls within the uncertainty bounds of the
data feature. We show that, given a set of model and
data features and assuming bounded sensor uncertainty,
there are only a polynomial number of qualitatively distinct
poses matching the model to the data. Two different
poses are qualitatively distinct if the sets of feature
matches aligned (modulo uncertainty) by them are different.
The approach is based on determining the set
of qualitatively different poses. We call this approach
pose equivalence analysis. A previous paper [8] introduced
the idea of pose equivalence analysis; this paper
contributes a simpler explanation of the approach based
on linear constraints and transformations, outlining the
general approach for the case of 3D and 2D models with
2D data. This formalization provides a simple and clean
mathematical framework within which to analyze the
feature matching problem in the presence of bounded
geometric uncertainty, providing insight into the fundamental
nature of this type of feature matching problem.
We analyze a particular case of 2D models and planar
transformations to illustrate how the structure of the
matching problem can be exploited to develop efficient
matching algorithms based on this approach.
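The notions of a feasible pose and of qualitatively distinct poses can be made concrete with a small sketch. Everything specific here is an assumption for illustration: features are 2D points, a pose is a planar rotation plus translation `(theta, tx, ty)`, and the bounded uncertainty of each data feature is a disc of radius `eps`. The paper's own formulation, based on linear constraints, is more general.

```python
import math

def apply_pose(pose, point):
    """Apply a planar rigid pose (theta, tx, ty) to a 2D point."""
    theta, tx, ty = pose
    x, y = point
    c, s = math.cos(theta), math.sin(theta)
    return (c * x - s * y + tx, s * x + c * y + ty)

def aligned_matches(pose, model, data, eps):
    """Set of (model index, data index) matches aligned modulo uncertainty.

    A match is aligned if the transformed model point falls within
    distance eps (assumed circular uncertainty bound) of the data point.
    """
    matches = set()
    for i, m in enumerate(model):
        mx, my = apply_pose(pose, m)
        for j, (dx, dy) in enumerate(data):
            if math.hypot(mx - dx, my - dy) <= eps:
                matches.add((i, j))
    return frozenset(matches)

def qualitatively_distinct(pose_a, pose_b, model, data, eps):
    """Two poses are qualitatively distinct if they align different match sets."""
    return (aligned_matches(pose_a, model, data, eps)
            != aligned_matches(pose_b, model, data, eps))

model = [(0.0, 0.0), (1.0, 0.0)]
data = [(2.0, 1.0), (3.0, 1.0), (5.0, 5.0)]  # last point plays the role of clutter
eps = 0.1

pose_a = (0.0, 2.0, 1.0)
pose_b = (0.0, 2.05, 1.0)  # slightly shifted, aligns the same matches
pose_c = (0.0, 5.0, 5.0)   # aligns model point 0 with the clutter point

print(sorted(aligned_matches(pose_a, model, data, eps)))
print(qualitatively_distinct(pose_a, pose_b, model, data, eps))
print(qualitatively_distinct(pose_a, pose_c, model, data, eps))
```

Poses `pose_a` and `pose_b` fall in the same equivalence class, since they align the same match set, while `pose_c` belongs to a different one.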

1.1  Robust  Object  Localization  via  Feature
Matching

Object  localization  will  be  defined  as  the  problem  of 
determining  the  geometric  correspondence  between  the 
model  and  some  a  priori  unknown  subset  of  the  data. 
We’re  given  a  geometrical  model  of  the  spatial  structure 
of  an  object  and  the  problem  is  to  select  data  subsets 
corresponding  to  instances  of  the  model,  and  determine 
the position and orientation, or pose*, of the object in
the  environment  by  matching  the  geometric  model  with 
instances  of  the  object  represented  in  the  sensory  data. 

In this paper the model and the sensory data will be
represented in terms of local geometric features consisting
of points or lines and possibly curve normals. Localization
is achieved via matching model and data feature
subsets. There are four features of this task domain
which are important to consider in the work: the feature
matching problem is difficult because the correct feature
correspondences are unknown; there are spurious data
features; there can be model features missing from the
data due to occlusion and failures in feature extraction;
most importantly, the sensory data are subject to geometrical
uncertainty that greatly affects the process of
robust and correct detection and pose determination.

Localization is reasonably divided into a pose hypothesis
stage and a pose verification stage. This paper considers
pose hypothesis construction via feature matching.

*The pose of the model is its position and orientation,
which is equivalent to the transformation producing
it. In this paper pose and transformation will be used
interchangeably.

1.2  Robustness,  Completeness,  and
Tractability

There are three important criteria by which to analyze
methods for object localization: robustness, completeness,
and tractability. The robustness requirement
means that careful attention is paid to the geometric
uncertainty in the features, so no correct feature
correspondences are missed due to sensor error. The completeness
requirement means that all sets of geometrically consistent
feature correspondences are found, including the
correct ones. The tractability requirement simply means
that a polynomial-time and hopefully efficient algorithm
exists for the matching procedure. Except for our previous
work [7, 8], and recent work by Breuel [4], among
those existing methods accurately accounting for uncertainty,
none can both guarantee that all feasible object
poses will be found and do so in polynomial time. Those
that do account for error and guarantee completeness
have expected-case exponential complexity [13].

Uncertainty is the main factor which makes the localization
problem difficult. If there is no uncertainty
then simple polynomial-time algorithms are possible for
localization, guaranteeing success [15]. However, if the
measured positions of features are used without accounting
for possible deviation from the correct positions then
these approaches cannot guarantee the correct matching
of the model to the data.

1.3 Correspondence Space vs. Pose Space

For given sets of model and data features, we defined localization as both determining which subset of data features correspond to the model features, and how they correspond geometrically. Localization can be accomplished by either searching for geometrically consistent feature correspondences or searching for the model transformation or pose geometrically aligning model and data features. Techniques based on these approaches can be called correspondence space methods and pose space methods, respectively.

Correspondence space is the power set of the set of all model and data feature pairs: $2^{\{m_i\}\times\{d_j\}}$, where $\{m_i\}$ and $\{d_j\}$ represent the sets of model and data features, respectively. We define a match set $M \in 2^{\{m_i\}\times\{d_j\}}$ as an arbitrary set of model and data feature matches. Localization can be achieved by finding geometrically consistent match sets, that is, match sets for which there exists some pose of the model aligning, modulo uncertainty, the matched features. One way to structure this is as a search through the correspondence space, often structured as a tree search[14, 3]. Correspondence space is an exponential-sized set, and although enforcing geometric consistency in the search prunes away large portions of the search space, it has been shown that the expected search time is still exponential[13].

Pose or transformation space is the space of possible transformations on the model. Localization can be accomplished by searching pose space for model poses aligning, modulo uncertainty, model features to data features. Examples of techniques searching pose space are pose clustering[19], transformation sampling[6, 5], and the method described in this paper, pose equivalence analysis[8]. The pose space is a high dimensional, continuous space, and the effects of data uncertainty and missing and spurious features make effectively searching it to find consistent match sets difficult.

The approach described in this paper provides a framework for unifying the correspondence space approach and the pose space approach.


We will represent the model and the image data in terms of local geometric features such as points and line segments derived from the object's boundary, to which we may also associate an orientation. Denote a model feature $m = (p_m, \theta_m)$ by an ordered pair of vectors representing the feature's position and orientation, respectively. Similarly, the measured geometry of a data feature is given by $d = (p_d, \theta_d)$. Define $U_i^p$ and $U_i^\theta$ to be the uncertainty regions for the position and orientation, respectively, of data feature $d_i$. We assume that the uncertainties in position and orientation are independent. The true position of $d_i$ falls in the set $U_i^p$, and its true orientation falls in the set $U_i^\theta$. We can think of $U_i$ as a region surrounding $d_i$. For most of the following analysis we will consider the case where the features are simply points in the plane without an associated orientation; we then show how to incorporate orientation constraints.

Correctly hypothesizing the pose of the object is equivalent to finding a transformation on the model into the scene aligning the model with its instance in the data, by aligning individual model and data features. Aligning a model feature and a data feature consists of transforming the model feature such that the transformed model feature falls within the geometric uncertainty region for the data feature. We can think of the data as a set of points and uncertainty regions $\{(p_{d_j}, U_j^p)\}$ in the plane, where each measured data position is surrounded by some positional uncertainty region $U_j^p$. A model feature with position $p_{m_i}$ and a data feature with position $p_{d_j}$ are aligned via a transformation $T$ if $T[p_{m_i}] \in U_j^p$. Because the pose of an instance of the model in the environment and the model itself are related by a rigid transformation, some transformations will align many correct model and data feature matches, although which matches are correct is unknown a priori. Intuitively, the whole problem is then to find single transformations simultaneously aligning, in this sense, a large number of pairs of model and image features.

One of the main contributions of this work, and the key insight of this approach, is the idea that under the bounded uncertainty model there are only a polynomial number of qualitatively different transformations or poses aligning subsets of a given model feature set with subsets of a given data feature set. Finding these equivalence classes of transformations is equivalent to finding all qualitatively different sets of feature correspondences. Thus we need not search through an exponential number of sets of possible feature correspondences as previous systems have, nor consider an infinite set of possible transformations.

In the 2D case the transformations will consist of a planar rotation, scaling, and translation. We've said that a model feature $m_i$ and data feature $d_j$ are aligned by a transformation $T$ iff $T[m_i] \in U_j$. Two transformations are qualitatively similar if and only if they align in this sense exactly the same set of feature matches. All transformations which align the same set of feature matches are equivalent. Thus there are equivalence classes of transformations. This is a key insight. More formally, let $\Omega$ be the transformation parameter space, and let $T \in \Omega$ be a transform. Define $\varphi(T) = \{(m_i, d_j) \mid T[m_i] \in U_j\}$ to be the set of matches aligned by the transformation $T$. The function $\varphi(T)$ partitions $\Omega$, forming equivalence classes of transformations $E_k$, where $\Omega = \bigcup_k E_k$ and $T \equiv T' \iff \varphi(T) = \varphi(T')$. The entire recognition approach developed in this paper is based upon computing these equivalence classes of transformations, and the set of feature matches associated with each of them.
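The definition of $\varphi(T)$ can be illustrated with a minimal sketch. It assumes point features, axis-aligned square uncertainty regions of half-width `eps`, and transformations stored as tuples `(s1, s2, t1, t2)`; the function names and the sample data are hypothetical and only serve the illustration.

```python
def apply_transform(T, p):
    """Apply T = (s1, s2, t1, t2) to point p: scaled rotation, then translation."""
    s1, s2, t1, t2 = T
    x, y = p
    return (s1 * x - s2 * y + t1, s2 * x + s1 * y + t2)


def match_set(T, model, data, eps):
    """phi(T): all index pairs (i, j) with T[m_i] inside the square U_j.

    U_j is assumed to be the axis-aligned square of half-width eps around d_j.
    """
    aligned = set()
    for i, m in enumerate(model):
        tx, ty = apply_transform(T, m)
        for j, d in enumerate(data):
            if abs(tx - d[0]) <= eps and abs(ty - d[1]) <= eps:
                aligned.add((i, j))
    return frozenset(aligned)


def equivalent(T1, T2, model, data, eps):
    """Two transformations are equivalent iff they align the same matches."""
    return match_set(T1, model, data, eps) == match_set(T2, model, data, eps)


# Hypothetical example: three model points, four data points (one spurious).
model = [(0.0, 0.0), (1.0, 0.0), (0.0, 1.0)]
data = [(2.0, 3.0), (3.0, 3.0), (2.0, 4.0), (7.0, 7.0)]
T_a = (1.0, 0.0, 2.0, 3.0)    # pure translation by (2, 3)
T_b = (1.0, 0.0, 2.05, 2.95)  # slightly perturbed, still inside every U_j
T_c = (1.0, 0.0, 5.0, 5.0)    # aligns nothing
```

Here `T_a` and `T_b` fall in the same equivalence class because they align the same three matches, while `T_c` belongs to the class associated with the empty match set.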

2.1  Relating  Pose  Space  and  Correspondence 
Space 

If a model feature $m_i$ and a data feature $d_j$ are to correspond to one another, the set of transformations on the model feature which are feasible is given by the set of transformations $\mathcal{F}_{m_i,d_j} = \{T \in \Omega \mid T[m_i] \in U_j\}$. Let $M = \{(m_i, d_j)\}$ be some match set². A match set is called geometrically consistent iff $\bigcap_{(m_i,d_j)\in M} \mathcal{F}_{m_i,d_j} \neq \emptyset$, that is, iff there exists some transformation which is feasible for all $(m_i, d_j) \in M$.

The match set given by $\varphi(T)$ for some transformation $T$ is called a maximal geometrically-consistent match set. A match set $M$ is a maximal geometrically-consistent match set (or, a maximal match set) if it is the largest geometrically consistent match set at some transformation $T$. Thus by definition the match set given by $\varphi(T)$ is a maximal match set. The function $\varphi(T)$ is a mapping from transformation space to correspondence space, $\varphi(T) : \Omega \to 2^{\{m_i\}\times\{d_j\}}$, and there is a one-to-one correspondence between the pose equivalence classes and the maximal match sets given by $\varphi(T) \leftrightarrow E_k$ iff $T \in E_k$. The function $\varphi(T)$ partitions the infinite set of possible object poses into a polynomial-sized set of pose equivalence classes, and identifies a polynomial-sized subset of the exponential-sized set of possible match sets.

The important point is that the pose equivalence classes and their associated maximal match sets are the only objects of interest: all poses within a pose equivalence class are qualitatively the same; and the maximal geometrically consistent match sets are essentially the only sets of feature correspondences that need be considered because they correspond to the pose equivalence classes. Figure 1 shows the relationship between correspondence space and transformation space. Note that this implies we do not need to consider all consistent match sets, or search for one-to-one feature matchings, because they are simply subsets of some maximal match set, and provide no new pose equivalence classes. However, given a match set we can easily construct a maximal, one-to-one matching between data and model features[16].

²This is also sometimes called a correspondence, or a matching. To clarify terms, we will define a match as a pair of a model feature and a data feature, and a match set as a set of matches. The term matching implies a match set in which the model and data features are in one-to-one correspondence.

Figure 1: The pose space is partitioned into a polynomial number of pose equivalence classes, each associated with a maximal match set, via the function $\varphi(T)$. The maximal match sets are the only sets of feature correspondences we need to consider.

One distinction between this approach, which works in transformation space, and robust and complete correspondence space tree searches[14, 3] is that for each maximal geometrically consistent match set (or equivalently for each equivalence class of transformations) there is an exponential-sized set (in terms of the cardinality of the match set) of different subsets of feature correspondences which all specify the same set of feasible transformations. Thus the straightforward pruned tree search does too much (exponentially more) work by exploring these subsets of maximal match sets. This is part of the reason why these correspondence space search techniques have exponential expected case performance, yet our approach is polynomial (see section 8 and the discussion of Breuel's[4] work).

2.2  Feature  Matching  Requires  Only 
Polynomial  Time 

Formalizing the localization problem in terms of bounded uncertainty regions and transformation equivalence classes allows us to show that it can be solved in time polynomial in the size of the feature sets. Cass[7] originally demonstrated this using quadratic uncertainty constraints. This idea can be easily illustrated using the linear vector space of 2D scaled rotations and translations[2, 3], and the linear constraint formulation used by Baird[3] and recently by Breuel[4]. In the 2D case the transformations will consist of a planar rotation, scaling, and translation. Any vector $s = [s_1, s_2]^T = [\sigma\cos\theta, \sigma\sin\theta]^T$ is equivalent to a linear operator $S$ performing a rigid rotation by an orthogonal matrix $R \in SO_2$ and a scaling by a factor $\sigma \in \mathbb{R}$, $0 < \sigma < \infty$, where

$$S = \sigma R = \sigma \begin{bmatrix} \cos\theta & -\sin\theta \\ \sin\theta & \cos\theta \end{bmatrix} = \begin{bmatrix} s_1 & -s_2 \\ s_2 & s_1 \end{bmatrix}.$$

We denote the group of all transformations by $\Omega$, with translations given by $t = [t_1, t_2]^T$ and scaled rotations given by $s = [s_1, s_2]^T$, so a point $x$ is transformed by $T[x] = Sx + t$, and a transformation, $T$, can be represented by a vector $T \leftrightarrow [s_1, s_2, t_1, t_2]^T \in \mathbb{R}^4$. Thus the set of all scaled rigid planar motions is isomorphic to $\mathbb{R}^4$.
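The reason this parameterization is useful is that the transformed point is linear in the four parameters. The following sketch checks this numerically; the helper names are hypothetical, and `coefficient_rows` is simply one way of writing $T[x]$ as a matrix acting on $[s_1, s_2, t_1, t_2]^T$.

```python
import math


def scaled_rotation(sigma, theta):
    """Return s = (s1, s2) = (sigma cos(theta), sigma sin(theta))."""
    return (sigma * math.cos(theta), sigma * math.sin(theta))


def transform(s, t, x):
    """T[x] = S x + t with S = [[s1, -s2], [s2, s1]]."""
    s1, s2 = s
    return (s1 * x[0] - s2 * x[1] + t[0], s2 * x[0] + s1 * x[1] + t[1])


def coefficient_rows(x):
    """Rows A(x) such that T[x] = A(x) [s1, s2, t1, t2]^T."""
    return [(x[0], -x[1], 1.0, 0.0), (x[1], x[0], 0.0, 1.0)]


def transform_linear(params, x):
    """Evaluate T[x] as a linear function of params = (s1, s2, t1, t2)."""
    return tuple(sum(a * p for a, p in zip(row, params))
                 for row in coefficient_rows(x))
```

Because `transform` and `transform_linear` agree for every point, any linear constraint on the transformed point becomes a linear constraint on the 4-vector of parameters.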


By assuming $k$-sided polygonal uncertainty regions³ $U_j^p$ and following the formulation of Baird, the uncertainty regions can be described by the set of points $x$ satisfying inequality (1):

$$(x - p_{d_j})^T n_l \le \epsilon_l \quad \text{for } l = 1, \dots, k, \tag{1}$$

and thus by substitution the set of feasible transformations for a feature match $(m_i, d_j)$ is constrained by inequality (2):

$$(S p_{m_i} + t - p_{d_j})^T n_l \le \epsilon_l \quad \text{for } l = 1, \dots, k, \tag{2}$$

which can be rewritten as constraints on the transformation vector $[s_1, s_2, t_1, t_2]^T$ as

$$s_1 (p_{m_i}^x n_l^x + p_{m_i}^y n_l^y) + s_2 (p_{m_i}^x n_l^y - p_{m_i}^y n_l^x) + t_1 n_l^x + t_2 n_l^y \le p_{d_j}^T n_l + \epsilon_l, \quad l = 1, \dots, k,$$

where $n_l$ is the unit normal vector and $\epsilon_l$ the scalar distance describing each linear constraint for $l = 1, \dots, k$. The first set of linear inequalities, (1), delineates the polygonal uncertainty regions $U_j^p$ by the intersection of $k$ halfplanes, and the second set of inequalities, (2), provides $k$ hyperplane constraints in the linear transformation space $\Omega$. The intersection of the $k$ halfspaces forms the convex polytope $\mathcal{F}_{m_i,d_j}$ of feasible transformations for match $(m_i, d_j)$. The arrangement⁴ of these $mn$ convex polytopes forms the partition of transformation space into equivalence classes. To see this another way, consider the set of $k$ hyperplanes for each of the $mn$ feature matches $(m_i, d_j)$. The arrangement constructed by these $kmn$ hyperplanes partitions the set of transformations into cells. It is well known from computational geometry that the complexity of the arrangement of $kmn$ hyperplanes in $\mathbb{R}^4$ is $\Theta(k^4 m^4 n^4)$ in terms of the number of elements of the arrangement. These elements are called $k$-faces, where a 0-face is a vertex, a 1-face is an edge, a 3-face is a facet, and a 4-face is a cell, and these elements can be constructed and enumerated in $O(k^4 m^4 n^4)$ time[10].
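As a rough illustration of inequality (2), the sketch below builds the $k$ halfspace constraints on $[s_1, s_2, t_1, t_2]^T$ for a single match and tests whether a given transformation satisfies them. It assumes a square uncertainty region ($k = 4$) of half-width `eps`; the function names and the tolerance `tol` are hypothetical choices made for this example.

```python
def square_region(eps):
    """(normal, offset) pairs for an axis-aligned square of half-width eps."""
    return [((1.0, 0.0), eps), ((-1.0, 0.0), eps),
            ((0.0, 1.0), eps), ((0.0, -1.0), eps)]


def match_constraints(p_m, p_d, region):
    """Rows (a, b) meaning a . [s1, s2, t1, t2] <= b, one per region side.

    Each row is inequality (2) expanded for one normal n_l and offset eps_l.
    """
    x, y = p_m
    rows = []
    for (nx, ny), e in region:
        a = (x * nx + y * ny,   # coefficient of s1
             x * ny - y * nx,   # coefficient of s2
             nx,                # coefficient of t1
             ny)                # coefficient of t2
        b = p_d[0] * nx + p_d[1] * ny + e
        rows.append((a, b))
    return rows


def feasible(T, rows, tol=1e-12):
    """True if T = (s1, s2, t1, t2) satisfies every halfspace constraint."""
    return all(sum(ai * Ti for ai, Ti in zip(a, T)) <= b + tol
               for a, b in rows)
```

The set of all `T` for which `feasible` returns true is the convex polytope of feasible transformations for that match; stacking the rows for all $mn$ matches gives the $kmn$ hyperplanes whose arrangement is discussed above.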

We are interested in the relationship between the equivalence classes $E_k$ of transformations defined by the function $\varphi(T)$, and the elements of the arrangement imposed by the set of all $kmn$ constraint hyperplanes. The transformation equivalence classes are the cells and faces formed by the arrangement of the $mn$ convex polytopes $\mathcal{F}_{m_i,d_j}$ in the transformation space[7]. So the arrangement formed by the feasible regions $\mathcal{F}_{m_i,d_j}$ is in a sense a subset of the arrangement formed by the $kmn$ hyperplanes, whose complexity is higher. Thus there are elements of the hyperplane arrangement which do not delineate equivalence classes. Intuitively, the cells defined by the arrangement of the $kmn$ hyperplanes are a finer subdivision of $\Omega$ than that imposed by $\varphi(T)$. Thus the number of qualitatively different poses is bounded by $O(k^4 m^4 n^4)$.

To construct a provably correct and complete polynomial-time pose hypothesis algorithm for object localization we enumerate the set of pose equivalence classes by deriving them from this arrangement induced by the two feature sets and the uncertainty constraints. Each equivalence class is associated with a geometrically consistent set of feature matches, and so we simply select those consistent match sets of significant size. This is a simple illustration that the problem of determining geometrically consistent feature correspondences in the 2D case can be solved in polynomial time, and the solution is correct.

³Each data feature $d_j$ can have an arbitrary polygonal uncertainty region by considering a separate set $\{(n_l^j, \epsilon_l^j)\}$ of constraint parameters for each data feature.

⁴Arrangement is the computational-geometry term for the topological configuration of geometric objects like linear surfaces.

Good pose hypotheses are given by regions where the constraints on many feature matches are satisfied. We need some way of evaluating hypotheses (prior to verification) to determine how good they are. Intuitively we wish to know how much of the data is explained in terms of the model. Two possible measures are the size of the maximal match sets, and the size of the largest one-to-one matching contained in the maximal match set. We use an approximation to the latter consisting of the minimum of the number of distinct image features and distinct model features[16].
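The approximate measure just described is easy to state in code. This is a minimal sketch assuming a match set is stored as a collection of `(model_index, data_index)` pairs; the function name is hypothetical.

```python
def hypothesis_score(match_set):
    """Approximate the size of the largest one-to-one matching in a match set.

    Returns min(#distinct model features, #distinct data features), which is
    an upper bound on the size of any one-to-one matching inside the set.
    """
    model_ids = {i for i, _ in match_set}
    data_ids = {j for _, j in match_set}
    return min(len(model_ids), len(data_ids))
```

For example, the match set `{(0, 0), (0, 1), (1, 1), (2, 1)}` has four matches but only two distinct data features, so its score is 2 rather than 4.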

2.3  The  Case  of  3D  models  and  2D  data 

Of particular interest is the localization of 3D objects from 2D image data. We'll consider the case where the transformation consists of rigid 3D motion, scaling and orthographic projection, where a model point $p_{m_i}$ is transformed by $T[p_{m_i}] = S p_{m_i} + t$ with

$$S = \begin{bmatrix} s_{11} & s_{12} & s_{13} \\ s_{21} & s_{22} & s_{23} \end{bmatrix} = \sigma P R \quad \text{and} \quad P = \begin{bmatrix} 1 & 0 & 0 \\ 0 & 1 & 0 \end{bmatrix},$$

$R \in SO_3$, and $\sigma > 0 \in \mathbb{R}$. In the case of planar 3D objects this corresponds to the transformation $S = \begin{bmatrix} s_{11} & s_{12} \\ s_{21} & s_{22} \end{bmatrix}$, $s_{ij} \in \mathbb{R}$, describing the projection of all rotations and scalings of a planar 3D object. Baird[3] and Breuel[4] noted the linear formulation applies to any affine transformation. To exploit linear uncertainty constraints on the feasible transformations as before, we must have the property that the matrices $S$ form a vector space. This is true for $S = \begin{bmatrix} s_1 & -s_2 \\ s_2 & s_1 \end{bmatrix}$, $s_1, s_2 \in \mathbb{R}$, in the case of 2D models and 2D transformations; and for $S = \begin{bmatrix} s_{11} & s_{12} \\ s_{21} & s_{22} \end{bmatrix}$, $s_{ij} \in \mathbb{R}$, in the case of planar models and 3D transformations; but it is not true for general 3D models with $S = \sigma P R$ because the components of $S$ must satisfy $SS^T = \sigma^2 I$, where $I$ is the 2x2 identity matrix. To handle this case we adopt the following strategy. We compute equivalence classes in an extended transformation space $\hat{\Omega}$ which is a vector space containing the space $\Omega$ (see also [20]). After computing transformation equivalence classes we then restrict them back to the non-linear transformation space we are interested in. So consider the vector space $\hat{\mathcal{S}}$ of 2 x 3 matrices

$$\hat{S} = \begin{bmatrix} \hat{s}_{11} & \hat{s}_{12} & \hat{s}_{13} \\ \hat{s}_{21} & \hat{s}_{22} & \hat{s}_{23} \end{bmatrix} \in \hat{\mathcal{S}},$$

where $\hat{s}_{ij} \in \mathbb{R}$, and define $\hat{\Omega}$ to be the set of transformations $\hat{T} = (\hat{S}, t) \in \hat{\mathcal{S}} \times \mathbb{R}^2$ where $\hat{T}[p_{m_i}] = \hat{S} p_{m_i} + t$ as before. The set $\hat{\Omega}$ is isomorphic to $\mathbb{R}^8$. Again expressing the uncertainty regions in the form of linear constraints we have

$$(\hat{S} p_{m_i} + t - p_{d_j})^T n_l \le \epsilon_l \quad \text{for } l = 1, \dots, k,$$

or

$$\Big(\sum_{a,b} c_{ab}\, \hat{s}_{ab}\Big) + t_1 n_l^x + t_2 n_l^y \le p_{d_j}^T n_l + \epsilon_l \quad \text{for } l = 1, \dots, k,$$

where the constants $c_{ab}$ are functions of $p_{m_i}$ and $n_l$. This describes $k$ constraint hyperplanes in the linear, 8-dimensional transformation space $\hat{\Omega}$. The $kmn$ hyperplanes due to all feature matches again form an arrangement in $\hat{\Omega}$. Analogous to the 2D case there are $\Theta(k^8 m^8 n^8)$ elements in this arrangement.

For the case of 3D planar objects under the transformation $T[p_{m_i}] = S p_{m_i} + t$ with $S = \begin{bmatrix} s_{11} & s_{12} \\ s_{21} & s_{22} \end{bmatrix}$, the transformation space is 6-dimensional and there are $\Theta(k^6 m^6 n^6)$ elements in the arrangement. Note that the special case of planar rotation, translation, and scaling is a restriction of this transformation to those matrices $S$ satisfying $SS^T = \sigma^2 I$, which are the matrices of the form $S = \begin{bmatrix} s_1 & -s_2 \\ s_2 & s_1 \end{bmatrix}$, $s_1, s_2 \in \mathbb{R}$, shown before.


Define $\hat{\mathcal{F}}_{m_i,d_j} = \{\hat{T} \in \hat{\Omega} \mid \hat{T}[p_{m_i}] \in U_j^p\}$ and $\hat{\varphi}(\hat{T}) = \{(m_i, d_j) \mid \hat{T}[p_{m_i}] \in U_j^p\}$, which as before defines equivalence classes of transformations $\hat{E}_k$. The equivalence classes are the elements of the arrangement of the $mn$ convex polytopes $\hat{\mathcal{F}}_{m_i,d_j}$, which as described before is related to the arrangement of the constraint hyperplanes but of lower complexity; thus there are $O(k^8 m^8 n^8)$ equivalence classes $\hat{E}_k$.

To consider the case where the general 2x3 linear transformation $\hat{S}$ is restricted to the case $S$ of true 3D motion, scaling, and projection to the image plane, we simply eliminate elements of the arrangement which do not intersect the quadratic surface described by the constraints $\hat{S}\hat{S}^T = \sigma^2 I$. We still have $O(k^8 m^8 n^8)$ maximal match sets with the restricted transformation, although the equivalence classes are more complicated.
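The quadratic restriction can be made concrete with a small membership test. The sketch below only checks whether a single 2x3 matrix lies on the surface $\hat{S}\hat{S}^T = \sigma^2 I$ (orthogonal rows of equal, non-zero length); it does not perform the intersection of arrangement elements with that surface described above. The function name and tolerance are hypothetical.

```python
def is_scaled_orthographic(S, tol=1e-9):
    """True if the 2x3 matrix S satisfies S S^T = sigma^2 I for some sigma > 0.

    S is given as two rows of three numbers. The condition holds exactly when
    the rows are orthogonal and have the same non-zero squared length.
    """
    r1, r2 = S

    def dot(u, v):
        return sum(a * b for a, b in zip(u, v))

    n1, n2, cross = dot(r1, r1), dot(r2, r2), dot(r1, r2)
    scale = max(1.0, n1)
    return n1 > tol and abs(n1 - n2) <= tol * scale and abs(cross) <= tol * scale
```

A matrix such as `[(2, 0, 0), (0, 2, 0)]` passes the test with $\sigma = 2$, whereas `[(1, 0, 0), (0, 2, 0)]` is a valid point of the extended space but not a true scaled orthographic projection.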

As was shown in [7], these same ideas apply to cases using non-linear uncertainty constraints, such as circles. The basic idea is the same; however, in these cases we must analyze an arrangement of quadratic surfaces, which is computationally more difficult.


3  Efficiently  Exploring  Pose 
Equivalence  Classes 

We see from the previous analysis that there is only a polynomial-sized set of qualitatively different model poses. A simple algorithm for object localization consists of constructing the transformation equivalence classes, or representative points of them, from the arrangement of constraints in transformation space. For example, a short-cut is to simply compute the vertices of the arrangement of the hyperplanes, rather than the entire arrangement. This is essentially the approach taken by Cass[7] where the constraints were quadratic. Every element of the arrangement is bounded by some vertex, thus by analyzing the set of vertices we can form a set of representative transformations covering all equivalence classes.

Unfortunately this straightforward approach is impractical because of the complexity of computing either the arrangement or representatives of the elements of the arrangement. So the idea is to develop algorithms to explore the arrangement in an efficient way, finding those regions of transformation space associated with large maximal match sets, i.e. transformations satisfying a large number of constraints, but without explicitly constructing or exploring the entire arrangement if possible. Developing efficient methods for exploring pose equivalence classes to find interesting ones now becomes the focus of algorithms based on this formalization, which is really a problem in computational geometry.

Because the underlying problem is that of pairwise matching the local features of a rigid object with fixed data features, there is considerable structure to the problem. First, when considering the formulation with linear constraints, we do not have an arbitrary arrangement of $kmn$ hyperplanes in $\Omega$; rather, each feature match is associated with a convex polytope $\mathcal{F}_{m_i,d_j}$ in $\Omega$ describing the set of feasible transformations. The arrangement in $\Omega$ in which we are interested is the arrangement of these $mn$ polytopes.

The main point regarding the structure of the constraints is that the complexity of the arrangement of feasible polytopes depends on how well the model matches the data. The more constraints that can be simultaneously satisfied, the higher the complexity of their arrangement is in transformation space. So the complexity cannot get arbitrarily close to the upper bound unless the model matches most of the data at some transformation, a case in which we can't resolve the data anyway because it is swamped in uncertainty. In typical cases of interest we would expect the model features to match a large number of data features in only a small number of poses, so the complexity of the arrangement is hopefully much less than the upper bound.

Another important aspect of the geometric matching problem in this domain is that many of the data features can be spurious, and thus the matches of model features with spurious data features generate spurious constraints in transformation space, adding complexity to the partition of transformation space. If it is unlikely that many spurious matches are geometrically consistent with the model then the complexity of the arrangement due to spurious matches will be low.

To exploit the inherent structure of the geometric matching problem, the conceptual and algorithmic approach we take is to decompose a transformation into its scaled-rotational component and its translational component, and explore projections of the constraints. The partition of $\Omega$ we seek is then composed of separate but inter-dependent partitions of the rotation space and the translation space. The following sections discuss this approach.

4  Decomposing  Translation  and 
Scaled-Rotation 

We'll consider in some detail the case of planar 2D transformations to illustrate these ideas. We're trying to develop a method of exploring the pose equivalence classes without explicitly constructing them all. The approach we have taken is to explore two interdependent projections of the entire space: the projection of the constraints in $\Omega$ onto the $t_1$-$t_2$ plane for a fixed rotation $S$, and the projection onto the $s_1$-$s_2$ plane of certain faces, edges, and vertices of the constraint hyperplane arrangement. The algorithm we develop is based on finding maximal match sets, rather than explicitly constructing transformation equivalence classes. The following provides a brief overview of the approach.

Figure 2: Left: Model features (open circles) after some arbitrary rotation, data features (filled circles), and uncertainty regions. Right: One set of translation constraints, for each possible feature match $(m_i, d_j)$ for the features shown on the left, for the case of the particular rotation applied to the model as shown on the left. Each square, $C_{ij}$, on the right is the conjunction of constraints for a single feature match, representing those translations feasible for that match and the particular rotation.

Suppose we fix the rotational part, $S$, of the transformation. For fixed $S$ the uncertainty constraints impose constraints on feasible translations. This imposes an arrangement of constraints in the $t_1$-$t_2$ plane as shown in figure 2. Each cell of this 2D arrangement defines a maximal match set according to the match constraints consistent with the translations in the cell. So for fixed rotation $S$ we can read off all the maximal match sets feasible at $S$.

Now, as we vary $S$ the constraints in the $t_1$-$t_2$ plane change, and the set of feasible maximal match sets changes. Intuitively, if we vary over all rotations $S$ we can collect all feasible maximal match sets. As we'll show, however, there are equivalence classes of rotation $S$ and so we only have to vary $S$ over a set of representatives of these equivalence classes. The idea is that the projection of the transformation constraints onto the $s_1$-$s_2$ plane partitions the rotation space into equivalence classes. Each rotational equivalence class is associated with a set of topologically equivalent projections of the constraints onto the $t_1$-$t_2$ plane (for fixed $S$). We'll next look at this in more detail.

4.1 Translational Equivalence Classes

Consider a specific match in the 2D case between a model point $p_{m_i}$ and a data point $p_{d_j}$. A transformation $T[p_{m_i}] = S p_{m_i} + t$ on the model feature is feasible if $S p_{m_i} + t \in U_j$. Define $C_{ij}(S) = \{t \mid S p_{m_i} + t \in U_j\}$ to be the set of feasible translations for $(m_i, d_j)$ for any given rotation $S$. It is easy to see that $C_{ij}(S)$ is equivalent to the region $U_j$ translated by $(-S p_{m_i})$. For a fixed rotation $S$, define the function $\varphi_S(t) = \{(m_i, d_j) \mid S p_{m_i} + t \in U_j\}$. Just as before, this function partitions the space of translations into equivalence classes $E_k^S$, where $t \equiv t' \iff \varphi_S(t) = \varphi_S(t')$.

The point of this decomposition of rotation and translation is that a particular rotation $S$ determines an arrangement of constraints in the $t_1$-$t_2$ plane of all constraint regions $C_{ij}(S)$ as shown in figure 2. The elements of this arrangement formed by the boundary curves of the sets $C_{ij}(S)$ define the equivalence classes $E_k^S$. The cells of this arrangement allow us to compute the maximal match sets directly: for each cell the matches in the maximal match set are given by the set of feasible translations $C_{ij}(S)$ which intersect the cell.
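For a fixed rotation, the translation-plane arrangement can be explored with a short sketch. It assumes point features and axis-aligned square uncertainty regions of half-width `eps`, so that each $C_{ij}(S)$ is a square centered at $p_{d_j} - S p_{m_i}$. Rather than building the arrangement, it relies on the fact that any non-empty intersection of closed axis-aligned boxes contains the point formed from the largest lower bounds in each coordinate, and it keeps only match sets not contained in a larger one. The function names are hypothetical, and this brute-force enumeration is an illustration, not the algorithm developed in the paper.

```python
def translation_boxes(s, model, data, eps):
    """C_ij(S) for every match: (i, j, xlo, xhi, ylo, yhi)."""
    s1, s2 = s
    boxes = []
    for i, (x, y) in enumerate(model):
        rx, ry = s1 * x - s2 * y, s2 * x + s1 * y  # S p_mi
        for j, (dx, dy) in enumerate(data):
            cx, cy = dx - rx, dy - ry               # center of C_ij(S)
            boxes.append((i, j, cx - eps, cx + eps, cy - eps, cy + eps))
    return boxes


def maximal_match_sets(s, model, data, eps):
    """Inclusion-maximal match sets feasible at the fixed rotation s."""
    boxes = translation_boxes(s, model, data, eps)
    xs = {b[2] for b in boxes}  # candidate t1 values: lower x bounds
    ys = {b[4] for b in boxes}  # candidate t2 values: lower y bounds
    found = set()
    for tx in xs:
        for ty in ys:
            m = frozenset((i, j) for i, j, xlo, xhi, ylo, yhi in boxes
                          if xlo <= tx <= xhi and ylo <= ty <= yhi)
            if m:
                found.add(m)
    # Discard any set that is a strict subset of another set found.
    return [m for m in found if not any(m < other for other in found)]
```

With three model points that are an exact translate of three of the data points, the largest set returned at the identity rotation contains the three correct matches, and every other match appears only in a singleton set.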

4.2  Topological  Equivalence  Classes  of 
Rotation 

The arrangement of the translation constraints in the $t_1$-$t_2$ plane is a function of rotation $S$. As we vary $S$, each time the topology of the $t_1$-$t_2$ arrangement changes, some new set of translation constraints becomes satisfied or an old set ceases to be satisfied. It is here where new maximal match sets appear or disappear. Intuitively the idea is to vary over all rotations and enumerate the set of all maximal match sets as they appear. We can do this exploration of rotation space in an orderly way by noting that there are equivalence classes of rotation within which the topology of the $t_1$-$t_2$ plane arrangement doesn't change, and thus the set of maximal match sets represented doesn't change. This means that there are equivalence classes of the rotation parameter $S$ within which the same topological arrangement of the translation constraints $C_{ij}$ is imposed in the $t_1$-$t_2$ plane, and the same set of maximal match sets is feasible. So two different rotational equivalence classes are associated with different maximal consistent match sets, although they may have most maximal match sets in common. Figure 3 shows the translation space arrangement for rotations in different rotational equivalence classes. We uncover all maximal match sets by exploring all rotations $S$, and collecting the different maximal match sets as they appear in the $t_1$-$t_2$ plane arrangement, but because there are equivalence classes of rotations, we need only consider a representative of each of these in this search.
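One ingredient of this dependence on rotation can be sketched directly: whether two matches admit a common translation depends only on $S$. The sketch assumes square uncertainty regions of half-width `eps`, in which case the squares $C_{ij}(S)$ and $C_{kl}(S)$ overlap exactly when their centers differ by at most $2\,\mathrm{eps}$ in each coordinate. The function names, and the choice of sampling unit-scale rotations at one-degree steps, are hypothetical and purely illustrative.

```python
import math


def common_translation_exists(s, m_i, d_j, m_k, d_l, eps):
    """True if some t places S m_i + t in U_j and S m_k + t in U_l."""
    s1, s2 = s
    ux, uy = m_i[0] - m_k[0], m_i[1] - m_k[1]      # m_i - m_k
    rx, ry = s1 * ux - s2 * uy, s2 * ux + s1 * uy  # S (m_i - m_k)
    gx, gy = d_j[0] - d_l[0], d_j[1] - d_l[1]      # d_j - d_l
    # Difference between the centers of C_ij(S) and C_kl(S).
    return abs(gx - rx) <= 2 * eps and abs(gy - ry) <= 2 * eps


def feasible_angles(m_i, d_j, m_k, d_l, eps, steps=360):
    """Sampled unit-scale rotation angles (degrees) where both matches can hold."""
    angles = []
    for a in range(steps):
        theta = 2 * math.pi * a / steps
        s = (math.cos(theta), math.sin(theta))
        if common_translation_exists(s, m_i, d_j, m_k, d_l, eps):
            angles.append(360.0 * a / steps)
    return angles
```

The test is linear in $(s_1, s_2)$, so for a pair of matches the rotations passing it form a bounded region of the rotation plane; the boundaries of such regions are where match sets appear or disappear as $S$ varies.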

The rotational equivalence classes are formed by projecting
the constraints in the full transformation space
onto the $s_1$-$s_2$ rotation plane. Recall that for each cell
of the constraint hyperplane arrangement in the 4-dimensional
transformation space a particular set of match
constraints is satisfied. By moving across, say, a face
or edge of one of these cells we change the set of constraints
satisfied. Now consider the projection of the 4-dimensional
transformation space onto the 2-dimensional rotation
space, the $s_1$-$s_2$ plane. For an arbitrary fixed point
$s$ in this plane, consider the set of match constraints
satisfied. For a single match, at the fixed rotation $s$,
there is always some translation for which the constraints
for the match are satisfied. This is consistent with the
fact that the projection of the constraints for the match
from the full 4D transformation space onto the $s_1$-$s_2$
plane covers the whole plane. In contrast, consider the
intersection of constraints for two different matches. The
projection of their mutually feasible transformation region
onto the $s_1$-$s_2$ plane does not cover the entire plane:
there are some scaled rotations $s$ for which there is some
translation satisfying the constraints for both matches,
and there are some rotations for which there is no translation
which will satisfy the constraints. These regions
of the $s_1$-$s_2$ plane are delineated by projections of the
constraints for the two matches onto the plane.

Consider the intersection of 3 constraint hyperplanes
due to 2 or 3 matches (two constraints from one match
and one constraint from a second match, or one constraint
from each of three matches). This is a line in
$R^4$. Its projection to the $s_1$-$s_2$ plane is a line delineating
the rotations associated with two different topological
arrangements of the translation constraints for the associated
feature matches. For example, if this projected
line is due to the coincidence of constraints shown in
figure 4, then for rotations on one side of the line it is
possible to satisfy all the constraints (for some translation),
and for rotations on the other it is not.

The idea is that the projection of the intersection of
the constraints onto the $s_1$-$s_2$ plane partitions the rotation
space into equivalence classes. Within each equivalence
class the topological arrangement of constraints in
the $t_1$-$t_2$ translation plane is the same. Figure 3 has panels
showing the projection of constraints onto the $s_1$-$s_2$
plane for the case of point features and square uncertainty
regions (like those shown in figure 2). In this
figure, each line segment in the $s_1$-$s_2$ plane is the projection
of the intersection of the constraint hyperplanes
for 2 feature matches. The figure is composed of several
squares of different orientations arranged around the
origin; the circle centered at the origin will be explained
later. Each square is associated with a different
pair of matches. The rotations inside each square are
the rotations for which it is possible to align the two
matches for some translation. There are several geometric
events in the $t_1$-$t_2$ plane associated with the projected
constraint lines in the $s_1$-$s_2$ plane. An example is
shown in figure 4.

To construct the partition of the $s_1$-$s_2$ plane we simply
consider all pairs and triples of feature matches, and
project to the $s_1$-$s_2$ plane the points in transformation
space where the constraints intersect which are associated
with a boundary of their region of mutual feasibility.
This forms the partition of rotation space. Points which
lie in the boundary of the projected region of feasibility
for pairs or triples of matches are what we are interested
in most. These are points of transformation space where
the satisfiability of the constraints for the pair or triple of
matches changes. An approximation we make is to
project only these elements of the arrangement. The others
are not associated with changes in feasibility, but just
the topology of the constraints.

Figure 3: Two snapshots of rotation space and translation
space for particular model and data sets of 3 features each, at
different rotations of the model. For each pair of shots, one
shows the $s_1$-$s_2$ plane position constraints and a dot on the
$\sigma$-circle at the rotation applied, and the other shows the
associated set of $t_1$-$t_2$ plane constraints.

Figure 4: The translation constraints for 3 different feature
matches shown for each of 3 different model rotations.
For the rotation producing the left panel, there is no feasible
translation for all 3 matches, while there is for the rotation
producing the right panel. The middle panel is the configuration
produced by rotations which lie on the intersection of
3 hyperplane constraints in transformation space.

5 Algorithms for Computing Maximal
Match Sets


In this section we describe algorithms for computing
maximal match sets and pose representatives forming
pose hypotheses. First we consider a simple case of unoriented
point features with square uncertainty regions
and known scale factor $\sigma$, and later show how to extend
to more general cases.

The basic approach can be summarized as follows. Assume
there are $n$ data features and $m$ model features.
Given a set of model features and a set of data features
and their associated uncertainty squares, form the set
of all model-data feature matches. Each feature match
$(m_i, d_j)$ defines a region of feasible translations $C_{ij}(S)$ in
the $t_1$-$t_2$ plane which is a function of the rotation $S$ applied
to the model feature. Each pair of feature matches
and the associated regions $C_{ij}$ and $C_{kl}$ contribute to the
partition of the $s_1$-$s_2$ plane into equivalence classes. By
considering all such pairs of matches we can construct
the complete partition of rotation space into equivalence
classes. Each such equivalence class, or cell of the arrangement
in the $s_1$-$s_2$ plane, is associated with an arrangement
of the sets $C_{ij}$ partitioning the $t_1$-$t_2$ plane,
and thus with a particular set of maximal match sets.
In passing from one rotational equivalence class to an
adjacent one, a local change occurs in the induced
arrangement of translation constraints. We only need
to analyze the part that changes, and the new maximal
match sets formed, rather than re-analyzing the entire
$t_1$-$t_2$ plane. To construct maximal match sets we step
through the cells of the $s_1$-$s_2$ arrangement, analyzing
the local incremental changes in the arrangement of
constraints in the translation space. We simply enumerate
the maximal match sets as they appear. We have two
things to consider: analysis of the rotation space and the
associated analysis of translation space.
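
The relationship between a match, a rotation, and its feasible translation square can be made concrete with a short sketch. It assumes the usual scaled-rotation parameterization, in which $s = (s_1, s_2)$ maps a point $(x, y)$ to $(s_1 x - s_2 y,\ s_2 x + s_1 y)$, and an uncertainty square of half-width `eps` around each data point; the function names are illustrative and do not come from the paper.

```python
from typing import Tuple

Point = Tuple[float, float]
Square = Tuple[float, float, float]  # (centre x, centre y, half-width)


def apply_scaled_rotation(s: Point, p: Point) -> Point:
    """Apply the scaled rotation s = (s1, s2) to the point p (assumed form)."""
    s1, s2 = s
    x, y = p
    return (s1 * x - s2 * y, s2 * x + s1 * y)


def translation_square(model_pt: Point, data_pt: Point, s: Point, eps: float) -> Square:
    """Feasible translations for one match at rotation s.

    The transformed model point must land within eps (in each coordinate)
    of the data point, so t lies in a square centred at d - S*m.
    """
    rx, ry = apply_scaled_rotation(s, model_pt)
    return (data_pt[0] - rx, data_pt[1] - ry, eps)


def squares_overlap(a: Square, b: Square) -> bool:
    """True when two closed isothetic squares share at least one point."""
    reach = a[2] + b[2]
    return abs(a[0] - b[0]) <= reach and abs(a[1] - b[1]) <= reach


def pair_feasible(match_a, match_b, s: Point, eps: float) -> bool:
    """True when some translation satisfies both matches at rotation s."""
    sq_a = translation_square(match_a[0], match_a[1], s, eps)
    sq_b = translation_square(match_b[0], match_b[1], s, eps)
    return squares_overlap(sq_a, sq_b)
```

Under these assumptions, sweeping `s` around the $\sigma$-circle and watching where `pair_feasible` flips gives the critical rotations contributed by that pair of matches.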

5.1 Analyzing the Partition of Translation
Space

The geometric algorithms we develop to compute maximal
match sets rely on the analysis of the two planar
partitions of the scaled-rotation space and the translation
space and the relationships between them. First
consider the translation space. For some fixed rotation
$S$ applied to all model features, the space of translations
is partitioned by the arrangement of feasible translation
constraints $C_{ij}(S)$ for all feature matches $(m_i, d_j)$. See
figures 2 and 3. The basic component of the $t_1$-$t_2$ analysis
is to compute all the elements of this arrangement,
which is formed by $nm$ isothetic squares in the
$t_1$-$t_2$ plane. The $K$ intersecting pairs of $P$ isothetic rectangles
in the plane can be found in time $O(P \lg P + K)$. A fairly
simple plane sweep algorithm using a static interval tree
is described in [18]. We can extend the intersection reporting
algorithm to return the set of squares $C_{ij}$ containing
each of the cells associated with each intersection
point. This allows determination of the maximal match
set associated with each cell. If we only report the size
of the maximal match sets, this can be computed by the
intersection reporting algorithm in $O(P \lg P + K)$ [9].
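
As a rough illustration of intersection reporting for isothetic squares, the sketch below sweeps a vertical line across the plane and compares each newly opened square against the currently active ones. It is a simplified stand-in, not the interval-tree algorithm of [18]: the active set is scanned linearly, so the worst case is quadratic rather than $O(P \lg P + K)$. The `(cx, cy, h)` representation (centre and half-width) is an assumption made for the example.

```python
def intersecting_pairs(squares):
    """Report index pairs (i, j), i < j, of closed isothetic squares that intersect.

    squares: list of (cx, cy, h) tuples giving centre and half-width.
    """
    events = []
    for idx, (cx, cy, h) in enumerate(squares):
        events.append((cx - h, 0, idx))  # kind 0: left edge (open)
        events.append((cx + h, 1, idx))  # kind 1: right edge (close)
    # Opens sort before closes at equal x, so touching squares are reported.
    events.sort()

    active = set()
    pairs = []
    for _, kind, idx in events:
        if kind == 0:
            _, cy, h = squares[idx]
            for other in active:
                _, oy, oh = squares[other]
                # x-extents already overlap; check the y-extents.
                if abs(cy - oy) <= h + oh:
                    pairs.append((min(idx, other), max(idx, other)))
            active.add(idx)
        else:
            active.remove(idx)
    return sorted(pairs)
```

Replacing the linear scan of `active` with an interval tree keyed on the y-extents is what would bring the cost down to the bound quoted above.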

When we change the rotation considered to an adjacent
rotational equivalence class we have an incremental
change in the constraints in the $t_1$-$t_2$ plane, which in
this case involves the topological configuration between
translation constraints for two matches. When we analyze
this incremental change between two squares $C_{ij}$
and $C_{kl}$ as shown in figure 5, if there are $k$ other squares
intersecting both $C_{ij}$ and $C_{kl}$ we can update the change
in match sets in $O(k \lg k)$ time. To facilitate these operations
we require a dynamic data structure maintaining,
for each square $C_{ij}$, the set of other squares it intersects.

6  Analyzing  the  Partition  of  Rotation 
Space 

An important special case of the object localization problem
is when the scale, $\sigma$, of the object in the sensory data
is known. In this case the rotation space consists of rotations
lying on a circle of radius $\sigma$ in the $s_1$-$s_2$ plane. Call this
the $\sigma$-circle. The equivalence classes of rotation consist
of circular arcs on the $\sigma$-circle. The critical rotation angles
delineating the equivalence classes are those points
where the $\sigma$-circle intersects the boundary curve of an
equivalence class in the $s_1$-$s_2$ plane. We simply superimpose
the $\sigma$-circle over the partition of rotation space
constructed in the previous sections. See figure 3. Exploration
of rotation space amounts to moving along the
$\sigma$-circle in order of increasing angle. Each time a single
line segment is crossed, exactly one topological change
occurs between the regions $C_{ij}$ in translation space.*

* Note that there is a relatively rare event where more
than one segment intersects the $\sigma$-circle at the same point;


An algorithm for finding representatives of equivalence
classes of transformation and a measure of the
size of the associated maximal match sets is as follows.
Compute the positions of all feasible translation squares
$C_{ij}(S_0)$ for some initial rotation, say $\theta = 0$. Perform
an analysis of the arrangement of these squares in the
translation plane as described in section 5.1. Because
there are $nm$ isothetic squares, this operation takes time
$O(nm \lg mn + K)$ where $K$ is the total number of intersecting
pairs of squares. Next we perform an incremental
analysis of the topological changes occurring in translation
space as we step through the rotation space. There
are $O(m^2n^2)$ line segments in the $s_1$-$s_2$ plane intersecting
the $\sigma$-circle, forming the partition of rotation space.
For simplicity we assume that all intersection points of
the line segments and the $\sigma$-circle are distinct. Each
intersection point is a critical rotation angle delineating a
rotation equivalence class. Sort the critical rotations by
angle in $O(n^2m^2 \lg mn)$ time. Step through the critical
rotations in order, performing the dynamic incremental
analysis of the topological changes in the $t_1$-$t_2$ plane,
and record new match sets that are created. Rather
than the match sets and the pose equivalence classes, we
will record a representative of the transformation equivalence
class and a measure of the size of the match set. As
noted in section 5.1, for the critical rotation $S_i$ this
incremental analysis requires time $O(k_i \lg k_i)$, where $k_i$ is
the number of squares intersecting both squares associated
with the critical rotation angle $S_i$. Thus the entire
analysis requires time
$O(n^2m^2 \lg mn) + O(nm \lg mn + K) + \sum_i O(k_i \lg k_i)$.
To get a sense of this complexity,
if we assume that it is unlikely that many incorrectly
matched features will be simultaneously geometrically
consistent, then we can approximate $k \approx m$.
We justify this assumption by experiment as shown in
section 7. Noting that $K = O(n^2m^2)$, the above
expression reduces to $O(n^2m^3 \lg m)$ for the localization
process. We can improve this bound as described in the
following sections.
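
The step "find the critical rotation angles and sort them" can be sketched as follows, assuming the boundary segments in the $s_1$-$s_2$ plane have already been computed and are given as pairs of endpoints. The helper names are hypothetical; the geometry is the standard segment/circle intersection.

```python
import math


def segment_circle_angles(p, q, sigma):
    """Angles in [0, 2*pi) where segment pq meets the circle of radius sigma
    centred at the origin."""
    px, py = p
    qx, qy = q
    dx, dy = qx - px, qy - py
    a = dx * dx + dy * dy
    if a == 0.0:
        return []  # degenerate segment
    b = 2.0 * (px * dx + py * dy)
    c = px * px + py * py - sigma * sigma
    disc = b * b - 4.0 * a * c
    if disc < 0.0:
        return []  # supporting line misses the circle
    root = math.sqrt(disc)
    angles = []
    # A set removes the duplicate parameter in the tangent case.
    for u in {(-b - root) / (2.0 * a), (-b + root) / (2.0 * a)}:
        if 0.0 <= u <= 1.0:  # keep only hits on the segment itself
            x, y = px + u * dx, py + u * dy
            angles.append(math.atan2(y, x) % (2.0 * math.pi))
    return sorted(angles)


def critical_rotations(segments, sigma):
    """Sorted list of (angle, segment index) events along the sigma-circle."""
    events = []
    for idx, (p, q) in enumerate(segments):
        for angle in segment_circle_angles(p, q, sigma):
            events.append((angle, idx))
    events.sort()
    return events
```

Stepping through the returned events in order corresponds to moving along the $\sigma$-circle by increasing angle; the segment index identifies which pair of matches changes state at each event.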

6.1 A Randomized Algorithm

We can use a randomized algorithm to get an expected
time speedup by exploiting the structure
of the matching problem. The idea is to explore the
equivalence classes in a certain order, and in a way that
allows us to make approximations that let us spend most
computational effort looking only at places in transformation
space where there is a high degree of geometric
consistency.

Assume we have a base match between a model feature
and a data feature, and that we compute only the
equivalence classes of rotation associated with matches
paired with the base match [17]. There are $O(mn)$ other
matches and so there are $O(mn)$ critical rotations. We
compute this subset of the critical rotation angles, and
explore this sub-partition of the rotation space as before.
For each of these rotational equivalence classes we
analyze the $t_1$-$t_2$ plane constraints $C_{ij}$ which are consistent
with the base match, i.e. for which $C_{ij}$ and


this  requires  more  care  to  handle,  but  is  straightforward. 




Figure 5: The incremental event in the $t_1$-$t_2$ plane to be
analyzed at each critical rotation due to position constraints.
The critical rotation point in the $s_1$-$s_2$ plane is associated
with the special alignment of constraints such that a pair
of matches just becomes feasible. The bold squares are the
constraints for the two matches associated with the critical
rotation. New match sets are formed by the new cells created
by their overlap with the $k$ other squares affected. The
dotted squares are those not involved in the analysis because
no topological changes occur with them.


the base match constraint overlap. The idea is to randomly
choose the base matches [12]. A simple argument shows that
the expected number of randomly chosen single base matches
we must choose before we find a correct one is $O(n)$. Analyzing
the sub-partition of rotation space due to a correct base
match does not mean we will find the rotation where all
correct match constraints are satisfied, but it does mean
that the correct matches will be weakly consistent with
the base match, meaning there is a rotation class and
critical point in the sub-partition at which each correct
match constraint $C_{ij}$ is consistent with the base match
constraint, but where they are not necessarily all
mutually consistent. However, it is very likely that we
will find a rotation in this way where most if not all of the
correct match constraints are satisfied. In any case, we
are guaranteed to eventually find it if we look through
all possible base matches.
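
One way to see the $O(n)$ figure is the following back-of-envelope calculation. It assumes base matches are drawn uniformly with replacement from the $mn$ candidates and that $c \le m$ of them are correct; the symbol $c$ is introduced here for illustration only.

```latex
% Assumption: c of the mn candidate matches are correct, with c <= m,
% and base matches are drawn uniformly and independently.
p = \Pr[\text{a drawn match is correct}] = \frac{c}{mn},
\qquad
\mathbb{E}[\text{draws until the first correct match}] = \frac{1}{p} = \frac{mn}{c}.
% If every model feature is present in the data (c = m), this equals n.
```

With heavy occlusion $c$ is smaller than $m$ and the expected number of draws grows accordingly.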

The notion of weak consistency as just defined can
be used to define a conservative consistency estimate.
The number of match constraints $C_{ij}$ consistent with
the base match constraints never underestimates
the size of the largest maximal match set composed of
those matches. Thus this can be used to prune away
parts of the transformation space where detailed translation
plane analysis is unnecessary. We'll analyze in
more detail the complexity of this approach in section 7.
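
The conservative estimate can be illustrated with a small sketch that compares the weak-consistency count against the true size of the largest mutually consistent set containing the base match, for a fixed rotation. Squares are given as `(cx, cy, h)` tuples (centre and half-width), which is an assumed representation; the exact count is found by brute force over candidate corner points, not by the incremental analysis described in the text.

```python
def contains(sq, x, y):
    """True when the closed square sq = (cx, cy, h) contains the point (x, y)."""
    cx, cy, h = sq
    return abs(x - cx) <= h and abs(y - cy) <= h


def overlaps(a, b):
    """True when two closed isothetic squares intersect."""
    reach = a[2] + b[2]
    return abs(a[0] - b[0]) <= reach and abs(a[1] - b[1]) <= reach


def weak_consistency_count(base_sq, others):
    """Number of squares that individually overlap the base square."""
    return sum(1 for sq in others if overlaps(base_sq, sq))


def largest_mutual_set_with_base(base_sq, others):
    """Largest number of squares sharing a common point inside the base square.

    A non-empty intersection of isothetic squares has its lower-left corner
    at (some left edge, some bottom edge), so testing those points suffices.
    """
    everything = others + [base_sq]
    xs = [sq[0] - sq[2] for sq in everything]
    ys = [sq[1] - sq[2] for sq in everything]
    best = 0
    for x in xs:
        for y in ys:
            if contains(base_sq, x, y):
                best = max(best, sum(1 for sq in others if contains(sq, x, y)))
    return best
```

For example, two squares lying on opposite sides of the base square can each overlap it without overlapping one another, so the weak count is 2 while the largest mutually consistent set has size 1; the weak count is an upper bound, which is what makes it safe for pruning.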

6.2 Generalizing the Algorithm

If we associate orientations with features, the true orientation
for data feature $d_j$ falls within an uncertainty
region $U^{\theta}_j$ which we will consider a range of angles
$U^{\theta}_j = [\theta_{d_j} - \delta,\ \theta_{d_j} + \delta]$, where $\delta$ is the bound on
orientation uncertainty and $\theta_{d_j}$ is the measured orientation
of $d_j$. We can express the angle constraints in terms of
linear constraints on the transformation space: each match
contributes two linear inequalities whose normal vectors
are normal, respectively, to the two lines through the
origin at angles $\theta_{d_j} + \delta$ and $\theta_{d_j} - \delta$ in the image plane.
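
For the known-scale case, the angle constraint on a single match reduces to an interval test on the rotation angle. The sketch below assumes orientations are stored as angles in radians and that a rotation by `rotation` simply adds to the model feature's orientation; both the representation and the function names are assumptions made for illustration.

```python
import math


def angle_diff(a, b):
    """Signed difference a - b wrapped into (-pi, pi]."""
    d = (a - b) % (2.0 * math.pi)
    if d > math.pi:
        d -= 2.0 * math.pi
    return d


def orientation_feasible(theta_model, theta_data, rotation, delta):
    """True when the rotated model orientation lies within delta of the
    measured data orientation."""
    return abs(angle_diff(theta_model + rotation, theta_data)) <= delta


def feasible_rotation_interval(theta_model, theta_data, delta):
    """Endpoints (lo, hi), each in [0, 2*pi), of the arc of rotations that
    satisfy the angle constraint; the arc runs counter-clockwise from lo to hi."""
    lo = (theta_data - delta - theta_model) % (2.0 * math.pi)
    hi = (theta_data + delta - theta_model) % (2.0 * math.pi)
    return lo, hi
```

The two endpoints returned by `feasible_rotation_interval` are the critical rotations at which the match's translation region appears and disappears.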

Incorporating angle constraints actually complicates
the algorithm described in section 6 for analyzing incremental
topological changes in translation space constraints.
When we include angle constraints, crossing
into another rotational equivalence class by crossing an
angle constraint has the effect of a region $C_{ij}$ suddenly
appearing or disappearing in the $t_1$-$t_2$ plane, as the angle
constraints on the associated match change state.
Analyzing this incremental change in arrangement requires
time $O(k \lg k + k^2)$, where again $k$ is the number
of other match translation constraint regions affected by
the change. In practice we get a huge speedup by including
angle constraints, because this reduces the number
of constraints which interact with one another; however,
with the simple algorithm we have outlined, considering
angle constraints adds to the asymptotic complexity.

In the case of unknown scale $\sigma$ we must consider the
entire partition of the $s_1$-$s_2$ plane to compute all equivalence
classes of transformation, not just the partition of
the $\sigma$-circle. Each element of this arrangement is a rotational
equivalence class and is associated with a partition
of translation space as above. However, because this is a
projection, the complexity of analyzing the $s_1$-$s_2$ plane
is not asymptotically the square of the number of line
segments, but less.

We note that we can easily generalize the algorithm
to handle other features such as line and curve segments
by defining the appropriate uncertainty regions. We can
also include the uncertainty bounds $\epsilon$ and $\delta$ among the
unknown parameters.

7  Experimental  Implementations 

Two classes of experiments have been
conducted: experiments with real images to demonstrate
the practical application of the technique, and carefully
controlled experiments on synthetically generated data
to investigate the computational complexity of the implemented
algorithm empirically. In both cases we considered
rigid planar motion (known scale), 2D model and
data features consisting of 2D points and an associated
unit direction vector, and isothetic square uncertainty
regions. In the case of real images these were derived
from points uniformly sampled on the boundary contour
of the objects. In the synthetic experiments models
were collections of randomly generated features and data
were constructed by randomly transforming the model
features, randomly perturbing their position and orientation
to simulate sensor error, and adding spurious features.
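
A synthetic data set of the kind described could be generated along the following lines. This is only a sketch: the sampling ranges, the uniform perturbation model, the use of an angle in place of a unit direction vector, and all parameter names are assumptions, not details taken from the experiments.

```python
import math
import random


def make_synthetic(m, n_spurious, eps, delta, extent=100.0, seed=0):
    """Return (model, data, pose) for a random rigid planar motion.

    Features are (x, y, orientation) triples. The first m data features are
    perturbed images of the model features; the rest are spurious.
    """
    rng = random.Random(seed)

    def random_feature():
        return (rng.uniform(-extent, extent),
                rng.uniform(-extent, extent),
                rng.uniform(0.0, 2.0 * math.pi))

    model = [random_feature() for _ in range(m)]

    # Random rigid motion (known scale of 1).
    theta = rng.uniform(0.0, 2.0 * math.pi)
    tx = rng.uniform(-extent, extent)
    ty = rng.uniform(-extent, extent)
    c, s = math.cos(theta), math.sin(theta)

    data = []
    for x, y, a in model:
        # Position error is bounded by eps per coordinate, orientation by delta.
        dx = c * x - s * y + tx + rng.uniform(-eps, eps)
        dy = s * x + c * y + ty + rng.uniform(-eps, eps)
        da = (a + theta + rng.uniform(-delta, delta)) % (2.0 * math.pi)
        data.append((dx, dy, da))

    data.extend(random_feature() for _ in range(n_spurious))
    return model, data, (theta, tx, ty)
```

Returning the true pose alongside the data makes it straightforward to check whether a localization run recovers it.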

The  real  images  used  were  quite  difficult  in  terms  of 
occlusion  and  clutter.  The  algorithm  was  applied  to 
about  20  different  images,  and  successfully  located  the 
object  in  all  but  the  most  extremely  occluded  cases.  Ex¬ 
amples  of  these  images  are  shown  in  figure  7. 

Figure 6: Graph 1: Critical rotations vs. data set size for
full angle constraints (squares) and partial angle constraints
(open circles). Graph 3: Critical rotations vs. data set size
for no angle constraints. Graph 4: Mean values of $k$ (circles)
and $k^2$ (squares) vs. data set size, for the case of full angle
constraints (the case of partial angle constraints is similar).
Graph 6: Mean (circles) and Max (squares) values of $k$ vs.
model set size for the case of full angle constraints. The cases of
partial and no angle constraints are similar. The uncertainty
used was $\epsilon = 8$ and $\delta$ =

The synthetic experiments were designed to determine
the amount of work required to perform localization. Recall
that the basic algorithm proceeds as follows. We
compute the critical rotations delineating rotation equivalence
classes by considering $O(m^2n^2)$ pairs of feature
matches. Each rotational equivalence class is associated
with a particular arrangement of translation constraints
in the $t_1$-$t_2$ plane from which we can easily compute the
maximal match sets valid at all rotations in the rotational
equivalence class. Changing to an adjacent rotational
equivalence class causes an incremental change in
the constraints in the $t_1$-$t_2$ plane. This is associated with
either the beginning or end of the range of feasibility of
some group of maximal match sets. Figure 5 shows this
event for a critical point due to position constraints. Say
the incremental change in the $t_1$-$t_2$ plane due to crossing
a rotational critical point affects the topological relationship
of the constraints for $k$ matches. If the rotational
critical point was due to a position constraint,
the change can be analyzed in $O(k \lg k)$ time; if it was
due to an angle constraint, the change can be analyzed
in $O(k \lg k + k^2)$ time. To understand the amount
of work we have to do to perform localization we have to
know how big $k$ is, and how many critical rotations there
really are. Asymptotically we expect that $k = O(mn)$,
and so asymptotically with the full algorithm described
we would expect to do $O(m^3n^3 \lg mn)$ work when angle
constraints are not used, and $O(m^4n^4)$ when angle
constraints are used. However, the asymptotic bounds
describe a case in which recognition is impossible: either
there is so much spurious data that the model is hallucinated
all over the image, or the data are so swamped
in sensor uncertainty that recognition is impossible anyway.
What these experiments demonstrate is that for
problem instances of practical interest the actual work
done in localization is far less than these asymptotic upper
bounds.
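
The worst-case figures follow from multiplying the number of critical rotations by the per-rotation cost with $k$ at its upper bound. Written out, taking $O(m^2n^2)$ critical rotations and $k = O(mn)$ as stated in the text:

```latex
% No angle constraints: each critical rotation costs O(k lg k).
O(m^2 n^2) \cdot O(k \lg k)\,\Big|_{k = O(mn)} = O(m^3 n^3 \lg mn)

% With angle constraints: each critical rotation costs O(k lg k + k^2),
% and the k^2 term dominates.
O(m^2 n^2) \cdot O(k \lg k + k^2)\,\Big|_{k = O(mn)} = O(m^4 n^4)
```

The gap between these bounds and the observed behavior comes entirely from $k$, which the experiments find to be far smaller than $mn$ in practice.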


Figure 7: The edges, and the correct hypothesis, for the
image shown in figure 8. The dots on the contours are the
feature points used in feature matching. Images of this level
of clutter and occlusion are typical.


We consider three different algorithms. One enforces
only position constraints, ignoring angle constraints.
Another enforces both position and angle constraints.
The last uses angle and position constraints to eliminate
pairs of feature matches whose constraints could never
be satisfied, but uses only position constraints to partition
the rotation space. Call these the cases of no angle
constraints, full angle constraints, and partial angle constraints,
respectively.

First we'll note how many critical rotations need to be
considered. Figure 6, graphs 1 and 3 show the number of
critical rotations partitioning rotation space as a function
of the size of the data feature set for fixed model
size. As we expect, there is a quadratic dependence on
the data feature set size, although, as can be seen by comparing
graph 1 with graph 3, when angle constraints are
used there is a considerable constant factor reduction in
computation. Similarly, for the number of critical rotations
as a function of the model size for fixed data size
we see quadratic dependence, as expected.

Next note how much work, on average, has to be done
for each critical rotation to find the size of the maximal
match sets as they're uncovered. We know that at
most we will have to consider the translation constraints
for each of the exactly $mn$ matches. We found empirically
that for practical-size problems over a broad range
of sizes the number of matches considered per critical
rotation on average is approximately constant as both
model and data set sizes vary, rather than $\sim mn$, the
upper bound. Figure 6, graph 4 shows the mean value
of the number of match constraints in the $t_1$-$t_2$ plane affected
by each critical rotation, which we called $k$ above,
for fixed model set size, as a function of the data set size.
For the cases where angle constraints were used, shown
in graph 4, we see that for a broad range of data set
sizes of practical interest, up to 100 times the size of the
model, the mean values of $k$ and $k^2$ are approximately
constant as the data set size increases. When angle constraints
were not used, the mean value of $k$ (for fixed
model size) is approximately a very slowly growing linear
function of the data set size, reasonably approximated by
a constant. Figure 6, graph 6 shows these quantities for
fixed data set size, as a function of the model set size.
The max value of $k$ is linear in the model size, while
interestingly the mean value of $k$ is approximately a slowly
growing linear function of the model set size, reasonably
approximated as constant. So for problem sizes of this
order we see that the assumption that $k \sim m$ is a reasonable
upper bound.

Figure 8: The image for the experimental results in figure

Let's analyze three different variations of the algorithm
in more detail. For the cases of partial angle constraints
and no angle constraints, the incremental work
required to analyze the change in $t_1$-$t_2$ plane constraints
associated with each critical point is $O(k \lg k)$, where
again $k$ is the number of match constraints affected, as
shown in figure 5. Over a large range of feature set sizes
the average work per critical point is a small constant,
and the amount of work required to analyze all maximal
match sets is $\sim m^2n^2$. If we use the randomized
algorithm we get a speedup of $\sim m$, and so the total expected
work is $\sim n^2m$. Arguably, the mean value of $k$ is
a very slowly growing linear function of the data set size
in the case of no angle constraints, as can be seen from
graph 4. If this is included in the analysis a factor of $m$
is added to the complexity. For the case where we use
full angle constraints and position constraints to partition
the rotation space, the incremental analysis takes up
to $k^2$ time when the associated critical point is due to an
angle constraint. From graph 4 in figure 6 we see that
over a broad range of data set sizes, the mean value of $k^2$
is small and roughly constant as the data size increases,
but it was found to be quadratic in model size. So the average
work per critical point is at most $\sim m^2$, and the
total work is $\sim m^4n^2$ for the full algorithm or $\sim n^2m^3$
expected case for the randomized algorithm.

So we see that on average the amount of work to do at each
critical rotation is about constant and is not dependent
on the data set size. These experiments illustrate two
points. First, the complexity of the constraint arrangement
is high only near transformations where the model
matches the data well. Elsewhere the complexity is often
low. Second, because of the random nature of the spurious
data, it is unlikely that (for reasonable data set size)
many different large spurious feature sets will match the
model, as demonstrated by the independence from $n$ in
the experiments.

This illustrates the point that the complexity of the
search space is usually high only in regions near the place
we're looking for, that is, a range of transformations consistent
with many match constraints. If we are careful
about how we explore the transformation space we can
exploit this fact and use algorithms that cost far less
than the asymptotic complexity bounds suggest and are
practical to compute.


8 Related Work

There are several recognition systems which rely on
matching local geometric features to determine feasible
model poses from the sensory data. We briefly mention
some relevant ones. Grimson and Lozano-Pérez developed
a recognition system based on searching the space
of possible match sets making use of what they called an
interpretation tree. In its basic form this exhaustively
searches correspondence space, exploiting geometric
constraints to prune the search [14]. One of the reasons
why this method has exponential complexity in spite
of geometric constraints is that it unnecessarily searches
through exponentially many subsets of what we call
maximal match sets. In practice, however, by using heuristics
to concentrate the search in areas where many matches
are consistent, the system performs well. One problem,
however, is that returning a correct negative answer
requires considerable search.

Baird [3] developed a method to localise 2D objects
from 2D data using point features as a representation.
Baird carefully formalised the problem of feature matching
under uncertainty by defining uncertainty bounds
for the feature locations characterised by sets of linear
inequality constraints on the transformed feature positions.
Important contributions of this work are the use
of scaled rotations to form the linear vector space of 2D
scaled rigid motions and the use of linear uncertainty
constraints. Essentially his method searched an interpretation
tree, except that it maintained global consistency
using linear programming, checking the feasibility of a
match set as each new match was added.

Baird reported polynomial expected-case performance
when there were no missing and spurious features, yet
could see no way of avoiding exponential search when
there were missing and spurious features. Although, as
we have shown in this paper, the underlying transformation
space has polynomial complexity, Baird’s tree
search did not exploit this fact. Baird focused on a
correspondence-space search, and used the matching
constraints and linear programming to guide the search.
He did not consider the idea that all constraints could
be expressed in transformation space, together delineating
a polynomial number of transformation equivalence
classes. Thus while his formulation permitted a
polynomial-time solution, his algorithm did not exploit
it.
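
As a rough illustration of the interpretation-tree search discussed in this section, the following Python sketch explores correspondence space depth-first, prunes with a pairwise distance constraint, and includes a wildcard branch for spurious data features. It is not the implementation of [14] or [3]: the function name `interpretation_tree`, the tolerance `tol`, and the choice of a distance-preservation constraint are assumptions made only for this example.

```python
from math import dist


def interpretation_tree(model, data, tol=1e-6):
    """Return the largest consistent list of (model_index, data_index) pairs found."""
    best = []

    def compatible(pairs, m, d):
        # Pairwise constraint: a rigid motion preserves inter-point distances.
        return all(
            abs(dist(model[m], model[pm]) - dist(data[d], data[pd])) <= tol
            for pm, pd in pairs
        )

    def expand(d, pairs, used):
        nonlocal best
        if d == len(data):                      # leaf: every data feature decided
            if len(pairs) > len(best):
                best = list(pairs)
            return
        if len(pairs) + len(data) - d <= len(best):
            return                              # cannot beat the best leaf so far
        for m in range(len(model)):
            if m not in used and compatible(pairs, m, d):
                expand(d + 1, pairs + [(m, d)], used | {m})
        expand(d + 1, pairs, used)              # wildcard: data feature d is spurious

    expand(0, [], frozenset())
    return best


model = [(0, 0), (2, 0), (0, 1)]
data = [(5, 5), (5, 7), (4, 5), (9, 9)]   # rotated, translated model plus one outlier
matches = interpretation_tree(model, data)
```

The wildcard branch is what allows missing and spurious features, and it is also what makes the worst case exponential, since every subset of a consistent match set is itself a node in the tree.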

Breuel [4] has developed an elegant correspondence-space
search technique based on the problem formulation
and tree search method used by Baird [3], and the
notion of transformation equivalence classes and maximal
match sets described in [7] and here. His algorithm
exploits the constraints in transformation space
imposed by each match, and the fact that there are
only a polynomial number of meaningful transformation
equivalence classes, to arrive at a polynomial-time
algorithm. The problem of exponential search is avoided
by guiding the correspondence space search so that only
maximal match sets are constructed. See also [17, 6, 1,
11].

Abstract

One of the major problems in model-based object
recognition is selection, namely, the problem of
determining which regions in the image are likely
to come from a single object. In this paper we
present an approach that uses color as a cue to
perform selection either based solely on image-data
(data-driven), or based on the knowledge of the
color description of the model (model-driven).
Specifically, the paper argues for the specification
of color in terms of color categories, as being
appropriate for the task of selection. These color
categories are used to develop a fast region
segmentation algorithm that extracts perceptual
color regions in images. The color regions
extracted form the basis for performing data
and model-driven selection. Data-driven selection
is achieved by selecting salient color regions.
The saliency of color regions is judged
by a color-saliency measure that emphasises
attributes that are also important in human
color perception. The approach to model-driven
selection, on the other hand, exploits
the color region information in the model to
locate instances of the model in a given image.
The approach presented tolerates some
of the problems of occlusion, pose and illumination
changes that make a model instance in
an image appear different from its original
description. Finally, the utility of color-based
data and model-driven selection is discussed
in the context of reducing the search involved
in recognition.

1.  SELECTION  IN  RECOGNITION 

One  of  the  major  problems  in  object  recognition  is  se¬ 
lection,  namely,  the  problem  of  identifying  regions  in  an 
image  within  which  to  start  the  recognition  process.  In 
other  words,  selection  is  the  process  of  isolating  regions 
in  an  image  that  are  likely  to  come  from  a  single  object. 
Model-based  object  recognition  methods  that  try  to  rec¬ 
ognise  which  of  the  models  from  their  library  of  models 
are  present  in  the  scene,  usually  use  geometric  features 
such as points or edges and try to identify pairings
between data and model features that are consistent with
a rigid transformation of the object model into image
coordinates. The large number of such pairings that need
to  be  examined  in  cluttered  scenes  leads  to  a  combinato¬ 
rial  search  problem.  It  has  been  shown  that  this  search 
can  be  considerably  reduced  if  recognition  systems  are 
equipped  with  a  selection  stage  where  subsets  of  data 
features  can  be  isolated  that  are  likely  to  come  from  a 
single  object,  thus  allowing  search  to  be  focused  on  those 
matches  that  are  more  likely  to  lead  to  a  correct  solution 
[4].  This  isolation  can  be  either  based  solely  on  image 
data  (data-driven)  or  can  incorporate  the  knowledge  of 
the  model  (task-driven  or  model-driven).  In  addition,  it 
is  desirable  to  order  these  subsets  of  data  features  such 
that  the  more  promising  ones,  i.e.,  those  that  are  more 
likely  to  point  to  a  single  object,  are  explored  first.  This 
can  not  only  increase  the  likelihood  of  a  good  match  be¬ 
ing  obtained  earlier,  but  is  also  useful  when  the  task  is  to 
recognise  as  many  objects  as  possible  in  a  scene.  Thus 
the goals of selection in recognition are twofold: to
isolate areas in the image that are likely to come from a
single  object,  and  to  order  these  regions  such  that  the 
more  promising  ones  are  explored  first.  From  this,  it  is 
apparent  that  the  goals  of  selection  are  different  from 
those  of  segmentation,  where  the  problem  is  to  partition 
the  image  into  regions  that  contain  a  single  object.  In 
selection,  on  the  other  hand,  it  is  not  essential  to  isolate 
regions  that  totally  contain  a  single  object,  nor  is  it  nec¬ 
essary  to  partition  the  entire  image  into  different  object 
containing  regions. 

Even  though  selection  can  be  of  help  in  recognition, 
it  has  largely  remained  unsolved.  What  makes  selec¬ 
tion  so  difficult?  In  the  ideal  case,  if  the  appearance  of 
the  desired  object  in  the  scene  was  known,  and  objects 
in  the  scene  were  nicely  separated  and  distinguishable 
from  the  background,  and  the  illumination  conditions 
were  known,  then  even  simple  methods  that  rely  on  in¬ 
tensity  measurements  would  work  well  to  extract  groups 
of  features.  But  in  reality  the  appearance  of  the  ob¬ 
ject  is  not  known.  In  addition,  illumination  conditions 
and  surface  geometries  of  objects  present  in  a  scene  can 
cause  problems  of  occlusion,  shadowing,  specularities, 
inter-reflections,  etc.  in  the  image  and  make  it  difficult 
to  interpret  groups  of  data  features  such  as  edges  and 
lines.  Previous  approaches  to  selection  have  focused  on 
the  problem  of  data-driven  selection  by  grouping  data 
features such as edges, lines, points, etc., based on
constraints such as parallelism, collinearity, etc. [8], distance
and  orientation  [?],  and  regions  enclosed  by  a  group  of 
edges  [?].  The  extent  to  which  such  grouping  methods 
reduce  the  search  in  recognition  depends  on  the  relia¬ 
bility  of  the  groups  produced  (i.e.  how  many  of  them 
really  come  from  a  single  object).  With  the  constraints 
used,  maintaining  the  reliability  of  groups  was  found  to 
be  difficult.  So  the  general  problem  of  selection  remains 
largely  unsolved  as  it  is  still  not  obvious  how  to  reli¬ 
ably  characterise  subsets  of  data  features  that  will  give 
clues  that  point  to  a  single  object.  Thus  it  appears  that 
there  is  a  need  for  a  computational  model  of  selection  to 
explain  both  data  and  task-driven  selection. 

We  have  been  involved  in  building  one  such  model  that 
proposes  that  selection  be  accomplished  via  an  attention 
mechanism.  Specifically,  it  is  an  attempt  to  build  a  com¬ 
putational  model  of  the  visual  attention  phenomenon  in 
humans,  and  to  propose  it  as  a  general  purpose  selection 
mechanism  in  recognition.  This  involves  the  isolation 
of  two  modes  of  human  attentional  behavior,  namely 
attracted-attention  and  pay-attention  modes,  to  serve 
as  paradigms  for  data-driven  and  model-driven  selection 
respectively. The attracted-attention mode of behavior is
spontaneous and is commonly exhibited by an unbiased
observer (i.e., with no a priori intentions) when some
objects or some aspects of the scene attract his/her
attention, while the pay-attention mode is a more deliberate
behavior exhibited by an observer looking at a scene
with  a  priori  goals  (such  as  the  task  of  recognising  an 
object,  say)  and  hence  paying  attention  to  only  those 
objects/aspects  of  a  scene  that  are  relevant  to  the  goal. 
According  to  this  model,  therefore,  data-driven  selec¬ 
tion  can  be  achieved  by  identifying  regions  in  an  image 
that  attract  attention  (i.e.,  that  are  distinctive)  with  re¬ 
spect  to  some  feature  such  as  color,  texture,  etc.,  while 
model-driven  selection  can  be  achieved  by  paying  atten¬ 
tion  to  the  model  features  (i.e.,  using  the  model  features 
to  decide  saliency  of  features  in  the  image).  While  it  is 
understandable  that  paying  attention  to  model  features 
can  help  isolate  areas  in  the  image  that  could  contain 
subsets  of  data  features  that  are  likely  to  contain  a  sin¬ 
gle  object  (or  the  specific  model  object  in  this  case),  it 
is  not  immediately  apparent  how  locating  salient  regions 
can  help  in  serving  the  goals  of  selection.  Such  a  choice 
is,  however,  motivated  by  the  following  considerations. 
First,  it  is  often  observed  that  an  object  stands  out  in  a 
scene  because  of  some  distinctive  features  that  are  usu¬ 
ally  localised  to  some  portion  of  the  object.  Therefore 
isolating  distinctive  regions  is  more  likely  to  point  to  a 
single  object.  Secondly,  a  distinctive  region,  if  suitably 
found,  can  help  in  limiting  the  number  of  candidate  mod¬ 
els  from  the  library  that  can  potentially  match  the  given 
data.  This  is  especially  true  if  only  a  few  models  in  the 
library  satisfy  the  features  that  made  the  data  region 
distinctive.  Lastly,  it  has  often  been  observed  that  the 
first  objects  recognised  in  a  scene  are  those  that  attract 
an  observer’s  attention.  Thus  ordering  the  regions  by 
distinctiveness  to  decide  which  objects  to  recognise  first 
seems to be in keeping with this observation. Finally,
a number of other approaches have also suggested that
selection, at least data-driven, can be performed based
on some measure of saliency, such as the structural
saliency of curves [16], or saliency defined by local
differences in contrast, color, size, etc. [2],[3],[15].

Figure 1: Illustration of color region segmentation and
color-saliency. (a) Input image depicting a scene of
objects of different materials and having occlusions and
inter-reflections. (b) Segmented image using the color
region segmentation algorithm. (c)-(f) The four most
distinctive regions detected using the color-saliency
measure. The white portion in the red book appears so
because of the white background.

The  above  discussion  indicates  a  framework  in  which 
data  and  model-driven  selection  can  be  achieved.  But 
how  can  salient  regions  be  found  in  the  image  indepen¬ 
dent  of  the  model,  and  how  can  the  object  model  affect 
the  choice  of  regions?  The  purpose  of  this  paper  is  to 
present  a  method  of  selection  by  restricting  attention  to 
one  particular  feature,  namely,  color.  It  shows  how  color 
regions  can  be  extracted  from  the  image  and  how  they 
can be used to perform data-driven and model-driven
selection. To give a flavor for the ensuing discussions,
Figures 1-2 show some examples of the results of data and
model-driven selection performed by our system. Figure
1a shows an image of a realistic indoor scene with
shadows, inter-reflections, and consisting of many types of
objects. The different color regions found in this image
are re-colored and shown in Figure 1b. The four most
salient color regions found are shown in Figures 1c-1f.
These regions span objects in the scene that are salient
in color. Figure 2 shows model-driven selection using
color, using the model object shown in Figure 2a and
the scene depicted in Figure 2c. The cluster of regions
found to best satisfy the model color region description
using our algorithm for model-driven selection is shown
in Figure 2f.

The rest of the paper discusses how this kind of
selection can be achieved using color. It is organised as
follows. In Section 2, we motivate the choice of color as
a feature to study selection, and outline the requirements
imposed by selection on any method for the extraction
of color information. Based on these guidelines, an
approach to extracting color regions is presented in Section
3. In Section 4, a measure for expressing the saliency of
color regions is presented and its effectiveness for
data-driven selection is examined. Section 5 presents a way
to perform model-driven selection based on the color
regions. Finally, Section 6 summarises our approach to
color-based selection.

2.  COLOR  IN  SELECTION 

2.1  Role  of  Color  in  Selection 

Color  is  known  to  be  a  strong  cue  in  attracting  an 
observer’s  attention.  Humans  often  also  use  color  infor¬ 
mation  to  search  for  specific  objects  in  a  scene.  It  there¬ 
fore  seems  natural  to  use  color  as  a  cue  for  performing 
selection  in  computer  vision.  But  the  strong  motivation 
for  using  color  in  selection  comes  from  the  fact  that  it 
provides  region  information  and  that,  when  specified  ap¬ 
propriately,  it  can  be  relatively  insensitive  to  variations 
in  normal  illumination  conditions  and  appearances  of  ob¬ 
jects.  A  color  region  in  the  image  almost  always  comes 
entirely  from  a  single  object,  giving,  therefore,  more  reli¬ 
able  groups  than  existing  grouping  methods  and  this  can 
be  useful  for  data-driven  selection.  Because  objects  tend 
to  show  color  constancy  under  most  illumination  condi¬ 
tions,  color  can  be  a  stable  cue  for  most  poses  (appear¬ 
ances)  of  objects  in  scenes,  thus  making  it  also  suitable 
for  model-driven  selection. 

2.2 Surface Color, Image Color, Perceptual Color

Although  color  is  useful  for  selection,  the  problem  of 
specifying  the  perceived  color  of  objects  from  an  image  of 
the  scene  has  proven  to  be  difficult.  Several  artifacts  such 
as  specularities  (from  shiny  surfaces  in  the  scene),  inter¬ 
reflections,  shading  on  surfaces,  and  shadowing  all  make 
it  difficult  to  recover  the  actual  color  of  objects  in  the 
scene  from  the  image.  Existing  approaches  have  mainly 
focused  on  the  problem  of  color  constancy,  where  the 
goal  was  to  extract  surface  color,  i.e.,  surface  reflectance 
properties  of  objects,  in  order  to  obtain  a  stable  percep¬ 
tion  of  the  color  of  an  object  under  varying  illumination 
conditions.  As  this  problem  is  under-constrained,  most 
methods  make  some  assumptions  about  either  the  sur¬ 
face  being  imaged  [12],  or  about  the  illumination  con¬ 
ditions  [13], [9].  Other  approaches  also  exist  that  try 
to  recover  image  color,  i.e.,  the  color  of  the  objects  as 
they  appear  under  the  present  illumination  conditions, 
accounting separately for artifacts such as specularities
on shiny surfaces [11]. These methods, however, cannot
ensure  that  the  color  extracted  matches  the  perceived 
color  of  regions. 

For the purposes of selection, what kind of color
information should be extracted from regions? Is recovering
image color sufficient or should one attempt to recover
surface color? We propose that for both data and
model-driven selection, it is sufficient if a region could be
specified by its perceived color, and the effects of artifacts
such as specularities could be separately accounted for, as
was done by image color recovery methods. Using the
perceptual color, two adjacent color regions would be
distinguished if their perceived colors were different, and
this is sufficient for data-driven selection. Because
objects tend to obey color constancy under most changes in
illumination, their perceived color remains more
or less the same, thus making it sufficient also for
model-driven selection. But can perceptual color be quantified
at all? In general, several effects such as simultaneous
color contrast, color filling, etc. have been known to
influence human perception of color. Fortunately (as we
will explain later), these factors are not very critical for
selection.

Figure 2: Illustration of color-based model-driven
selection. (a) The object serving as the model. (b) Its color
description produced by the segmentation algorithm of
Section 3. (c) A cluttered scene in which the object
appears. (d) Regions selected based on unary color
constraint. (e) Regions of (d) pruned after using the unary
size constraint. (f) Regions corresponding to the best
subgraph that matched the model specifications.

2.3 Perceptual Color Specification by Categories

In  this  section  we  present  a  method  of  specifying  the 
perceptual  color  of  image  regions  from  the  colors  of  their 
constituent  pixels.  The  color  of  pixels  in  images  is  de¬ 
scribed  by  a  triplet  <R,G,B>  (called  specific  color  hence¬ 
forth),  representing  the  components  of  image  intensity  at 
that  point  along  three  wavelengths  (usually  red,  green 
and  blue  as  dominant  wavelengths  to  correspond  to  the 
filters  used  in  the  color  cameras).  When  all  possible 
triples  are  mapped  into  a  3-dimensional  color  space  with 
axes standing for the pure red, green and blue
respectively, we get a color space that represents the entire
spectrum of computer recordable colors. Such a color
space must, therefore, be partitionable into subspaces
where the color remains perceptually the same, and is
distinctly different from that of neighboring subspaces.
Such subspaces can be called perceptual color categories.
Now each pixel in the image maps to a point in this color
space, and hence will fall into one of these categories.
The perceptual color of this pixel can, therefore, be
specified by this color category. To get the perceived colors
of regions from the perceptual color of their constituent
pixels, we observe the following. Although the individual
pixels of an image color region may show considerable
variation in their specific colors, the overall color of
the region is fairly well-determined by the color of the
majority of pixels (called dominant color henceforth).
Therefore, the perceived color of a region can be
specified by the color category corresponding to the dominant
color in the region.
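
The dominant-color rule above can be sketched in a few lines of Python. This is only an illustration: `toy_category` is a hypothetical stand-in that names six coarse hues plus "black" and "gray", and is not the category table used in the paper; `dominant_category` simply returns the category held by the largest number of pixels in a region.

```python
import colorsys
from collections import Counter

HUE_NAMES = ["red", "yellow", "green", "cyan", "blue", "magenta"]


def toy_category(rgb):
    """Hypothetical categorizer: maps an 8-bit <R,G,B> triple to a coarse color name."""
    r, g, b = (c / 255.0 for c in rgb)
    h, s, v = colorsys.rgb_to_hsv(r, g, b)
    if v < 0.2:
        return "black"
    if s < 0.2:
        return "gray"
    # Centre each 60-degree hue sector on its nominal hue (red at 0 degrees, etc.).
    return HUE_NAMES[int(((h * 360.0 + 30.0) % 360.0) // 60.0)]


def dominant_category(pixels, categorize=toy_category):
    """Perceived color of a region: the category of the majority of its pixels."""
    counts = Counter(categorize(p) for p in pixels)
    return counts.most_common(1)[0][0]


# A mostly red region with a specular (near-white) pixel and a stray blue pixel.
region = [(200, 30, 30), (180, 20, 25), (220, 60, 50), (250, 250, 250), (30, 30, 200)]
region_color = dominant_category(region)
```

Because only the majority category matters, a few highlight or noise pixels inside the region do not change its description.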

The category-based specification of perceptual color
(of pixels or regions) is a good compromise between
choosing the specific color (which is extremely unstable
w.r.t. changes in illumination conditions, etc.) and
surface color (whose recovery is hard). Since the
categories indicate the perceptual color, they have the same
beneficial effect as recovering perceptual color, on both
data and model-driven selection, such as giving a
reliable segmentation of the image into color regions, being
stable under changes in illumination conditions, etc. In
addition, since the perceptual categories depend on the
color space and are independent of the image, they can
be found in advance and stored in, say, a look-up
table. Finally, a category-based description is in keeping
with the idea of perceptual categorisation that has
been explored extensively through psychophysical
studies. These studies concluded that although humans can
discriminate between several thousand nuances of colors,
psychophysically, we seem to partition the color space
into relatively few distinct qualitative color sensations
or categories [17].

2.4  Categorization  of  Color  Space 

The above discussion argued for the viability of an
approach that recovers a color to within a category.
Before this can be turned into a computational method of
color recovery one needs to address the issue of how such
categories may be found. Previous work on color
categorization involved experiments of naming the color
using a limited vocabulary, or identifying colors using the
Munsell color charts [20]. But for computational color
recovery, we need a way to convert the camera recordable
red, green and blue components of colors into computer
recordable perceptual color categories. This was done
by performing some rather informal but extensive
psychophysical experiments that systematically examined a
color space and recorded the places where qualitative
color changes occur, thus determining the number of
distinct color categories that can be perceived. For this, the
hue-saturation-value color space was used as it specifies
a given color in terms of its hue, purity and brilliance,
attributes that have been found to give a perceptual
description of color [10]. The details of these experiments
will be skipped here, except to mention the following.
The entire spectrum of computer recordable colors ($2^{24}$
colors) was quantized into 7200 bins corresponding to
a 5 degree resolution in hue, and 10 levels of
quantization of saturation and intensity values. The color in
each such bin was then observed by displaying a
mondrian (a uniform color patch) of that color on a monitor
screen and observing it under dark room conditions with
appropriate monitor calibration. From our studies, we
found about 220 different color categories were sufficient
to describe the color space. The color category
information was then summarized in a color-look-up table.
Although it is true that a finer level of quantization would
have yielded more categories, a smaller set is actually
more useful since it gives a reasonably coarse
description of the color of a region, thus allowing it to remain
the same for some variations in imaging conditions. In
fact, by the above method we can also determine which
categories can be grouped to give an even rougher
description of a particular hue. This was done and stored
in a category-look-up table to be indexed using the color
categories given by the color-look-up table.
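
The quantization described above (a 5 degree hue resolution gives 72 hue bins, and 72 x 10 x 10 = 7200) can be sketched as follows. This is a minimal illustration under stated assumptions: it uses Python's standard `colorsys` conversion rather than the conversion of [21], and the names `hsv_bin`, `HUE_BINS`, `SAT_BINS` and `VAL_BINS` are hypothetical. The experimentally derived color-look-up table itself is not reproduced here; it would be an array of 7200 category labels indexed by the bin number.

```python
import colorsys

HUE_BINS = 360 // 5   # 5 degree resolution in hue -> 72 bins
SAT_BINS = 10         # 10 quantization levels of saturation
VAL_BINS = 10         # 10 quantization levels of intensity (value)


def hsv_bin(rgb):
    """Map an 8-bit <R,G,B> triple to one of the 72 * 10 * 10 = 7200 HSV bins."""
    r, g, b = (c / 255.0 for c in rgb)
    h, s, v = colorsys.rgb_to_hsv(r, g, b)       # each component lies in [0, 1]
    hi = min(int(h * HUE_BINS), HUE_BINS - 1)    # clamp the upper edge into the last bin
    si = min(int(s * SAT_BINS), SAT_BINS - 1)
    vi = min(int(v * VAL_BINS), VAL_BINS - 1)
    return (hi * SAT_BINS + si) * VAL_BINS + vi


total_bins = HUE_BINS * SAT_BINS * VAL_BINS
red_bin = hsv_bin((255, 0, 0))
orange_bin = hsv_bin((255, 128, 0))
```

A look-up then costs one conversion and one array index per pixel, which is why the categories can be computed in advance and independently of any image.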

3.  COLOR  REGION  SEGMENTATION 

The previous section described how to specify the
color of regions once they have been isolated. But
the more crucial problem is to identify these regions. In
this section, we show that the perceptual categorization
principle can be used to determine which pixels can be
grouped to form regions in an image. If each surface
in the scene were a mondrian, then all its pixels would
belong to a single color category, so that by grouping
spatially close pixels belonging to a category, the desired
segmentation of the image could be obtained. But real
surfaces are hardly mondrians, and it is rare that the pixels
of a region from such surfaces all belong to the same
color category. They could show considerable variation
in color with bright and dark pixels intermixed, and with
possibly spurious pixels also being present. We now
analyse some of the color variations across an image that can
result from imaging a colored surface in the scene.

3.1 Variation of Color Across an Image of a 3D Surface

In this section we use some assumptions to show that
the color variation across an image of a surface is mostly
in intensity. When a surface is imaged, the light falling
on the image plane (image irradiance) is related to the
physical properties of the scene being imaged via the
image irradiance equation given below:

$$I(\lambda, \mathbf{r}) = \rho(\lambda, \mathbf{r})\, F(\mathbf{k}, \mathbf{n}, \mathbf{s})\, E(\lambda, \mathbf{r}) \tag{1}$$

where $\lambda$ is the wavelength, $\mathbf{r}$ is the spatial coordinate
(and, by abuse of notation, its projection in the image),
$E(\lambda, \mathbf{r})$ is the intensity of the ambient illumination,
$\rho(\lambda, \mathbf{r})$ is the component of surface reflectance that
depends only on the material properties of the surface
(and hence specifies its surface color), while
$F(\mathbf{k}, \mathbf{n}, \mathbf{s})$ is the component of surface reflectivity
that depends on surface geometry, with $\mathbf{k}$, $\mathbf{s}$, $\mathbf{n}$
being the viewer direction, the source direction and the
surface normal respectively. Although the image irradiance
equation assumes that all surfaces in a scene reflect
light governed by a single reflectivity function, we can
easily reinterpret this equation to represent the image
irradiance of a single surface. Under the assumption of a
single light source, the surface illumination $E(\lambda, \mathbf{r})$ can
be separated as a product of two terms $E_1(\lambda)$ and $E_2(\mathbf{r})$,
and since $F(\mathbf{k}, \mathbf{n}, \mathbf{s})$ is a function of position $\mathbf{r}$,
written $F(\mathbf{r})$, the image irradiance equation can be
re-written as

$$I(\lambda, \mathbf{r}) = \rho(\lambda, \mathbf{r})\, F(\mathbf{r})\, E_1(\lambda)\, E_2(\mathbf{r}) \tag{2}$$

The surface reflectance and hence the resulting
appearance of a surface is determined by the composition
as well as the concentration of the pigments of the
material constituting the surface. For most surfaces, the
composition of the pigments can be considered independent
of their concentration so that the spectral reflectance
$\rho(\lambda, \mathbf{r})$ can be written as a product of two terms $\rho_1(\lambda)$
and $\rho_2(\mathbf{r})$. Note that this assumption is less restrictive
than the assumption of homogeneity that has been used
before [9]. With this simplification (and grouping the
product of terms dependent on $\lambda$ and $\mathbf{r}$ separately), the
image irradiance equation becomes

$$I(\lambda, \mathbf{r}) = H(\mathbf{r})\, I(\lambda) \tag{3}$$
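
The grouping step that leads to equation (3) can be written out explicitly. The block below is a reading of the derivation rather than a formula quoted from the text: it assumes $H(\mathbf{r})$ collects the position-dependent factors and $I(\lambda)$ the wavelength-dependent ones, with $\rho_1$, $\rho_2$, $E_1$, $E_2$ and $F(\mathbf{r})$ the separated terms introduced for equation (2).

```latex
\begin{align*}
I(\lambda, \mathbf{r})
  &= \rho_1(\lambda)\,\rho_2(\mathbf{r})\,F(\mathbf{r})\,E_1(\lambda)\,E_2(\mathbf{r}) \\
  &= \underbrace{\rho_2(\mathbf{r})\,F(\mathbf{r})\,E_2(\mathbf{r})}_{H(\mathbf{r})}\;
     \underbrace{\rho_1(\lambda)\,E_1(\lambda)}_{I(\lambda)}
\end{align*}
```

All dependence on surface geometry and pigment concentration is thus confined to the scalar $H(\mathbf{r})$, which multiplies the same spectral profile at every point of the surface.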

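Now, if we consider the filtered version of this
signal, i.e., the image irradiance in three channels, say
the red, green and blue channels with their associated
transfer functions $h_R(\lambda)$, $h_G(\lambda)$, $h_B(\lambda)$, the specific
color at each pixel location $\mathbf{r}$ is specified by the triple
$\langle R(\mathbf{r}), G(\mathbf{r}), B(\mathbf{r}) \rangle$ where

$$R(\mathbf{r}) = \int I(\lambda, \mathbf{r})\, h_R(\lambda)\, d\lambda = H(\mathbf{r})\, R_0 \tag{4}$$

$$G(\mathbf{r}) = \int I(\lambda, \mathbf{r})\, h_G(\lambda)\, d\lambda = H(\mathbf{r})\, G_0 \tag{5}$$

$$B(\mathbf{r}) = \int I(\lambda, \mathbf{r})\, h_B(\lambda)\, d\lambda = H(\mathbf{r})\, B_0 \tag{6}$$

where $R_0$, $G_0$ and $B_0$ are constants that do not depend on $\mathbf{r}$.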
This shows that under the given assumptions (which
include non-homogeneous surfaces), the color of a
surface can vary only in intensity. In practice, even when
the separability assumption on reflectance is not
satisfied, or there is more than one light source in the scene,
the general observation is that the intensity and purity
of colors get affected but the hue still remains fairly the
same. In terms of categories, this means that different
pixels in a surface belong to compatible categories, i.e.,
have the same overall hue but vary in intensity and
saturation. Conversely, if we group pixels belonging to a
single category, then each physical surface is spanned by
multiple overlapping regions belonging to such
compatible color categories. These were the categories that were
grouped in the category-look-up table mentioned in
Section 2.4. The next section describes how these concepts
can be put together to give a color image segmentation
algorithm.
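
The claim that a common scale factor changes only intensity can be checked numerically. The sketch below is an illustration, not part of the paper's method: it scales a hypothetical base triple by several values of a shading factor and converts each result to hue, saturation and value with Python's standard `colorsys` module. The names `shaded` and `surface_rgb` and the sample values are assumptions.

```python
import colorsys


def shaded(rgb, h_factor):
    """Scale all three channels by the same position-dependent factor H(r)."""
    return tuple(h_factor * c for c in rgb)


surface_rgb = (0.8, 0.3, 0.2)    # hypothetical channel constants of one surface
factors = (1.0, 0.6, 0.25)       # hypothetical values of H(r) at three pixels
samples = [colorsys.rgb_to_hsv(*shaded(surface_rgb, f)) for f in factors]

hues = [h for h, _, _ in samples]
saturations = [s for _, s, _ in samples]
values = [v for _, _, v in samples]
```

Under pure scaling both hue and saturation stay fixed and only the value component changes. The variation in saturation mentioned above arises when the assumptions are relaxed, for example with specular highlights or more than one light source.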

3.2 Color Region Segmentation Algorithm

The algorithm for color image segmentation performs
the following steps. (1) First, it maps all pixels to their
categories in color space. (2) It then groups pixels
belonging to the same category, (3) and finally merges
overlapping regions in the image that are of compatible color
categories.

1. Mapping pixels to categories: This is done by a simple
indexing of the color-look-up table by the color of
the pixel specified in terms of its hue, saturation, and
brightness components. These components can be
derived from the specific color as described in [21]. This
step takes time O(N), where N is the size of the image.

2. Grouping pixels of same category: The image is
divided into small non-overlapping bins of fixed size (say,
8x8) and the color categories found in the bins are
recorded. Then a sequential non-recursive labeling
algorithm is used to simultaneously assemble all the
overlapping connected components using the category
description in the bins. This algorithm is an extended
version of the labeling algorithm for binary images
described earlier [5], and uses the union-find data structure to
efficiently merge category labels into connected
components, taking time O(k^2 M), where M = window-size,
and k = maximum number of categories present in the
window (= O(1) for small window-sizes, e.g., M = 64).
The resulting labels are propagated back to the pixels
to give the precise boundaries of color regions of single
color categories. The color of the region is then specified
by the color category and specific color of the dominant
color in the regions as described in Section 2.3.

3. Merging overlapping regions: The general problem of
determining which regions overlap in the image can be a
computationally intensive operation as it involves
determining which polygonal regions intersect and finding
their regions of intersection. But by using the bin-wise
representation of connected components, we can detect
and combine overlapping regions with greater ease. From
the discussion in Section 3.1, a shaded region maps to
categories in color space that are compatible, i.e., have
the same overall hue. The categories that are compatible
are available from the category-look-up table described
in Section 2.4. To find all such regions that have
compatible categories and overlap in image space, the
algorithm examines each window of the image to see if it
contains the interior portions of regions of compatible
color categories. Such overlap regions are grouped as
in step 2. This step again takes O(k^2 M) time. Finally,
the window-level color labels are propagated back to the
corresponding pixels to give an accurate localization of
the color region boundaries.

The  algorithm  for  color  image  segmentation  thus 
makes  only  a  constant  number  of  passes  through  the  im¬ 
age,  each  being  linear  in  the  size  of  the  image. 
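The grouping step can be illustrated with a minimal sketch of sequential, non-recursive labeling with a union-find structure. It is a simplification made for illustration: it labels individual pixels of a category image with 4-connectivity, whereas the algorithm in the text works on bin-level category descriptions. The function name `label_regions` and the list-of-lists image representation are assumptions, not taken from the paper.

```python
def label_regions(categories):
    """Two-pass, non-recursive connected-component labeling (4-connectivity).

    categories: list of rows, each a list of hashable color-category ids.
    Returns a same-shaped list of integer region labels, numbered from 0.
    """
    rows, cols = len(categories), len(categories[0])
    parent = []  # union-find forest over provisional labels

    def find(x):
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path halving
            x = parent[x]
        return x

    def union(a, b):
        ra, rb = find(a), find(b)
        if ra != rb:
            parent[max(ra, rb)] = min(ra, rb)

    labels = [[0] * cols for _ in range(rows)]
    # Pass 1: assign provisional labels and record label equivalences.
    for y in range(rows):
        for x in range(cols):
            cat = categories[y][x]
            up = labels[y - 1][x] if y > 0 and categories[y - 1][x] == cat else None
            left = labels[y][x - 1] if x > 0 and categories[y][x - 1] == cat else None
            if up is None and left is None:
                parent.append(len(parent))      # start a new provisional region
                labels[y][x] = len(parent) - 1
            elif up is not None and left is not None:
                union(up, left)                 # both neighbors belong together
                labels[y][x] = min(up, left)
            else:
                labels[y][x] = up if up is not None else left
    # Pass 2: replace provisional labels by compact representative ids.
    compact = {}
    for y in range(rows):
        for x in range(cols):
            root = find(labels[y][x])
            labels[y][x] = compact.setdefault(root, len(compact))
    return labels
```

Both passes visit every pixel once, which is consistent with the constant number of linear passes noted above.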

3.4  Results 

Figures  3-4  demonstrate  the  color  region  segmentation 
algorithm.  Figure  3a  shows  a  256  x  256  pixel  size  image 
of  a  color  pattern  on  a  plastic  bag.  The  folding  on  the 
bag  and  its  plastic  material  together  give  a  glossy  ap¬ 
pearance  in  the  image  as  can  be  seen  in  the  big  S  and  Y. 
The  result  of  step-2  of  the  algorithm  is  shown  in  Figure 
3b,  and  there  it  can  be  seen  that  the  glossy  portions  on 
big  blue  Y  and  the  red  S  cause  overlapping  color  regions. 
These are merged in step 3 and the result is shown in
Figure  3c.  As  can  be  seen  in  the  figure,  the  algorithm 
achieves  a  fairly  good  segmentation  of  the  scene  for  such 
surfaces.  Figure  4a  shows  another  image  consisting  of 
colored  pieces  of  cloth  with  the  textured  regions  having 
several small colored regions within them. The results of
the algorithm (Figure 4b) show that even such colored
regions  can  be  reliably  isolated.  Another  example  (Fig¬ 
ure  1)  of  color  region  extraction  was  mentioned  earlier 
in  Section  1.  Notice  in  the  segmented  image  of  Figure 
1b that adjacent objects of the same perceptual color are
merged  (grey  books).  This  is  to  be  expected  because  the 
grouping  of  regions  is  based  on  color  information  alone. 

4.  COLOR-BASED DATA-DRIVEN
SELECTION 

The segmentation algorithm described above gives a
large number of color regions. Some of these may span
more than one object, while some come from the scene
clutter (background, etc.) rather than from objects of
interest in the scene. For the purposes of recognition, it
would be useful to order these regions and consider only
some of them, so that by isolating data subsets from such
regions the search can be focused on key groups of
features, thus excluding much of the scene clutter. Based
on the rationale given in Section 1, we propose that the
color regions be ordered by their saliency, i.e., by how
distinctive they appear. The method of color-based
selection, therefore, is to extract color regions from the
image, order them based on a measure of color saliency,
and then select the few most salient regions to be given
to any recognition system. In this section we first
describe a measure for expressing color saliency, and then
examine the utility of salient-region selection in
recognition.

4.1  Finding Salient Color Regions in Images

In  trying  to  express  distinctiveness,  one  encounters  the 
question:  Is  distinctiveness  expressible  at  all?  In  general, 
any  judgement  of  distinctiveness  has  both  a  sensory  and 
a  subjective  component.  Thus  for  example,  while  most 
of  us  can  perceive  brighter  colors  more  easily  than  duller 
colors, the judgement of which of two hues of the same
brightness and saturation is more salient can be
subjective. The aim here is to focus on the sensory
component of distinctiveness and hence extract properties of
regions that are general enough to be perceived by most
observers. Accordingly, we propose that the saliency of
a  color  region  be  composed  of  two  components,  namely, 
self-saliency  and  relative  saliency.  Self-saliency  deter¬ 
mines  how  conspicuous  a  region  is  on  its  own  and  mea¬ 
sures  some  intrinsic  properties  of  the  region,  while  rela¬ 
tive  saliency  measures  how  distinctive  the  region  appears 
when  there  are  regions  of  competing  distinctiveness  in 
the  neighborhood. 

In order to develop such a measure for color-region
saliency one has to ask the following questions: What
features in regions determine their saliency? How can
they be measured to reflect our sensory judgments? And
finally, how can they be combined to give the saliency
measure? We now address these questions and derive a
measure of color saliency.

4.1.1  Features  used  for  measuring  saliency 

Since the saliency of a color region depends on the
region features used, they must be carefully selected. Such
features should be: (i) perceptually important, (ii) easily
measurable, and (iii) fairly general to avoid subjective
bias.

1.  Color:  The  color  of  a  region  is  an  intrinsic  prop¬ 
erty  and  affects  a  region’s  self-saliency.  It  is  specified 
by  (s(R),v(R)),  where  s(R)  =  saturation  or  purity  of 
the  color  of  region  R,  and  v(R)  =  brightness,  and  0  < 
s(R),v(R)  <  1.0.  The  hue  of  colors  is  not  considered  to 
avoid  subjective  bias. 

2.  Region size: The size of a region is again an intrinsic
property and affects its self-saliency. It is chosen as a
feature based on the observation that regions that are
either  very  small  in  extent,  or  that  are  large  enough  to 
cover  the  entire  field  of  view,  do  not  often  attract  our 
attention.  Also,  very  large  regions  can  potentially  span 
more  than  one  object,  making  them  unsuitable  for  se¬ 
lection.  The  size  feature  is  expressed  by  the  normalized 
size  r(R)  =  Size(R)/Image-size. 

3.  Color  contrast:  The  color  contrast  a  region  shows  with 
its  neighbors,  affects  its  relative-saliency.  The  ratio¬ 
nale  behind  choosing  color  contrast  is  that  even  if  a 
region  has  an  interesting  intrinsic  color  it  may  not  be 
distinctive  if  all  its  neighbors  also  have  equally  inter¬ 
esting  colors,  unless  it  showed  the  greatest  contrast. 
It  is  difficult  to  express  color  contrast  in  a  numerical 
measure  that  can  account  for  the  variations  in  an  ob¬ 
server’s  judgement  with  the  conditions  of  observation, 
size,  shape,  and  absolute  color  of  the  stimuli.  Among 
the  empirical  formulas  designed  to  predict  the  observed 
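color differences, the CIE distance d(C_R, C_T) between two
color regions R and T with specific colors C_R =
(r_0, g_0, b_0)^T and C_T = (r, g, b)^T, given by

$$d(C_R, C_T) = \sqrt{\left(\frac{r_0}{r_0+g_0+b_0} - \frac{r}{r+g+b}\right)^2 + \left(\frac{g_0}{r_0+g_0+b_0} - \frac{g}{r+g+b}\right)^2}$$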

has gained acceptance. But even this measure does not
take into account the hues of the colors explicitly.
In the color contrast measure we chose, it is first
ascertained whether the hues of the two regions are
different, and then the extent of difference is judged using the
CIE distance measure d(C_R, C_T) in such a way that the
contrast between regions of different hue is emphasized.
The measure is given by c(R,T) below:


$$c(R,T) = \begin{cases} k_1\, d(C_R, C_T) & \text{if } R \text{ and } T \text{ are of the same hue} \\ k_2 + k_1\, d(C_R, C_T) & \text{otherwise} \end{cases}$$


where  ki  =  ^  and  kj  =  0.5,  so  that  0  <  c(R,T)  <  1,0. 

4.  Size  contrast:  The  size  contrast  is  a  feature  for  deter¬ 
mining  relative  saliency  and  is  chosen  because  it  deter¬ 
mines  if  a  region  is  mostly  in  the  background  or  in  the 
foreground.  The  size  contrast  of  a  region  R  with  respect 
to  an  adjacent  region  T  is  simply  the  relative  size  (area) 


and  is  given  by  t{R,  T)  =  min 


Since a region R has several neighboring regions in
general, the color contrast c(R) and size contrast t(R)
of a region R are measured relative to a best neighbor
T_best for each region, so that c(R) = c(R, T_best) and
t(R) = t(R, T_best). T_best is the neighboring region that is
ranked the highest when all neighbors are sorted first by
size, then by extent of surround, and finally by contrast
(size or color contrast as the case may be).
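The color contrast feature described above can be sketched as follows. Two points are assumptions that the source does not confirm: the CIE distance is taken to be the Euclidean distance between normalized (r, g) chromaticity coordinates, and k1 is set to 0.5/sqrt(2) purely so that the contrast stays within [0, 1] (the value of k1 is not legible in the source). Only k2 = 0.5 comes from the text. The `same_hue` flag is assumed to be supplied by the hue comparison of the two regions' color categories.

```python
import math

K2 = 0.5                     # offset for regions of different hue (from the text)
K1 = 0.5 / math.sqrt(2.0)    # ASSUMPTION: chosen only to keep the contrast <= 1


def chromaticity(color):
    """Normalized (r, g) chromaticity of an (r, g, b) triple."""
    r, g, b = color
    total = r + g + b
    if total == 0:
        return (1.0 / 3.0, 1.0 / 3.0)  # assumption: treat black as neutral
    return (r / total, g / total)


def cie_distance(c_r, c_t):
    """Assumed form of d(C_R, C_T): Euclidean distance in chromaticity space."""
    (x0, y0), (x1, y1) = chromaticity(c_r), chromaticity(c_t)
    return math.hypot(x0 - x1, y0 - y1)


def color_contrast(c_r, c_t, same_hue):
    """c(R, T): contrast is boosted by K2 when the two hues differ."""
    d = cie_distance(c_r, c_t)
    return K1 * d if same_hue else K2 + K1 * d
```

Under these assumptions, pure red against pure green of a different hue gives the maximum contrast of 1.0, while two colors with identical chromaticity and the same hue give 0.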


4.1.2  Combining features for self-saliency

To determine self-saliency from the chosen features, they
are weighted appropriately to reflect their importance.
The self-saliency measure chosen emphasizes purer and
brighter colors over darker and duller colors by choosing
the weighting functions for saturation and brightness
as f_1(s(R)) = 0.5 s(R) and f_2(v(R)) = 0.5 v(R),
respectively. The size of a region is given a non-linear
weight to de-emphasize both very small and very large
regions, as they do not often attract our attention. The
corresponding weighting function f_3 has sharply as well as
smoothly rising and falling phases determined by the
breakpoints t_1, t_2, t_3, t_4, as given in the equation below.^
Here n stands for the region size r(R).


Mn) 


tn(l— n) 

«i 

1  -  c-“" 

'  *1  -  C3tn{l  -  n  +  ti) 

,3g-«4(n-«,) 

.  0 


0  <  n  <  <1 
ti  <  n  <ti 
Ij  <  n  <  ta 
<3  <  n  <  <4 
t4  <  n  <  1.0 


(8) 


where  ti  =  0.1,  tj  =  u.^,  13  =  u.a,  14  = 
sj  =  1.0,  S3  =  0.7,  S4  =  10"®  and  ci  =  —  '',ci  = 

_?2iip>l,C3  =  and  n  =  sise 

of  region  R  =  r(R). 


4.1.3  Combining features for relative saliency

Once again, the chosen features are weighted
appropriately to determine relative saliency. The color
contrast is weighted linearly by a function f_4(c(R)) = c(R),
to emphasize regions showing greater contrast. The
relative size is exponentially weighted by a function
f_5(t(R)) = 1 -   to favor situations in which a
region and its best neighbor have approximately the same
size.®


4.1.4  Finding  self  and  relative  saliency 

Once  the  various  features  determining  self  and  rela¬ 
tive  saliency  are  appropriately  weighted,  they  reinforce 
each  other  so  that  the  self  and  relative  saliencies  can  be 
given by simple additive combinations of their individual
features. The self-saliency of a region R, denoted by
SS(R), is given as f_1(s(R)) + f_2(v(R)) + f_3(r(R)).
Similarly, the relative saliency of the region R, RS(R), is given
by f_4(c(R)) + f_5(t(R)). Finally, the overall saliency of
a region R is expressed by a linear combination of self
and relative saliency as SS(R) + RS(R), using the
following rationale. Any combination method should be
flexible enough to allow a region to be declared salient if
it shows good contrast (i.e., high relative saliency) even
though it may not be interesting on its own. Conversely,
a region that is interesting on its own but fails to become
interesting in the presence of neighboring regions should
not be chosen. On the basis of these observations alone,
nonlinear combining methods such as (SS(R) * RS(R))

^Such a function, along with the thresholds and rates of
change, was empirically derived from informal psychophysical
experiments (whose details will be skipped here) performed
using color regions of various sizes.

®Once again, this function was obtained by performing
informal psychophysical experiments.


or max(SS(R), RS(R)) are not suitable. If a region is
both  interesting  on  its  own  as  well  as  in  the  presence  of 
other  regions  in  the  scene,  then  it  must  be  given  more 
importance.  All  three  criteria  are  satisfied  when  the  two 
saliency  components  are  linearly  combined.  The  color 
saliency  of  a  region  R  is  therefore  given  by 

$$\text{Color-saliency}(R) = f_1(s(R)) + f_2(v(R)) + f_3(r(R)) + f_4(c(R)) + f_5(t(R)) \qquad (9)$$

The  saliency  measure  described  above  does  not  take 
into  account  perceptual  effects  such  as  simultaneous 
color  contrast,  color-filling,  etc.  Because  such  effects  do 
not  greatly  undermine  a  region  that  is  already  very  out¬ 
standing  (very  salient),  and  because  saliency  is  being 
used  to  rank  the  regions,  we  have  ignored  these  effects. 

The color regions in the image can now be ordered
using the saliency measure, and the few most significant
regions can be retained for selection (called salient
regions henceforth). The number of salient regions to be
retained can be determined when the selection mechanism
is integrated with a recognition system to perform
a specific task, and is therefore left unspecified here.
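The additive combination and the ranking step can be sketched as below. The weights f1 = 0.5 s, f2 = 0.5 v and f4(c) = c follow the text. The size weightings f3 and f5 are passed in as callables because their exact forms are not reproduced here; the zero-valued stand-ins `flat_f3` and `flat_f5`, the dictionary keys, and the default `keep=4` are hypothetical choices made only so the example runs.

```python
def self_saliency(s, v, r, f3):
    """SS(R) = f1(s) + f2(v) + f3(r), with f1 = 0.5*s and f2 = 0.5*v."""
    return 0.5 * s + 0.5 * v + f3(r)


def relative_saliency(c, t, f5):
    """RS(R) = f4(c) + f5(t), with f4(c) = c."""
    return c + f5(t)


def color_saliency(region, f3, f5):
    """Overall saliency SS(R) + RS(R) for a region given as a feature dict."""
    return (self_saliency(region["s"], region["v"], region["r"], f3)
            + relative_saliency(region["c"], region["t"], f5))


def most_salient(regions, f3, f5, keep=4):
    """Order regions by decreasing saliency and retain the top `keep`."""
    ranked = sorted(regions, key=lambda reg: color_saliency(reg, f3, f5),
                    reverse=True)
    return ranked[:keep]


# Hypothetical stand-ins for the size weighting functions (NOT the paper's).
def flat_f3(n):
    return 0.0


def flat_f5(t):
    return 0.0
```

For example, a saturated, bright, high-contrast region such as `{"s": 0.9, "v": 0.8, "r": 0.1, "c": 0.7, "t": 0.5}` is ranked ahead of a dull, low-contrast one, because every term enters the sum with a positive weight.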

4.1.5  Results 

We  now  illustrate  the  ranking  of  regions  produced  by 
the color saliency measure derived above. Figures 1c-1f
show the four most distinctive regions found by applying
the color-saliency measure to all the color regions
extracted from the scene shown in Figure 1a. Figures
3d-3f and 4c-4f show the few most salient regions found in
their  respective  scenes.  In  the  experiments  done  so  far, 
the  color-saliency  measure  was  found  to  select  fairly  large 
bright-colored  regions  that  showed  good  contrast  with 
their  neighbors,  and  appeared  perceptually  significant. 

4.2  Use  of  Salient  Color-based  Selection 
in  Recognition 

Data-driven  selection  based  on  salient  color  regions 
is  primarily  useful  when  the  object  of  interest  has  at 
least  one  of  its  regions  appearing  salient  in  the  given 
scene.  In  such  cases,  the  search  for  data  features  that 
match  model  features  can  be  restricted  to  the  salient 
regions,  thus  avoiding  needless  search  in  other  areas  of 
the  image.  By  selecting  salient  color  regions,  we  get 
a  small  number  of  groups  (a  region  is  itself  a  group), 
containing  several  features.  It  was  shown  in  [22]  that 
such  large-sized  groups  are  useful  for  indexing,  i.e.  to 
determine  which  regions  from  models  in  a  library  could 
correspond  to  a  given  group.  But  when  the  task  is  to 
recognize  a  single  object,  it  is  desirable  to  have  small¬ 
sized  groups.  For  this,  existing  grouping  techniques  can 
be  applied  to  the  data  features  found  within  the  color 
regions  to  obtain  reliable  small-sized  groups. 

We now estimate the search reduction that can be
achieved with such a selection mechanism. Let (M, N)
= total number of features (such as edges, lines, etc.)
in the model and image respectively. Let (M_R, N_R) =
total number of color regions in the model and image
respectively. Let N_s = number of salient regions that are
retained in an image. Let g = average size of a group of
data features, within a model or image. Let (G_M, G_N)
= number of groups formed (using any existing grouping
scheme) in the model and image respectively. Finally,
let G_{N_i} be the number of groups in the salient image
region i. Using the alignment method of recognition [6],
at least three corresponding data features are needed to
solve for the pose (appearance) of the model in the
image. If no selection of the data features is done, then the
brute-force search required to try all possible triples is
O(M^3 N^3). If selection is done by only grouping
methods (i.e., without color region selection), then the
number of matches that need to be tried is O(G_M G_N g^3 g^3),
since only triples within groups need to be tried. But
as  we  mentioned  before,  grouping  methods  often  make 
mistakes, so that not all groups contain features belonging
to a single object. In at least one such study [1], out
of the 150 or so groups isolated, about 83 groups actually
came from single objects. Most of the remaining 67
groups  would  not  yield  any  consistent  match  and  would 
represent  fruitless  search.  Consider  the  case  when  group¬ 
ing  of  data  features  is  done  within  all  the  color  regions. 
With  this,  the  grouping  is  more  reliable,  and  also,  the 
number  of  groups  is  smaller  (as  groups  straddling  re¬ 
gions  are  not  considered),  so  that  the  overall  effect  is 
to reduce the search. For example, with M = 200, N =
3000, g = 7, and G_M = 30, G_N = 430 (these numbers are
typical  of  indoor  scenes),  the  search  reduction  assuming 
70%  reliability  in  simple  grouping  to  >  95%  reliability  in 
grouping  within  color  regions  is  %  0.25  *  10*  which  is  a 
considerable improvement. Consider next the case when
grouping is restricted to salient color regions. The number of
matches further reduces to O(Σ_{i=1}^{N_s} G_{N_i} G_M g^3 g^3), since
only the groups in the salient regions need be tried.
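The three match-count expressions can be compared numerically with a short sketch. The function names are illustrative, and the per-region group counts passed to `salient_matches` in the usage note are hypothetical values, not figures from the paper; the other numbers are the indoor-scene values quoted in the text.

```python
def brute_force_matches(m, n):
    """All model triples against all image triples: M^3 * N^3."""
    return m ** 3 * n ** 3


def grouped_matches(g_m, g_n, g):
    """Triples tried only within pairs of groups: G_M * G_N * g^3 * g^3."""
    return g_m * g_n * g ** 3 * g ** 3


def salient_matches(g_m, groups_per_salient_region, g):
    """Sum over retained salient regions i of G_{N_i} * G_M * g^3 * g^3."""
    return sum(g_ni * g_m * g ** 3 * g ** 3
               for g_ni in groups_per_salient_region)


if __name__ == "__main__":
    print(brute_force_matches(200, 3000))      # no selection
    print(grouped_matches(30, 430, 7))         # grouping only
    print(salient_matches(30, [5, 3, 2, 1], 7))  # hypothetical salient regions
```

With M = 200, N = 3000 the brute-force count is about 2.2 x 10^17, while grouping with G_M = 30, G_N = 430 and g = 7 brings it to about 1.5 x 10^9; restricting to a few salient regions shrinks the sum further in proportion to the number of groups those regions contain.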

To get an estimate of the number of matches and
time taken for matching in real scenes when color-based
selection is used, we recorded the number of regions
(obtained by applying the segmentation algorithm of
Section 3), and the number of data features within
regions in some selected models and scenes (Figures 1 and
2 show typical examples of models and scenes tried).
The regions were ordered using the color saliency
measure and the four most salient regions were retained.
Then search estimates were obtained using the above
formulas, and assuming a grouping scheme that gives
a number of groups within regions that is bounded by
(the number of features in a region) / (average size of
the groups in a region). This is a
bound on the number of groups produced using simple
grouping schemes such as grouping g closely-spaced
parallel lines in the region. The result of such studies
is shown in Table I. As can be seen from this table, the
number of matches is always smaller when salient color
regions are used for selection. But the ultimate utility
of such a selection mechanism can be accurately gauged
only after it is integrated with a recognition system.
Current research is being directed towards this effort.

5.  COLOR-BASED  MODEL-DRIVEN 
SELECTION 

The  previous  section  described  a  data-driven  selection 
mechanism  that  was  meant  for  an  object  of  interest  hav¬ 
ing  some  salient  color  regions.  This  will  not  be  of  much 
help when the object of interest is not salient in color
(but salient in some other domain, say texture) or is not
salient  at  all.  In  such  cases,  the  color  description  of  the 
model  can  be  used  to  perform  selection.  We  now  describe 
one  such  color-based  model-driven  selection  mechanism. 
Here,  given  a  color-based  description  of  a  model  object, 
the  task  is  to  search  and  locate  color  regions  that  satisfy 
this  description.  The  use  of  model  information  to  con¬ 
strain  the  matching  of  model  features  to  image  features 
is  not  new.  Several  model-driven  search  restriction  tech¬ 
niques  such  as  generalized  Hough  transforms,  heuristic 
termination,  focal  features,  etc.  have  evolved  [4].  The 
emphasis  in  these  methods  was  on  geometric  constraints 
that  can  prune  the  search  space  during  the  matching 
stage  of  recognition.  The  approach  we  present  here,  on 
the  other  hand,  emphasizes  some  global  relational  infor¬ 
mation  about  model  color  regions  to  prune  the  search 
space  prior  to  matching.  It  also  provides  possible  cor¬ 
respondences  between  model  and  image  regions.  Such 
a  correspondence  can  further  reduce  the  complexity  of 
recognition  because  the  search  for  pairing  model  features 
to  data  features  can  be  restricted  now  to  these  corre¬ 
sponding  regions  rather  than  all  image  regions.  Color 
information  in  the  model  object  has  been  used  before  to 
search  for  instances  of  the  object  in  the  given  image  of 
a scene [19], [18]. These approaches represent model and
image  color  information  by  color  histograms  and  per¬ 
form  a  match  of  the  histograms.  Such  approaches  usu¬ 
ally  cause  a  lot  of  false  positive  identifications,  and  do 
not  explicitly  address  some  of  the  problems  that  arise 
in  going  from  a  model  object  to  its  instance  in  a  scene. 
Also,  since  they  do  not  supply  correspondence  between 
model  and  image  regions,  they  are  not  as  useful  for  re¬ 
ducing  the  search  in  recognition. 

In order for any scheme for model-driven selection to
be effective for reducing the search in recognition, it must
meet two requirements: (i) it must be sufficiently
selective to avoid a lot of false positive identifications that
cause needless search for matches, and (ii) it must be
sufficiently conservative to avoid a lot of false negatives,
causing recognition to fail when it should have succeeded.
A selection scheme can make false negatives if it does not
adequately take into account the various problems that
arise in going from a model object to its image in the
scene. An object may not appear the same in the scene as
it was in the model, because it has undergone pose
changes, or because it is occluded, or its colors appear
different in the current illumination conditions. In
addition, artifacts such as specularities, inter-reflections,
and shadows may also
cause changes in the appearance of the object. So how
can  a  model-driven  selection  mechanism  meet  these  two 
apparently  conflicting  requirements?  We  now  describe 
an  approach  to  model-driven  selection  that  meets  some 
of  these  requirements.  It  makes  a  particular  choice  of 
model  description  and  assumes  that  this  is  made  avail¬ 
able  to  it  for  selection.  Since  this  model  description  af¬ 
fects  the  way  our  approach  formulates  the  color-based 
model-driven  selection  problem,  it  is  described  first. 

5.1  Model  Description 

The color region information in the model^ (in
an image or view of the model, that is) is represented
as a region adjacency graph (RAG) M_G =
<V_m, E_m, C_m, R_m, S_m, B_rm, B_sm>, where V_m = color
regions in the model, E_m = adjacencies between color
regions, C_m(u) = color of region u ∈ V_m, R_m(u,v) =
relative size of region v w.r.t. region u, S_m(u) = size of
region u, B_rm = a bound on the relative size of
regions given by R_m, and B_sm = a bound on the absolute
size of regions given by S_m.

^The model description specifies a color view, that is, a

The above description exploits features of regions that
tend to remain more or less invariant in most scenes
where the model appears. If the color of a model region
is specified by its color category, then as we discussed
before, it tends to remain relatively stable (or changes
in a predictable way) under variations in illumination
conditions and pose changes. Similarly, the adjacency
information between two color regions tends to remain
more or less invariant in the different appearances of the
object, as long as the two regions are visible in the given
image and there are no occlusions. Finally, the relative
size of regions is preserved under changes of scale. But
it can undergo considerable changes if the pose of the
object changes, say when a region goes partially out of
view. The bound on the relative size changes in each
pair of adjacent regions, B_rm, indicates the extent of pose
changes that a selection mechanism is expected to
tolerate. Relative size changes can also occur due to
occlusions. By placing some loose bounds on the absolute size
changes as given by B_sm, the model description restricts
the changes that can be tolerated in the presence of
occlusions. For size changes in a region that go beyond the
bounds, that region will be considered no longer
recognizable, and then the selection will have to depend on
the evidence for other model regions in the image.

This  description  is  not  very  impoverished  and  has 
some  structural  information  about  color  regions  that 
can  be  used  to  restrict  the  number  of  false  positives, 
and  some  constraints  on  the  relative  and  absolute  size 
changes  that  can  be  used  to  restrict  the  number  of  false 
negatives  made  by  the  selection  mechanism. 

Finally, the model description gives a way to
analogously organize the color region information in the
image as an image region adjacency graph I_G =
<V_I, E_I, C_I, R_I, S_I>, where each term has a meaning
analogous to <V_m, E_m, C_m, R_m, S_m>, respectively.
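A region adjacency graph of this kind can be held in a small data structure, sketched below. The class and method names are illustrative, and the definition of relative size as the ratio S(v)/S(u) is an assumption; the text only states that R(u,v) is the relative size of v with respect to u.

```python
from dataclasses import dataclass, field


@dataclass
class RegionAdjacencyGraph:
    """Sketch of a RAG <V, E, C, R, S> over hashable region ids."""
    color: dict                                # C(u): region id -> color category
    size: dict                                 # S(u): region id -> size in pixels
    edges: set = field(default_factory=set)    # E: unordered pairs of region ids

    def add_adjacency(self, u, v):
        self.edges.add(frozenset((u, v)))

    def adjacent(self, u, v):
        return frozenset((u, v)) in self.edges

    def relative_size(self, u, v):
        """R(u, v): size of v relative to u (ASSUMED to be S(v) / S(u))."""
        return self.size[v] / self.size[u]


model = RegionAdjacencyGraph(
    color={"a": "red", "b": "blue", "c": "red"},
    size={"a": 100, "b": 50, "c": 25},
)
model.add_adjacency("a", "b")
```

Under the ratio assumption, scaling every region size by the same factor leaves `relative_size` unchanged, which matches the observation that relative size is preserved under changes of scale.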

5.2  Formulation  of  the  Color-based 
Model-driven  Selection  Problem 

In  this  section  we  will  formulate  the  color-based 
model-driven  selection  problem  as  a  type  of  subgraph 
matching  problem.  Given  the  image  region  adjacency 
graph, the model object, if present in the scene
represented in the image, will form a subgraph in I_G. The
location strategy can be regarded as the problem of searching
for suitable subgraphs that satisfy the model description.
Any such subgraph I_g = <V_g, E_g, C_g, R_g, S_g> such

range  of  2D  views  of  the  model  in  which  one  or  more  of 
the  color  regions  described  in  the  model  are  visible.  If  the 
model  has  some  views  showing  an  entirely  different  set  of 
color  regions,  then  they  must  be  specified  as  separate  color 
views. 


that ||V_g|| ≤ ||V_m||, ||E_g|| ≤ ||E_m||, has associated with
it a node correspondence vector T = {(u_m, u_g) | u_m ∈
V_m, u_g ∈ V_g ∪ {⊥}, ⊥ is a null match}. Although
there are an exponential number of such subgraphs, not
all of them correspond to the model RAG. From the model
description  a  set  of  unary  and  binary  constraints  could 
be  derived  (as  is  described  later)  that  make  only  some 
subgraphs  feasible.  A  feasible  subgraph  is,  therefore,  a 
subgraph  that  has  all  its  nodes  satisfying  unary  and  bi¬ 
nary  constraints.  For  model-driven  selection,  since  it  is 
desirable  to  have  at  most  one  image  subgraph  match¬ 
ing  the  model  RAG,  we  can  select  from  among  these 
subgraphs,  a  subgraph(s)  that  in  some  sense,  best  satis¬ 
fies  the  model  description.  Here  we  formulate  the  color- 
based  model-driven  selection  as  the  problem  of  choosing 
a  feasible  subgraph(s),  Ig  that  minimizes  the  following 
measure: 


SCORE(/,)  = 


^  IIKnir 


where R_mg(u_m, v_m, u_g, v_g) with T(u_m) = u_g, T(v_m) =
v_g expresses the change in the relative size when
adjacent model regions (u_m, v_m) are paired to
corresponding image regions (u_g, v_g) and is given

by  Rmg{^,Vm,^g,Vg)  =  ’ 

SCORE(I_g) emphasizes rewards for making as many
correspondences as possible, as indicated by the first term,
called Match(I_g), and penalties for a mismatch of the
relative size, as indicated by the second term, called
Deviation(I_g), which measures the mean square
deviation of the relative sizes. Since the subgraphs are all
feasible, the deviation accounts for occlusions and pose
changes  in  a  more  refined  way  than  the  binary  con¬ 
straints  alone.  Another  advantage  of  this  measure  is 
that  it  can  be  incrementally  computed  from  individual 
region  matches,  so  that  a  branch-and-bound  search  for¬ 
mulation  can  be  used  to  reduce  considerably  the  search 
involved  in  finding  the  best  subgraph  (i.e.  the  one  with 
the  lowest  score).  Finally,  the  above  formulation  is  based 
on  the  hypothesis  that  at  least  one  of  the  regions  in  the 
isolated  subgraph  corresponds  to  a  model  region.  It  is 
also  designed  primarily  to  locate  single  instances  of  the 
model  object  in  the  image.  More  instances  can  be  found 
after  removing  the  regions  in  the  found  instance  from 
the  image  RAG. 


5.3  A  Color-based  Model-driven  Selection 
Mechanism 

A  color-based  model-driven  selection  mechanism  was 
built  using  the  above  formulation.  The  mechanism  es¬ 
sentially  uses  a  search  strategy  to  find  the  best  subgraph. 
The  result  of  selection  is  the  correspondence  vector  asso¬ 
ciated  with  the  best  subgraph.  The  search  strategy  used 
the  following  constraints  to  restrict  the  search  among 
feasible  subgraphs. 

1. Unary constraints: The color and absolute region size
information provided in the model description were used
to develop unary constraints on these features. The
color C_g(u_g) of an image region u_g is said to match
the color C_m(u_m) of a model region u_m if these colors
belong to the same category or compatible categories
(described in Section 2.4). With this scheme, brighter
colors (of a given hue) in the model could potentially
match darker colors of the same overall hue in the
image, thus accounting for a simple lowering of illumination
levels. The bounds on the absolute size provided by
B_sm act as a loose size constraint to rule out some clearly
absurd scale changes (such as, say, a 100-fold increase
in the smallest model region implying a blowup of the
model outside the image bounds).

2.  Binary constraints: The adjacency (as well as
non-adjacency) and relative size information provided in the
model were used as binary constraints to prune some
impossible subgraphs. Specifically, the lack of adjacency
in model regions is a powerful constraint, because two
adjacent regions in the image cannot correspond to two
regions that are not adjacent in the given color
description (assuming a rigid model). Two adjacent regions in
the model may, however, not appear adjacent in a given
image due to occlusion. A simple analysis of occlusions
could rule out several false matches in such cases (such
as, say, discarding a match if the area spanned by the
occlusion within a rectangle enclosing the candidate
non-adjacent image regions far exceeds the combined size of
the corresponding adjacent model regions). The bound
on the relative sizes served as another binary constraint.
The bound B_rm was used to constrain possible matches
by requiring R_mg(u_m, v_m, u_g, v_g) < B_rm(u_m, v_m).

3. Searching for the best subgraph

The search for the best subgraph (i.e. the subgraph
that minimizes the value of SCORE) can, in principle, be
done by an exhaustive enumeration of subgraphs. With
the algorithm described below, however, the search required
is reduced to a large extent. The algorithm used is
essentially a variation of the branch and bound interpretation
tree (IT) search [4], with the major difference being
that no verification is done when the search reaches
a leaf node (as the task is selection and not recognition).
Each level of the search tree represents a possible
match for a model region (this includes a null match),
so that the depth of the search tree is fixed by the number
of nodes in the model RAG. The unary constraints
are checked a priori to prune the breadth of the search
tree. A subgraph in the image RAG that is a potential
match for the model RAG is represented by a path in
the IT. The value of SCORE is updated at each node as

$\mathrm{SCORE}_{j+1} = \mathrm{SCORE}_j - [\text{illegible}] + [\text{illegible}]$

By keeping the lowest value of SCORE so far, search can be cut off
below any node with a Deviation($I_j$) value greater than
the lowest SCORE value. In practice, the unary and binary
constraints prune the search tree considerably, so
that the average number of full paths (up to the leaves)
explored is small (≈ 50). Finally, after an instance of the
model object has been found in the image, the selected
area is removed and the search repeated on the resulting
image RAG to look for more instances of the model
object.
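
The three steps above can be sketched as a small branch and bound
interpretation tree search. This is a minimal sketch under stated
assumptions, not the authors' implementation: the function names, the
per-match `cost`, the `null_cost` penalty and the `pair_ok` predicate are
hypothetical stand-ins for the SCORE terms and for the unary and binary
constraints. The running score is assumed to be a sum of non-negative
costs, so a partial path whose score already reaches the best complete
score can be cut off. The sketch also assumes that one image region is
matched to at most one model region.

```python
def it_search(model_regions, candidates, pair_ok, cost, null_cost):
    """Branch and bound search over an interpretation tree.

    model_regions: list of model region ids, one tree level per region.
    candidates:    dict mapping a model region to the image regions that
                   already passed the unary constraints.
    pair_ok:       binary constraint, pair_ok(m1, i1, m2, i2) -> bool.
    cost:          non-negative cost of matching model region m to image
                   region i (hypothetical stand-in for the SCORE terms).
    null_cost:     non-negative cost of a null match.
    Returns (best_score, best_match); best_match[k] is the image region
    matched to model_regions[k], or None for a null match.
    """
    best_score = float("inf")
    best_match = None

    def descend(level, match, score, used):
        nonlocal best_score, best_match
        if score >= best_score:
            return  # bound: this path cannot beat the best leaf so far
        if level == len(model_regions):
            best_score, best_match = score, list(match)  # leaf, no verification
            return
        m = model_regions[level]
        for i in candidates.get(m, []):
            if i in used:
                continue
            # Check binary constraints against every earlier non-null match.
            consistent = all(
                prev is None or pair_ok(pm, prev, m, i)
                for pm, prev in zip(model_regions, match)
            )
            if consistent:
                match.append(i)
                used.add(i)
                descend(level + 1, match, score + cost(m, i), used)
                match.pop()
                used.remove(i)
        # The null match is always available at every level.
        match.append(None)
        descend(level + 1, match, score + null_cost, used)
        match.pop()

    descend(0, [], 0, set())
    return best_score, best_match


# Toy example: model regions A and B are NOT adjacent in the model,
# while image regions 1 and 2 are adjacent in the image.
model_adjacent = set()
image_adjacent = {frozenset((1, 2))}

def pair_ok(m1, i1, m2, i2):
    # Adjacent image regions cannot match non-adjacent model regions.
    if frozenset((i1, i2)) in image_adjacent and frozenset((m1, m2)) not in model_adjacent:
        return False
    return True

costs = {("A", 1): 1, ("A", 3): 3, ("B", 2): 2}
best_score, best_match = it_search(
    ["A", "B"], {"A": [1, 3], "B": [2]}, pair_ok,
    lambda m, i: costs[(m, i)], null_cost=10,
)
print(best_score, best_match)
```

In the toy example the cheapest unary match for A (image region 1) is
rejected by the non-adjacency constraint once B is matched to region 2, so
the search settles on regions 3 and 2.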


5.4  Results 

The results of using color-based model-driven selection
are illustrated in Figure 2. Figure 2a shows a
model  object,  and  its  color  description  obtained  by  using 
the  color-region  segmentation  algorithm  of  Section  3  is 
shown  in  Figure  2b.  Here  the  background  was  removed 
by  a  simple  threshold  on  intensities.  This  description  is 
used  to  create  a  model  RAG.  Figure  2c  shows  a  scene  in 
which  the  model  object  occurs.  The  scene  shown  has  sev¬ 
eral  other  objects  with  one  or  more  of  the  model  colors. 
Also,  the  model  appears  in  a  different  pose  here,  being 
rotated  to  the  left  about  the  vertical  axis.  Figure  2d 
shows  the  result  of  applying  the  unary  color  constraints. 
The  big  blue  glass  matches  the  small  blue  flowers  based 
on color alone. Next, the unary constraints on absurd
size changes are used to prune the possibilities and the
result is shown in Figure 2e. Finally, the subgraph with
the lowest value of SCORE is shown in Figure 2f. As can
be seen from this figure, a region containing most of the
model object has been identified even though the color
image segmentation was not perfect (notice the small
streak above the white rim of the cup that merges with
the book in the background).

5.5  Search  Reduction  using  Color-based 
Model-driven  Selection 

The color-based model-driven selection mechanism
provides a correspondence between model regions and some image
regions.  The  matching  of  model  features  to  image  fea¬ 
tures  can  be  restricted  to  within  corresponding  regions, 
and  this  reduces  the  number  of  matches  that  need  to  be 
tried  for  recognition.  To  reduce  the  search  further,  con¬ 
ventional  grouping  can  be  performed  within  the  selected 
color  regions,  as  described  in  Section  4.2.  To  estimate 
the  search  reduction  in  this  case,  we  continue  with  the 
analysis done in that section. Let $N_s$ be the number
of solution subgraphs given by the selection mechanism,
and let $I_k$ represent one such subgraph, with number
of nodes $N_k$. Let $(G_{u_j}, G_{v_j})$ be the number of
groups in region $u_j$ of the solution subgraph $I_k$ and in
region $v_j$ of the model RAG that corresponds to $u_j$, as
implied by the correspondence vector $T$ associated with
$I_k$. Then assuming, as before, that the average size of a
group is $g$, the number of matches that need to be tried
is $O([\text{illegible}]\; G_{u_j} G_{v_j}\, g^3 g^3)$. To compare this kind
of selection with pure grouping we can take some typical
values of these numbers. Letting $M = 200$, $N = 3000$,
$g = 7$, $G_M = 30$, $G_N = 430$, $G_{u_j} = 8$, $G_{v_j} = 5$,
$N_s = 3$, $N_k = 5$, we have the number of matches with
grouping alone to be $O(G_M G_N g^3 g^3) \approx 1.56 \cdot 10^9$, and
using model-driven color-based selection with grouping,
the number of matches becomes $\approx 1.25 \cdot 10^8$. Assuming
1 microsecond as the time per match, this corresponds to a
reduction in match time from 26 minutes to ≈ 2 minutes.
By trying several models and images of scenes in which
they occurred, we recorded the average number of subgraphs
generated by the model-driven selection mechanism.
The search estimates were obtained using the
above formula for model-driven selection with grouping,
and the formulas for other methods mentioned in Section
4.2. The results are shown in Table II. The bound on
the number of groups in a region was the same as used in
Section 4.2. As can be seen from the table, the number of
matches using correspondence between model and image
color regions is always lower. A curious feature to note
from the table is that it takes fewer matches
(and hence less time) for a more complex model (entry
1 in Table II) containing several color regions than
for a simple object with fewer regions (entry 2 in Table
II). This is understandable since, with a large number of
regions, the constraints are stronger and hence the false
matches are fewer.
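
The match times quoted in this section follow from the assumption of 1
microsecond per match. The arithmetic can be sketched as below, assuming
the match counts are 1.56e9 (grouping alone) and 1.25e8 (model-driven
selection with grouping); these exponents are inferred here from the
quoted times of 26 and about 2 minutes, and the function name is purely
illustrative.

```python
def match_time_minutes(num_matches, seconds_per_match=1e-6):
    """Convert a number of matches into minutes of matching time."""
    return num_matches * seconds_per_match / 60.0

# Assumed match counts (exponents inferred from the quoted times).
grouping_only = match_time_minutes(1.56e9)
with_selection = match_time_minutes(1.25e8)

print(f"grouping alone:         {grouping_only:.1f} min")
print(f"selection and grouping: {with_selection:.1f} min")
print(f"speedup factor:         {grouping_only / with_selection:.1f}")
```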

Discussion: The above studies estimated the search reduction
without actually integrating the selection mechanism
with a recognition system. Moreover, the estimated
search was based on the assumption that there
were no false negatives given by the selection mechanism.
This assumption can fail, since a subgraph with the lowest
value of SCORE may not always indicate a match
to the model. To estimate the number of false positives,
the number of false negatives, and the reduction
in search that results from this color-based selection
mechanism, we have recently developed a 3D from 2D
recognition system and are currently testing it. Preliminary
results on using the selection mechanism as a
front-end for recognition have so far been encouraging.

6.  SUMMARY 

In  this  paper  we  have  shown  how  color  can  be  used 
as  a  cue  to  perform  both  data  and  model-driven  selec¬ 
tion.  Unlike  other  approaches  to  color,  we  have  used  the 
intended  task  to  constrain  the  kind  of  color  information 
to be extracted from images. This led to a fast color
image segmentation algorithm based on perceptual
categorization of colors to give perceptually different color
regions. This color description of the image formed the
basis  of  data  and  model-driven  selection.  A  saliency 
measure  was  then  developed  to  rank  the  color  regions 
to  perform  data-driven  selection.  Lastly,  an  approach 
to  model-driven  selection  was  presented  that  exploited 
the description of model color regions to locate instances of
the model in the image. Finally, we regard color as one of the
many  cues  that  can  be  used  for  selection.  Future  research 
is  directed  towards  using  other  cues  such  as  texture  to 
perform data and model-driven selection.

Figure 3: Illustration of color region segmentation and
color-saliency. (a) Input image consisting of regions of
3 different colors: red, green and blue against an almost
white background. (b) Result of Step 2 of the algorithm with
regions colored differently from the original image. (c)
Final segmentation of the image of Fig. 3a. (d)-(f)
The three most distinctive regions found using the color
saliency measure.


Figure 4: Illustration of color region segmentation and
color-saliency, last example. (a) Input image depicting
a scene of different kinds of objects (cloths and polished
book). (b) The color regions extracted from (a) using
the color region segmentation algorithm. (c)-(f) The
four most distinctive regions detected using the
color-saliency measure.

Table I: Search reduction using color-based data-driven selection. The last column
shows the match time when color-based data-driven selection is combined
with grouping. The color-based selection is done by choosing the four most
salient regions. Here g = 7, Time per match = 1 microsecond, and the
grouping method is as described in text.

Table II: Search reduction using color-based model-driven selection. The last column
shows the match time when model-color-based selection is combined with
grouping. Here g = 7, Time per match = 1 microsecond, and the grouping
method is as described in text.


Abstract

We show that the set of 2D images produced
by the point features of a rigid 3D model can
be represented with two lines in two
high-dimensional spaces. These lines are the
lowest-dimensional representation possible. We use
this result to build a system for representing
in a hash table, at compile time, all the images
that groups of model features can produce.
Then at run time a group of image features can
access the table and find all model groups that
could match it. This table is efficient in terms
of space, and is built and accessed through
analytic methods that account for the effect of
sensing error. In real images, it reduces the
set of potential matches by a factor of several
thousand. We also use this representation
of a model's images to analyze two other
approaches to recognition: invariants, and
non-accidental properties. These are properties of
images that some models always produce, and
all other models either never produce (invariants)
or almost never produce (non-accidental
properties). In several domains we determine
when invariants exist. In general we show that
there is an infinite set of non-accidental
properties that are qualitatively similar.¹

1  Introduction 

Object recognition systems typically search for matches
between image features and model features that are
consistent with some transformation of the model into the
image. This search can require a great deal of computation,
particularly in challenging domains such as the
recognition of 3D objects from cluttered 2D images, and
the identification of an object from a large data base of
possible objects. One approach to handling this complexity
is to decompose the recognition task so that as
much of the work as possible is done on the models alone,
at compile time, and on the image alone, in a bottom up
process. The results of these two separate computations
are then combined in a simple comparison.

¹Support for this research was provided in part by the
University Research Initiative under Office of Naval Research
contract N00014-86-K-0685, and in part by the Advanced

Such  an  approach  to  recognition  can  take  the  form 
of  indexing,  combined  with  grouping.  By  indexing,  we 
mean  that  the  system  places  pointers  to  groups  of  model 
features  in  a  hash  table,  at  compile  time.  At  run  time, 
a  group  of  image  features  accesses  the  hash  table  to  find 
all the model groups that might match it. If we want
efficient  run-time  processing,  image  groups  will  each  ac¬ 
cess  the  table  in  only  a  single  place.  That  means  that 
the  table  must,  in  some  form,  represent  the  entire  set  of 
images  that  each  model  group  could  produce.  This  pa¬ 
per  determines  the  most  space-efficient  possible  method 
for  representing  a  3D  model’s  point  features  in  such  a 
hash  table.  It  also  presents  a  method  for  analytically 
determining  which  entries  to  make  in  such  a  table. 

Indexing  may  be  combined  with  grouping.  Grouping 
selects  a  relatively  small  set  of  groups  of  model  features 
to  enter  in  the  hash  table.  It  then  selects  a  small  num¬ 
ber  of  groups  of  image  features  in  a  model-independent, 
bottom  up  process.  This  avoids  the  combinatoric  ex¬ 
plosion  that  occurs  when  all  possible  image  groups  and 
model  groups  are  considered.  With  effective  grouping, 
large  groups  become  desirable  input  to  the  indexing  sys¬ 
tem,  because  they  provide  the  greatest  discriminatory 
power.  That  is,  fewer  model  groups  will  match  a  large 
image  group  than  a  small  image  group.  See  [4]  for  fur¬ 
ther  discussion  of  this  point.  In  this  paper  we  focus  on 
the  problem  of  building  an  indexing  system  appropriate 
for  use  with  grouping. 

The  central  problem  of  indexing  is  to  determine  the 
most  economical  possible  representation  for  the  set  of 
image  groups  that  each  model  group  might  produce.  It 
is  trivial  to  do  indexing  by  making  a  different  entry  in  the 
hash  table  for  every  different  image  group  a  model  group 
can  produce,  but  this  would  require  excessive  space.  So 
instead  we  seek  a  representation  of  images  such  that  min¬ 
imal  space  is  required  to  describe  all  these  images. 

In  2D  domains  this  has  been  done  with  invariant  de¬ 
scriptions,  that  is,  with  descriptions  of  images  that  have 
the  property  that  every  image  of  a  given  model  group 
produces  the  same  description.  Invariants  have  been 
found  that  can  capture  all  the  information  in  the  im¬ 
age  group  that  can  be  used  to  match  it  to  model  groups, 
while allowing us to represent each model group with
a single entry in the indexing hash table. However, it
has been shown that such invariants do not exist for
the description of 3D models [3], [4], [16]. Furthermore,
[4] shows that to perform indexing of 3D models, each
model must be represented by a 2D surface in a single
index space. Unfortunately, large amounts of space are
required to accurately represent such a 2D surface
discretely. In this paper, however, we show that the images
of a model group can be represented as a 2D surface
that can be canonically decomposed into two 1D surfaces,
represented in two orthogonal index spaces. This
result makes the space requirements of indexing reasonable,
because we can discretize two 1D surfaces using
much less space than would be required to discretize one
2D surface.

Our  results  also  provide  a  simple  conceptualization  of 
the  recognition  problem,  which  we  use  to  analyze  two 
other  approaches  to  recognition.  First,  when  we  con¬ 
strain  our  universe  of  objects  to  contain  only  some  collec¬ 
tions  of  3D  points,  the  question  resurfaces  as  to  whether 
invariant  functions  exist.  We  use  our  previously  stated 
result  to  answer  this  question  in  several  domains.  We 
also  consider  a  second,  related  approach  to  recognition. 
This  approach  uses  non-accidental  properties,  which  are 
defined  as  properties  of  an  image  that  some  models  pro¬ 
duce  from  all  viewpoints,  and  other  objects  only  produce 
from  almost  no  viewpoints.  A  small  set  of  non-accidental 
properties  have  been  used.  We  show  that  they  are  a  few 
instances  of  an  infinite  class  of  such  properties,  all  of 
which  are  qualitatively  similar. 

Another  problem  left  unanswered  in  previous  3D  in¬ 
dexing  systems  was  the  analytic  computation  of  the  en¬ 
tries  to  make  in  the  hash  table.  Previously,  this  has 
been  done  only  by  sampling  the  set  of  possible  view¬ 
points  of  each  model.  In  this  paper  we  present  an  ana¬ 
lytic  method  of  building  the  indexing  table,  which  we 
combine  with  an  analytic  method  of  accounting  for  the 
effects  of  sensing  error.  These  results  have  allowed  us 
to  build  a  practical  indexing  system,  which  we  demon¬ 
strate  using  real  images.  The  resulting  indexing  system 
produces  speedups  of  up  to  a  factor  of  several  thousand 
over  brute  force  search. 

2  Describing  the  Images  a  Model 
Produces 

This  section  describes  a  compact,  analytically  deter¬ 
mined  representation  for  the  set  of  all  images  that  a 
general  3D  model  may  produce.  Models  are  assumed 
to  consist  of  any  arbitrary  collection  of  ordered  3D  point 
features.  We  ignore  sensing  error  when  describing  the 
model  because  we  account  for  error  with  regard  to  a 
particular  image  of  the  model,  at  lookup  time.  We  are 
then  able  to  describe  all  images  of  a  model  with  two 
straight  lines  located  in  two  orthogonal  spaces.  These 
lines  can  easily  be  derived  analytically  from  the  model. 
This  representation  is  optimal  in  the  dimensionality  of 
the  space  required,  and  in  introducing  no  false  positive 
or  false  negative  information. 

In describing the images that a model produces we use
the following novel model of projection. First, we assume
that the object is imaged from an arbitrary viewpoint
using  orthographic  projection  with  scale.  Orthographic 
projection  with  scale  is  a  common  approximation  to  per¬ 
spective  projection.  Next,  we  allow  an  arbitrary  affine 
transform  to  be  applied  to  the  resulting  image.  Applying 
an  arbitrary  affine  transform  to  an  image  is  equivalent  to 
viewing  that  image  from  an  arbitrary  position,  assuming 
that  this  projection  also  is  a  scaled  orthographic  projec¬ 
tion.  Therefore,  our  projection  model  encompasses  all 
images  that  a  model  might  produce,  as  well  as  all  im¬ 
ages  that  a  photograph  of  the  model  might  produce. 

There  are  several  reasons  for  using  this  projection 
model.  As  we  will  see,  it  is  mathematically  convenient. 
But  in  addition,  it  allows  us  to  build  an  indexing  system 
that  can  recognize  photographs  of  objects.  This  also 
suggests  the  hypothesis  that  human  ability  to  interpret 
photographs  is  an  epiphenomenon  of  the  fact  that,  for 
computational  reasons,  our  visual  system  does  not  make 
use  of  features  of  an  image  that  vary  under  affine  trans¬ 
forms  (see  [6]  for  discussion  of  a  related  hypothesis). 
Finally,  it  may  be  easily  shown  that  this  model  of  pro¬ 
jection  is  equivalent  to  multiplying  3D  model  points  by 
an  arbitrary  two  by  three  matrix,  and  then  adding  an 
arbitrary  translation  vector.  This  seems  to  be  the  sim¬ 
plest  linear  projection  model  from  3D  to  2D  (see  [19]  for 
further  discussion  of  this  projection  model). 

We now show that under this model of projection, the
set of images produced by any model is described by
the cross product of two lines in two orthogonal spaces.
To do this, we represent images as follows. As an image
consists of 2D point features, we use the first three points
to define an affine basis. That is, if we denote the image
points $(p_1, p_2, \ldots, p_n)$, let:

$o = p_1, \quad u = p_2 - p_1, \quad v = p_3 - p_1$

Then we may fully describe the locations of the remaining
points using affine coordinates derived with respect
to this basis. For example, we describe $p_4$ with the
parameters $(\alpha_4, \beta_4)$, where:

$p_4 = o + \alpha_4 u + \beta_4 v$

Then an image is fully described by the parameters
$(o, u, v, (\alpha_4, \beta_4), \ldots, (\alpha_n, \beta_n))$. It is important to what
follows that the affine coordinates of a point are left
unchanged by any affine transform ([13]).
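
The invariance of affine coordinates can be checked numerically. The
following is a minimal sketch in plain Python; the function names, the
sample points and the particular affine transform are illustrative
assumptions, not part of the original system. It computes the affine
coordinates of the fourth and fifth image points with respect to the
basis formed by the first three, applies an arbitrary 2D affine transform
to all points, and recomputes the coordinates.

```python
def affine_coords(o, u, v, p):
    """Solve p = o + a*u + b*v for (a, b) using Cramer's rule."""
    det = u[0] * v[1] - u[1] * v[0]
    if abs(det) < 1e-12:
        raise ValueError("basis points are collinear (degenerate case)")
    dx, dy = p[0] - o[0], p[1] - o[1]
    a = (dx * v[1] - dy * v[0]) / det
    b = (u[0] * dy - u[1] * dx) / det
    return a, b

def describe(points):
    """Affine coordinates of points 4..n w.r.t. the first three points."""
    p1, p2, p3 = points[:3]
    u = (p2[0] - p1[0], p2[1] - p1[1])
    v = (p3[0] - p1[0], p3[1] - p1[1])
    return [affine_coords(p1, u, v, p) for p in points[3:]]

def apply_affine(A, t, p):
    """Apply the 2D affine transform x -> A x + t to the point p."""
    return (A[0][0] * p[0] + A[0][1] * p[1] + t[0],
            A[1][0] * p[0] + A[1][1] * p[1] + t[1])

# Illustrative image of five ordered points.
image = [(0.0, 0.0), (2.0, 0.0), (0.0, 1.0), (1.0, 1.0), (3.0, -0.5)]

# An arbitrary (non-singular) affine transform, chosen for illustration.
A = [[1.5, -0.4], [0.3, 2.0]]
t = (5.0, -7.0)
warped = [apply_affine(A, t, p) for p in image]

before = describe(image)
after = describe(warped)
print(before)
print(after)
```

The two printed lists agree up to floating point rounding, while the
parameters $(o, u, v)$ of the two images differ.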

Due to the model of projection we use, we may ignore
the first three of these parameters. To see this, we note
that, except in degenerate cases, there exists an affine
transform that will map any three image points to any
other three image points. Therefore, under the type of
projection that we consider, if a model can produce the
image $(o, u, v, (\alpha_4, \beta_4), \ldots, (\alpha_n, \beta_n))$, it can also produce
the image $(o', u', v', (\alpha_4, \beta_4), \ldots, (\alpha_n, \beta_n))$ for any choice of
$(o', u', v')$, by combining the affine transform that maps
$(o, u, v)$ to $(o', u', v')$ with the affine transform that was
part of the projection that produced the original image.
Therefore, the parameters $(o, u, v)$ provide no information
about whether a model could produce an image.

The remaining image parameters form what we will
call an affine space. An image with $n$ ordered points is
mapped into a point in a $2(n-3)$-dimensional affine space
by finding the affine coordinates of the image points,
using the first three as a basis. We divide the affine
space into two orthogonal subspaces, an $\alpha$-space and
a $\beta$-space. The $\alpha$-space is the set of $\alpha$ coordinates of
the image's affine coordinates, and the $\beta$-space is similarly
defined. The affine space is then equal to the cross
product of the $\alpha$-space and the $\beta$-space, and each image
corresponds to a point in each of these two spaces. The
previous paragraph states that the images that a model
can produce are fully described by the locus of points
these images map to in affine space. We now show that,
for any model, these images map to the cross product of
lines in $\alpha$-space and $\beta$-space.

First  we  must  note  that  our  model  of  projection  can 
be  decomposed  into  one  part,  which  captures  the  ef¬ 
fects  of  viewing  direction,  and  a  second  part,  which  is  an 
affine  transform  of  the  projected  image.  We  can  think 
of  orthographic  projection  as  projecting  the  model  along 
some  viewing  direction  into  a  plane.  Following  this,  or¬ 
thographic  projection  also  allows  rotation  in  the  plane, 
translation,  and  scale,  but  we  can  ignore  these  parts  of 
the  transformation  by  folding  them  into  the  affine  trans¬ 
formation  that  we  allow  on  the  image,  following  the  or¬ 
thographic  projection.  To  find  the  parts  of  affine  space 
corresponding  to  images  of  a  particular  model,  we  only 
need  to  consider  the  sets  of  affine  coordinates  that  a 
model  may  produce  in  an  image  as  the  viewing  direction 
varies.  The  subsequent  affine  transformation  leaves  the 
affine  coordinates  unchanged. 

We now assume the model consists of at least five
points. Call the plane determined by the first three
model points the model plane. If we project the fourth
model point, $m_4$, perpendicularly into the model plane,
we call this point $m'_4$. Since $m'_4$ is in the plane of the
first three model points, we can discuss its affine coordinates
with respect to these three model points. We call
these affine coordinates $(a_4, b_4)$. Similarly, for the j'th
model point, $m_j$, we define $m'_j$ and $(a_j, b_j)$ (see figure
1). Without loss of generality, assume the model plane
is $z = 0$. So $z_4$ is the height of $m_4$ above this plane, and
$z_j$ is the height of $m_j$ above the model plane. We define
$r_j = z_j / z_4$.


We now show that for any affine coordinates $(\alpha_4, \beta_4)$,
there is a viewpoint in which the projection of $m_4$ has
those affine coordinates. We then express the affine
coordinates of the remaining projected model points as a
function of $(\alpha_4, \beta_4)$. Some point, $i'_4$, in the model plane
has affine coordinates $(\alpha_4, \beta_4)$. If we form a line including
$i'_4$ and $m_4$, this line describes a viewing direction
from which $m_4$ and $i'_4$ project to the same image point,
$i_4$. Since $i'_4$ is coplanar with the first three model points,
it has the same affine coordinates when viewed from any
direction, since affine coordinates of planar points are
not changed by any affine transformation, and viewing
a planar object from an arbitrary viewpoint is equivalent
to applying an affine transform. So $i_4$ has affine
coordinates $(\alpha_4, \beta_4)$.

Figure 1: The image points $i_1$, $i_2$, $i_3$, $i_4$, and $i_j$ are the
projections of the model points $m_1$, $m_2$, $m_3$, $m_4$, and
$m_j$, before the affine transform portion of the projection
is applied. The values of the image points depend on
the pose of the model relative to the image plane. In the
viewing direction shown, $i'_4$ and $m_4$ project to the same
image point. $m'_4$ is in the model plane, directly below
$m_4$. Note that $i_4$ has the same affine coordinates as $i'_4$.

A line parallel to the viewing direction will also pass
through $m_j$, and intersect the model plane at a point we
call $i'_j$. The affine coordinates of $i_j$, the image of $m_j$,
are the same as the affine coordinates of $i'_j$ in the model
plane. Since the line connecting $m_j$ to $i'_j$ is parallel to
the line connecting $m_4$ to $i'_4$, the triangle $m_j m'_j i'_j$ will be
similar to the triangle $m_4 m'_4 i'_4$, scaled by a factor of
$r_j$. In particular, this means that $(i'_j - m'_j) = r_j\,(i'_4 - m'_4)$,
and therefore:

$(\alpha_j, \beta_j) = (a_j, b_j) + r_j\,((\alpha_4, \beta_4) - (a_4, b_4)) \qquad (1)$


This equation describes all image parameters that these
five points may produce. For any image, this equation
will hold. And for any values described by the equation,
there is a corresponding image that the model may
produce, since, from above, we know that for any values
$(\alpha_4, \beta_4)$ there is a view of $m_4$ that produces these values.
Taking the $\alpha$ component of these equations we have
equations that describe a line in $\alpha$-space. We may derive
a similar set of equations in $\beta$-space. These equations are
independent. That is, for any set of $\alpha$ coordinates that
a model may produce in an image, it may still produce
any feasible set of $\beta$ coordinates. There are also degenerate
cases. If some of the model points are coplanar,
then some of the $r_j$ are infinite, and the line is vertical
in those dimensions. If all the model points are coplanar,
the affine coordinates of the projected model points are
invariant, and each model is represented by a point in
affine space. If the first three model points are collinear, then
the line is undefined.

Notice that for any line in $\alpha$-space, there is some
model whose images are described by that line. It is
not true that there is a model corresponding to any pair
of lines in $\alpha$-space and $\beta$-space, because the parameters
$r_j$ are the same in the equations for the two lines. This
means that the two lines are constrained to have the
same directional vector, but they are not further
constrained.
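
The line relation derived above can be checked numerically for a
five-point model. The sketch below is illustrative only: it assumes the
model plane is z = 0 with the first three model points at (0,0,0),
(1,0,0) and (0,1,0), so that the planar affine coordinates of a model
point are simply its x and y values, and it assumes the ratio
r_5 = z_5 / z_4. The model points, the views (arbitrary 2x3 matrices with
translations, as in the projection model of this section) and all
function names are hypothetical.

```python
def affine_coords(o, u, v, p):
    """Solve p = o + a*u + b*v for (a, b) using Cramer's rule."""
    det = u[0] * v[1] - u[1] * v[0]
    dx, dy = p[0] - o[0], p[1] - o[1]
    return ((dx * v[1] - dy * v[0]) / det, (u[0] * dy - u[1] * dx) / det)

def project(A, t, m):
    """Apply a 2x3 matrix A and a 2D translation t to the 3D point m."""
    return tuple(A[r][0] * m[0] + A[r][1] * m[1] + A[r][2] * m[2] + t[r]
                 for r in range(2))

def image_affine_coords(A, t, model):
    """Affine coordinates of the projected points 4..n w.r.t. the first three."""
    pts = [project(A, t, m) for m in model]
    o = pts[0]
    u = (pts[1][0] - o[0], pts[1][1] - o[1])
    v = (pts[2][0] - o[0], pts[2][1] - o[1])
    return [affine_coords(o, u, v, p) for p in pts[3:]]

# Hypothetical model: first three points span the plane z = 0.
model = [(0.0, 0.0, 0.0), (1.0, 0.0, 0.0), (0.0, 1.0, 0.0),
         (0.3, 0.4, 2.0), (0.7, -0.2, 1.0)]
a4, b4 = 0.3, 0.4          # planar coordinates of m'_4
a5, b5 = 0.7, -0.2         # planar coordinates of m'_5
r5 = 1.0 / 2.0             # assumed definition r_5 = z_5 / z_4

# Three arbitrary views, each a 2x3 matrix plus a translation.
views = [
    ([[1.0, 0.0, 0.5], [0.0, 1.0, -0.25]], (0.0, 0.0)),
    ([[2.0, 0.3, -1.0], [-0.4, 1.5, 0.8]], (3.0, -2.0)),
    ([[0.2, -1.1, 0.6], [0.9, 0.1, 0.05]], (-1.0, 4.0)),
]

residuals = []
for A, t in views:
    (alpha4, beta4), (alpha5, beta5) = image_affine_coords(A, t, model)
    # Both components should satisfy the same line equation with slope r5.
    residuals.append(abs((alpha5 - a5) - r5 * (alpha4 - a4)))
    residuals.append(abs((beta5 - b5) - r5 * (beta4 - b4)))

print(max(residuals))
```

The largest residual is at the level of floating point rounding, so the
point $(\alpha_4, \alpha_5)$ stays on one line and $(\beta_4, \beta_5)$ on another line with the
same direction, whatever view is chosen.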

In the absence of image error, indexing could proceed
in the following way. Form two hash tables which
discretize $\alpha$-space and $\beta$-space. For each model, compute
the corresponding line in each of these spaces, and make
an entry in each bucket that intersects one of these lines.
At run time, given an image, determine the points in
$\alpha$-space and $\beta$-space that correspond to that image. Look
in the appropriate bucket in each of the two spaces, and
intersect the results. If the buckets are made sufficiently
small, this will produce exactly the set of models that
could produce the image.
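
The error-free indexing scheme just described can be sketched for groups
of five points, where each of the two spaces is two-dimensional. This is
an illustrative sketch, not the system described in this paper: the class
and function names, the bucket width and the indexed range are
assumptions, and the buckets crossed by a line are found here by dense
sampling along the line rather than by an analytic method, so a bucket
that the line only clips at a corner could be missed.

```python
import math
from collections import defaultdict

BUCKET = 0.25          # assumed bucket width
RANGE = (-4.0, 4.0)    # assumed range of the first coordinate that is indexed

def bucket(point):
    """Index of the grid bucket containing a point."""
    return tuple(math.floor(c / BUCKET) for c in point)

def line_buckets(c4, c5, r5):
    """Buckets crossed by the line x5 = c5 + r5 * (x4 - c4), by sampling."""
    step = BUCKET / (4.0 * max(1.0, abs(r5)))
    n = int((RANGE[1] - RANGE[0]) / step)
    crossed = set()
    for k in range(n + 1):
        x4 = RANGE[0] + k * step
        crossed.add(bucket((x4, c5 + r5 * (x4 - c4))))
    return crossed

class TwoTableIndex:
    """One hash table per space; a lookup intersects the two results."""

    def __init__(self):
        self.alpha_table = defaultdict(set)
        self.beta_table = defaultdict(set)

    def add_model(self, name, a4, b4, a5, b5, r5):
        # Compile time: enter the model in every bucket its lines cross.
        for bk in line_buckets(a4, a5, r5):
            self.alpha_table[bk].add(name)
        for bk in line_buckets(b4, b5, r5):
            self.beta_table[bk].add(name)

    def lookup(self, alpha_point, beta_point):
        # Run time: one bucket per table, then intersect.
        in_alpha = self.alpha_table.get(bucket(alpha_point), set())
        in_beta = self.beta_table.get(bucket(beta_point), set())
        return in_alpha & in_beta

index = TwoTableIndex()
index.add_model("A", a4=0.3, b4=0.4, a5=0.7, b5=-0.2, r5=0.5)
index.add_model("B", a4=0.1, b4=0.9, a5=-1.5, b5=2.0, r5=-2.0)

# An image of model A: both points lie on A's lines.
hits = index.lookup((1.3, 1.2), (-0.6, -0.7))
# An image point that lies on neither model's line in the first table.
misses = index.lookup((1.3, 3.0), (-0.6, -0.7))
print(hits, misses)
```

Each model occupies a number of buckets roughly proportional to the
number of grid divisions along one axis in each table, which is the space
saving that motivates using two lines rather than one 2D surface.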

To  account  for  error,  for  a  particular  image  we  must 
determine  the  volumes  of  a-space  and  /3-space  that 
might  match  that  image  when  each  possible  set  of  errors 
is  assumed.  In  [12]  and  [9]  we  solved  this  problem  for 
images  consisting  of  four  points.  That  solution  tells  us 
the  range  of  affine  coordinates  that  a  fourth  image  point 
might  have  with  respect  to  the  other  three,  when  we  al¬ 
low  for  bounded  sensing  error.  By  applying  that  solution 
to  each  image  point  in  turn,  we  find  bounds  on  the  set 
of  possible  affine  coordinates  that  an  image  might  pro¬ 
duce.  This  method  is  conservative,  in  that  it  overstates 
the  effects  of  error  in  two  ways.  The  effect  of  error  on 
the  a  and  0  coordinates  is  not  independent,  but  we  sep¬ 
arately  determine  the  ranges  of  possible  a  and  /?  values 
that  can  occur.  Secondly,  we  treat  the  error  as  if  it  had 
an  independent  effect  on  each  pair  of  affine  coordinates. 
However,  error  in  the  three  basis  points  has  a  related 
effect  on  the  affine  coordinates  of  every  other  point.  In 
effect,  we  enclose  the  true  volume  of  affine  space  consis¬ 
tent  with  error  inside  the  smallest  possible  rectanguloid 
that  has  sides  parallel  to  the  axes  of  the  affine  space. 
While  this  may  produce  some  unnecessary  matches,  it 
can  not  cause  us  to  miss  any  correct  matches. 
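
Looking up an axis-parallel rectanguloid amounts to visiting every bucket it overlaps and taking the union of their entries. The sketch below assumes a uniform grid stored as a dictionary from bucket index tuples to sets of model ids; the function names and the clamping of the box to the table bounds are assumptions made here (the system described in Section 5 instead ignores groups whose rectanguloids leave the table).

```python
import math
from itertools import product


def buckets_in_box(box_lo, box_hi, table_lo, size, n):
    """Return the index of every bucket that overlaps the axis-aligned box."""
    ranges = []
    for lo, hi in zip(box_lo, box_hi):
        first = max(0, math.floor((lo - table_lo) / size))
        last = min(n - 1, math.floor((hi - table_lo) / size))
        ranges.append(range(first, last + 1))
    return product(*ranges)


def lookup_box(buckets, box_lo, box_hi, table_lo, size, n):
    """Union of the model sets stored in all buckets the box overlaps."""
    found = set()
    for idx in buckets_in_box(box_lo, box_hi, table_lo, size, n):
        found |= buckets.get(idx, set())
    return found
```

The candidate set for an image is then the intersection of one `lookup_box` call on the α table with one on the β table. Because each box encloses the true error region, this can add unnecessary matches but cannot drop a correct one.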

Representing each model group's images with two 1D
surfaces is an important improvement over using a single
2D surface, because it requires much less space to discretely
represent a pair of 1D surfaces than to discretely
represent  a  2D  surface.  There  is  a  run-time  price  that 
must  be  paid  for  this  space,  when  we  intersect  the  results 
of  two  separate  lookups.  However,  this  cost  is  negligible 
in the actual system we have built. We note that the
dimensionality  of  the  surfaces  used  to  represent  a  model 
group’s  images  is  now  the  best  that  can  be  done,  since 
[4] has shown that in a single index space, a 2D surface
is  required  to  represent  these  images,  and  it  is  not  possi¬ 
ble  to  represent  a  2D  surface  as  the  cross-product  of  any 
countable  number  of  countable  sets. 
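
The space saving can be made concrete for five-point groups, where α-space and β-space are each two-dimensional. The count below is a sketch under assumptions introduced here: a uniform grid with k intervals per axis, lines in general position, and the symbols n_α and n_β for the number of buckets crossed by a model's two lines.

```latex
% A bucket of the 4D index space is a product a \times b of 2D buckets,
% and the model's plane is the product A \times B of its two lines, so
(A \times B) \cap (a \times b) = (A \cap a) \times (B \cap b).
% The plane therefore meets a 4D bucket exactly when A crosses a and B crosses b:
N_{\text{single table}} = n_\alpha \, n_\beta ,
\qquad
N_{\text{two tables}} = n_\alpha + n_\beta ,
\qquad
n_\alpha ,\; n_\beta \le 2k - 1 .
```

The bound 2k - 1 holds because a line crosses at most k - 1 grid lines in each of the two directions. For k = 100 this gives at most 398 entries with two tables, against up to 199 × 199 = 39,601 with a single table.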

3  Invariants  and  Non-Accidental 
Properties 

Before  describing  this  indexing  system  further,  we  will 
consider  some  implications  of  this  view  of  the  recognition 
problem.  In  the  error  free  case,  we  have  almost  entirely 
reduced  the  recognition  problem  to  a  very  simple  form, 
in  which  recognition  is  the  problem  of  determining  which 
points  fall  on  which  lines.  To  demonstrate  the  usefulness 
of  this,  we  will  consider  two  influential  approaches  to 
recognition  from  this  point  of  view. 

A  number  of  recognition  systems  have  been  based  on 
invariants. In the context of recognition, an invariant is
a function of the image that has the following property:
if f is an invariant function, then for any model, m, if
i1 and i2 are images of m, then f(i1) = f(i2). That is,
an invariant is a property that is true of all images of a
model, and hence does not vary under the transformation
that turns a model into an image.

2D  recognition  systems  have  long  implicitly  relied  on 
the  descriptions  of  2D  objects  that  do  not  vary  as  the  ob¬ 
ject  is  rotated  or  translated  in  the  plane.  More  recently, 
invariants  have  been  used  for  the  recognition  of  planar 
models  from  arbitrary,  3D  views  ([22],  [13],  [14],  [7],  [8], 
[20]).  Recently,  [3],  [4],  and  [16]  have  proven  that  there 
are  no  non-trivial  invariants  when  models  may  consist 
of  arbitrary  collections  of  point  features.  These  proofs 
took the following form. Given any two models, M1 and
M2, a set of intermediate models, P1, P2 ... Pn, were
constructed. By this construction, M1 and P1 produce a
common image, hence any invariant function would have
to have the same value on any images these models produce.
Similarly, P1 and P2 produce a common image,
and so on, until M1 and M2 are linked by this series of
intermediate  models.  Hence,  the  invariant  function  must 
produce  the  same  value  for  images  of  any  two  models, 
and  is  trivial. 

Our  formulation  of  the  recognition  problem  makes  it 
easy  to  prove  a  number  of  results  about  invariants.  These 
results  will  only  apply  to  our  model  of  projection.  How¬ 
ever,  we  note  that  considering  our  projection  model  is 
equivalent  to  assuming  that  any  invariant  function  will 
be  both  invariant  for  the  orthographic  projection  of  a  set 
of  3D  models,  and  will  also  be  invariant  for  the  ortho¬ 
graphic  projection  of  2D  images  of  these  models. 

We  can  now  show  that  there  are  no  invariant  functions 
of 3D models with a proof that requires only two intermediate
models. Suppose model M1 corresponds to the
two lines, A1 and B1, in α-space and β-space respectively.
Similarly, suppose model M2 corresponds to A2 and B2.
Then there are an infinite number of lines that intersect
both A1 and A2. Choose one of these, A'1. Choose B'1 as
any line that is parallel to A'1, and intersects B1. Then
there is a model, P1, that corresponds to the lines (A'1,
B'1). P1 has an image in common with M1, since A'1
intersects A1, and B'1 intersects B1. We may then construct
P2 and its lines, (A'2, B'2), so that B'2 intersects B'1
and B2, and so that A'2 passes through the point where
A'1 and A2 intersect. So P2 will have an image in common
with P1 and M2. Therefore, any invariant function
must have the same value for any image of any of the
four models.

We may also use our previous results to examine other
questions  about  the  occurrence  of  invariants.  As  [16] 
points  out,  there  may  be  invariant  functions  for  a  par¬ 
ticular  set  of  3D  models,  if  the  models  can  be  divided 
into  non-trivial  equivalence  classes,  where  two  models 
are  equivalent  if  they  have  an  image  in  common,  or  are 
both  equivalent  to  another  model.  From  our  previous 
work, it becomes easy to form these equivalence classes
for  a  particular  set  of  models,  because  we  can  tell  that 
two models produce a common image if their corresponding
lines in α-space and in β-space intersect.
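
For five-point models, where both spaces are two-dimensional, forming these equivalence classes can be sketched with a line-intersection test and a union-find pass. The representation of a model as an (α-point, β-point, shared direction) triple and all function names are assumptions made for this illustration; integer or rational coordinates are assumed so that the comparisons are exact.

```python
def cross(u, v):
    """2D cross product (z component)."""
    return u[0] * v[1] - u[1] * v[0]


def lines_meet(p, d, q, e):
    """True if the 2D lines p + t*d and q + s*e share at least one point."""
    if cross(d, e) != 0:
        return True  # non-parallel lines in a plane always intersect
    offset = (q[0] - p[0], q[1] - p[1])
    return cross(offset, d) == 0  # parallel lines meet only if they coincide


def share_image(m1, m2):
    """Models are (alpha_point, beta_point, direction) triples."""
    a1, b1, d1 = m1
    a2, b2, d2 = m2
    return lines_meet(a1, d1, a2, d2) and lines_meet(b1, d1, b2, d2)


def equivalence_classes(models):
    """Group model names by the transitive closure of share_image."""
    parent = {name: name for name in models}

    def find(x):
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path halving
            x = parent[x]
        return x

    names = list(models)
    for i, u in enumerate(names):
        for v in names[i + 1:]:
            if share_image(models[u], models[v]):
                parent[find(u)] = find(v)
    classes = {}
    for name in names:
        classes.setdefault(find(name), set()).add(name)
    return sorted(classes.values(), key=min)
```

Two models with parallel, distinct lines stay in separate classes, but adding a single model whose lines are not parallel to theirs merges all three into one class.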

Another  question  that  arises  for  a  restricted  set  of 
models is whether there is an invariant function that
will  produce  no  false  positive  matches.  We  can  think 
of  an  invariant  function  as  assigning  values  to  models, 
because  it  assigns  the  same  value  to  every  image  of  a 
model.  An  invariant  function  leads  to  no  false  positives 
if  the  function  maps  a  model  and  an  image  to  the  same 
value  only  when  there  really  is  a  transformation  that  will 
cause  the  model  to  produce  that  image. 

To  answer  this  question  we  consult  our  representation 
of  a  model’s  images  as  planes  in  affine  space,  that  is, 
we  take  the  cross  product  of  the  model’s  pair  of  lines 
in α-space and β-space. We can see that there is an
invariant  function  with  no  false  positives  for  a  specific 
set  of  models  if  and  only  if  no  two  planes  that  correspond 
to  two  models  intersect  without  completely  coinciding. 
If  this  condition  is  met,  we  may  construct  an  invariant 
function  that  assigns  a  common  value  to  all  the  images 
that  fall  on  a  model’s  plane  of  images,  and  assigns  a 
different  value  to  any  two  images  that  lie  on  the  planes  of 
different  models.  Then,  any  model  that  can  produce  one 
image  with  a  particular  value  of  the  invariant  function 
will  be  able  to  produce  exactly  the  set  of  images  that 
have  that  value  of  the  invariant  function.  If,  on  the  other 
hand,  two  planes  intersect  and  do  not  coincide,  then  all 
images  that  either  plane  contains  must  have  the  same 
value  for  any  invariant  function,  but  neither  object  can 
produce  all  these  images,  so  false  positives  will  occur. 

From  this  result  we  can  see  that  a  specific  set  of  mod¬ 
els  can  have  an  invariant  function  with  no  false  positives 
only  if  it  is  a  measure  0  subset  of  the  set  of  all  possi¬ 
ble  models.  There  is  a  correspondence  between  the  set 
of  all  possible  models  and  the  set  of  all  planes  in  affine 
space,  subject  to  the  restriction  that  for  each  plane,  the 
directional vectors of the two lines it produces when projected
onto α-space or β-space must be the same. Any
set  of  such  planes  that  do  not  intersect  is  a  measure  0 
subset  of  the  set  of  all  such  planes.  For  example,  if  we 
have  a  restricted  set  of  models  corresponding  to  a  set  of 
non-intersecting  planes  in  affine  space,  this  means  that 
any  point  in  affine  space  can  belong  to  only  one  of  these 
planes,  although  it  belongs  to  uncountably  many  planes 
that  correspond  to  some  model. 

An  especially  interesting  special  case  is  that  of  mod¬ 
els  containing  five  points.  This  is  the  smallest  group 
that  can  produce  invariants,  because  in  general  any  four 
model  points  can  appear  as  any  four  image  points.  Most 
systems  based  on  invariants  have  used  the  smallest  pos¬ 
sible  model  groups,  in  order  to  limit  the  number  of  pos¬ 
sible  model  groups  they  must  consider  (this  is  true  of 
[13], [8], [20], for example). Each model group of
five points will produce a pair of lines in 2D
α-space and β-space. Furthermore, recall that these two
lines  will  have  the  same  slope,  which  means  that  two  dif¬ 
ferent  models  will  produce  lines  that  are  either  parallel  in 
both  spaces,  or  that  intersect  in  both  spaces.  Therefore, 
an  invariant  function  for  groups  of  five  points  is  possible 
only  when  all  lines  produced  by  all  models  are  parallel. 
For  example,  if  one  model  produces  lines  not  parallel  to 
the  others,  it  will  have  an  image  in  common  with  each 
of  them,  implying  that  the  invariant  function  must  be 
constant over all images produced by all models. The
lines produced by all models will be parallel only when
rs (see equation 1) is the same for all models, that is,
when the ratio of the height above the model plane of
the fourth point to the height of the fifth point is always
the same.

Figure 2: If an image corresponds to the point (0, 0)
in α-space, that means that the last two points in the
image are colinear with the first and the third points.
This is shown on the left, where the first three points
are shown as dots, and the second two points, shown as
open circles, must lie somewhere on the dashed line. If
the image corresponds to (2, 3) in α-space, the points
must fall on the two lines shown on the right. In both
cases we have an equivalent non-accidental property.

We now turn to the analysis of some non-accidental
properties. Lowe [15] first used these in his recognition
system, SCERPO, and Biederman [1] has based his
GEON approach to recognition on these properties. A
non-accidental  property  is  a  property  of  an  image  such 
that  some  models  only  produce  images  with  that  prop¬ 
erty,  while  for  other  models  either  no  or  almost  no  images 
they  produce  have  that  property.  The  work  of  Lowe  and 
Biederman has been based on identifying a small number
of such properties, such as parallelism, colinearity,
or  symmetry.  Our  above  work  now  allows  us  to  see  that 
there  is  an  infinite  class  of  such  properties,  and  that  there 
is  nothing  about  the  inherent  geometry  of  models  which 
makes  one  such  property  qualitatively  different  from  an¬ 
other. 

First,  we  illustrate  this  in  the  simplest  case  of  colinear¬ 
ity.  We  will  show  that  there  are  an  infinite  set  of  proper¬ 
ties  that  are  qualitatively  similar  to  colinearity.  That  is, 
if  we  ignore  the  distribution  of  models  in  our  universe  of 
possible  models,  there  is  nothing  we  can  say  about  colin¬ 
earity  that  we  can  not  also  say  about  an  infinite  number 
of  other  properties.  Suppose  we  have  an  image  group 
with  five  points,  and  the  first  point  and  last  three  points 
are colinear. This is equivalent to saying that α4 and
α5 of the image's affine coordinates both equal 0. That
is, any image which corresponds to the point (0, 0) in α-space,
and to any point in β-space, has those four points
colinear.  If  we  look  at  this  from  the  perspective  of  non¬ 
accidental  properties,  we  say  that  either  the  four  points 
are  colinear  in  the  model  that  produced  them,  in  which 
case  they  always  appear  colinear,  or  they  are  not  really 
colinear,  in  which  case  there  is  only  a  measure  0  chance 
that  they  would  appear  as  colinear.  We  would  then  infer 
that  we  should  match  these  colinear  image  points  with 
only  similarly  colinear  model  points. 

From  our  new  perspective,  we  would  say  that  there 
are  two  possibilities.  Either  the  model  that  produced 
this image is planar, and always produces images with
the values (0, 0) in α-space, or the model corresponds to
a line in α-space that passes through the origin. These
two views of the situation are equivalent. However, we
may now see that there is nothing special about the coordinates
(0, 0) in α-space. We could say the same thing
about any other coordinates. For example, we get a similar
non-accidental property when we consider an image
with affine coordinates (2, 3) in α-space (see figure 2).
Again, either the object that produces such an image is
planar, and always produces such an image, or the object
corresponds to a line in α-space that passes through the
point (2, 3). So the α-coordinates (2, 3), and any other
pair of α-coordinates, define a new non-accidental property,
and there is no qualitative difference between these
properties  and  colinearity  that  can  be  inferred  from  the 
fundamental  geometry  of  objects. 
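
The correspondence between colinearity and the point (0, 0) in α-space, and the analogous property at (2, 3), can be checked numerically. The sketch below assumes the convention that a point q has affine coordinates (α, β) when q = p1 + α(p2 - p1) + β(p3 - p1), which is consistent with the Figure 2 caption; the function names and the sample points are made up for illustration.

```python
from fractions import Fraction


def affine_coords(p1, p2, p3, q):
    """Solve q = p1 + a*(p2 - p1) + b*(p3 - p1) for (a, b) by Cramer's rule."""
    ux, uy = p2[0] - p1[0], p2[1] - p1[1]
    vx, vy = p3[0] - p1[0], p3[1] - p1[1]
    wx, wy = q[0] - p1[0], q[1] - p1[1]
    det = Fraction(ux * vy - uy * vx)
    if det == 0:
        raise ValueError("basis points are colinear")
    return (wx * vy - wy * vx) / det, (ux * wy - uy * wx) / det


def alpha_point(image):
    """Point in alpha-space for a five-point image: (alpha_4, alpha_5)."""
    p1, p2, p3, p4, p5 = image
    return affine_coords(p1, p2, p3, p4)[0], affine_coords(p1, p2, p3, p5)[0]


basis = [(0, 0), (4, 0), (1, 3)]
colinear = basis + [(2, 6), (-1, -3)]   # p4 and p5 lie on the line through p1, p3
shifted = basis + [(9, 3), (10, -6)]    # p4, p5 lie on two lines parallel to p1-p3
print(alpha_point(colinear), alpha_point(shifted))
```

In the second image the fourth and fifth points lie on lines parallel to the one through the first and third points, offset by 2 and 3 times the vector from the first to the second point. Geometrically this constraint is of the same kind as colinearity.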

More  generally,  other  non-accidental  properties,  such 
as  parallelism  or  skew  symmetry,  can  also  be  thought 
of  as  describing  regions  in  affine  space.  The  non- 
accidentalness  of  these  properties  is  equivalent  to  say¬ 
ing  that  a  planar  object  that  produces  these  properties 
from  one  viewpoint  will  always  produce  them,  while  a 
non-planar  object  that  can  produce  the  property  will  be 
equivalent  to  a  pair  of  lines  that  intersects  the  property 
in  only  a  measure  0  portion  of  its  extent.  If  two  prop¬ 
erties  correspond  to  similarly  shaped  regions  in  affine 
space,  then  there  will  be  no  qualitative  difference  be¬ 
tween  them. 

This  argument  does  not  imply  that  non-accidental 
properties  are  not  useful,  or  that  the  non-accidental 
properties  that  have  been  proposed  are  not  more  useful 
than  others.  It  simply  suggests  that  the  special  percep¬ 
tual  saliency  of  some  non-accidental  properties,  such  as 
colinearity,  lies  with  the  particular  nature  of  objects  that 
occur  in  the  world,  and  can  not  be  explained  by  some 
qualitatively  different  characteristic  of  the  geometry  of 
these  properties. 

4  Determining  the  Projected  Model 

Our  indexing  system  produces  matches  between  groups 
of  image  features  and  groups  of  model  features.  While 
there  are  standard  techniques  for  verifying  such  hypothe¬ 
ses,  our  representation  of  models  lends  itself  to  a  partic¬ 
ularly  simple  method  of  determining  the  projection  of 
the  model  into  the  image. 

A  match  produced  by  the  indexing  system  provides 
us with lines in α-space and β-space that represent the
model,  and  a  pair  of  points  in  these  spaces  for  the  image. 
By  finding  the  locations  on  the  lines  that  are  closest  to 
these  two  points,  we  find  the  error-free  set  of  affine  coor¬ 
dinates  that  best  fit  our  image.  We  then  can  use  any  one 
of  these  coordinates  to  determine  the  affine  coordinates 
of  any  point  on  the  projected  model,  since,  as  we  pointed 
out  above,  all  further  affine  coordinates  are  functions  of 
any  pair  of  affine  coordinates. 

Let us illustrate this with an example. Suppose indexing
matches model points m1, m2, m3, m4, m5, m6
to image points i1, i2, i3, i4, i5, i6, and suppose based on
this match we wish to project model points m1, m2, ... mn
into the image. At compile time, we use the points
m1, m2, m3 as a basis, and compute the two lines in the
affine spaces that describe all images of the model when
the images of these three points are used as a basis. (This
is done at compile time for all possible basis triples.) Call
these two lines L1 and L2. These lines are in (n - 3)-dimensional
affine spaces, (α4 ... αn) and (β4 ... βn), because
they represent the locations of n - 3 points using
the first three points as a basis. Our six matched image
points map to two points in the 3-dimensional affine
spaces (α4, α5, α6) and (β4, β5, β6). Call these points p1
and p2. By projecting L1 and L2 into these lower dimensional
spaces, we get lines that describe the possible images
that the first six model points can create. By finding
the point on the projection of L1 closest to p1 we find the
α coordinates of the image of the model that best match
the image points. Similarly, we find the appropriate β
values. These values determine locations on L1 and L2
that  tell  us  the  affine  coordinates  of  all  the  model  points 
in  the  image  that  will  best  fit  the  matched  image  points. 
Without  explicitly  computing  the  viewing  direction  we 
have  computed  the  appearance  of  the  model,  when  seen 
from  the  correct  viewing  direction.  (A  different  method 
must  be  used  if  the  matched  model  points  are  coplanar, 
because  in  that  case  their  affine  coordinates  provide  no 
information  about  viewing  direction). 
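
The closest-point step in this example can be sketched as below. It assumes each model line is stored as a point and a direction vector in the (n - 3)-dimensional space, with the matched points' coordinates first; `closest_param` and `project_model` are hypothetical names, and the same routine would be called once for the α line and once for the β line.

```python
def closest_param(line_pt, line_dir, target):
    """Parameter t minimizing the distance from line_pt + t*line_dir to target."""
    num = sum((x - p) * d for x, p, d in zip(target, line_pt, line_dir))
    den = sum(d * d for d in line_dir)
    if den == 0:
        raise ValueError("line direction vanishes in the matched coordinates")
    return num / den


def project_model(line_pt, line_dir, matched):
    """Affine coordinates of all model points that best fit the matched ones.

    line_pt, line_dir: the model's line in the (n - 3)-dimensional space.
    matched: coordinates measured in the image for the first len(matched)
             non-basis points (three values in the six-point example).
    """
    k = len(matched)
    # Project the line into the lower-dimensional space by truncation.
    t = closest_param(line_pt[:k], line_dir[:k], matched)
    # Evaluate the full line at the best-fitting parameter.
    return [p + t * d for p, d in zip(line_pt, line_dir)]
```

When the direction is zero over the matched coordinates, which is what happens for coplanar matched points, the sketch raises an error instead of guessing, in line with the caveat above.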

In  addition  to  determining  the  effects  of  the  viewing 
direction  on  the  image,  we  must  also  allow  for  the  effects 
of  the  affine  transformation  portion  of  the  projection. 
However,  once  we  have  determined  the  affine  coordinates 
of  all  the  projected  model  points,  it  is  straightforward  to 
apply  a  least  squares  method  to  find  the  affine  transfor¬ 
mation  that  optimally  aligns  the  image  points  with  the 
projected  model  points. 
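
A least squares fit of this kind can be sketched by solving the normal equations for each image coordinate. The code below is an illustration under assumptions, not the paper's procedure: it maps points given by their affine coordinates (α, β) to pixel positions (x, y), uses exact rational arithmetic, and the names `solve3` and `fit_affine` are invented here.

```python
from fractions import Fraction


def solve3(m, v):
    """Solve the 3x3 system m x = v exactly by Gauss-Jordan elimination."""
    a = [[Fraction(x) for x in row] + [Fraction(b)] for row, b in zip(m, v)]
    for c in range(3):
        pivot = next((r for r in range(c, 3) if a[r][c] != 0), None)
        if pivot is None:
            raise ValueError("degenerate point configuration")
        a[c], a[pivot] = a[pivot], a[c]
        scale = a[c][c]
        a[c] = [x / scale for x in a[c]]
        for r in range(3):
            if r != c:
                factor = a[r][c]
                a[r] = [x - factor * y for x, y in zip(a[r], a[c])]
    return [row[3] for row in a]


def fit_affine(src, dst):
    """Least-squares map x' = a*x + b*y + c, y' = d*x + e*y + f.

    src: affine coordinates of the projected model points.
    dst: corresponding image points. Returns ([a, b, c], [d, e, f]).
    """
    rows = [(x, y, 1) for x, y in src]
    normal = [[sum(r[i] * r[j] for r in rows) for j in range(3)] for i in range(3)]
    rhs_x = [sum(r[i] * q[0] for r, q in zip(rows, dst)) for i in range(3)]
    rhs_y = [sum(r[i] * q[1] for r, q in zip(rows, dst)) for i in range(3)]
    return solve3(normal, rhs_x), solve3(normal, rhs_y)
```

Applying the fitted map to the affine coordinates of every model point then gives its predicted image position.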

5  Experiments  with  an  Indexing 
System 

5.1  The  Recognition  System 

Our  indexing  system  is  designed  to  match  relatively  large 
groups  of  image  points  to  equally  large  groups  of  model 
points.  Such  an  approach  requires  a  bottom-up  group¬ 
ing  process  to  control  the  combinatorics  of  forming  all 
possible  large  groups  of  image  and  model  points.  While 
we  are  currently  interfacing  this  system  with  a  partic¬ 
ular  grouping  system,  our  purpose  here  is  to  examine 
the  performance  of  the  indexing  system  alone.  So  that 
our  results  will  be  independent  of  the  deficiencies  of  any 
specific  grouping  system,  we  have  tested  the  indexing 
system  with  somewhat  ideal  groups,  in  which  automati¬ 
cally  located  features  are  formed  into  groups  by  hand. 

We  begin  model  building  by  running  an  edge  detector 
on  many  images  of  the  object.  We  find  corner  features 
by  making  straight  line  approximations  to  the  edges,  and 
locating  a  corner  where  nearby  lines  have  a  stable  inter¬ 
section  point,  when  extended.  We  then  form  by  hand 
groups  of  three  to  five  points  that  are  formed  by  a  set 
of  convex  lines  (see  [11]  and  [10]  for  discussion  of  the 
value  of  convex  groups).  The  convexity  of  the  group  or¬ 
ders  the  points,  although  it  does  not  tell  us  which  point 
comes first. So a different group is formed for each possible
starting point. To allow for the effects of occlusion,
we also form groups in which any one of the points
is  omitted,  as  long  as  the  group  still  contains  at  least 
three  corners.  Corners  that  appear  in  these  groups  are 
matched  by  hand  between  the  images  used  to  build  a 
model. 

Since  three  to  five  points  do  not  provide  sufficient  in¬ 
formation  to  discriminate  between  models,  we  then  form 
all  pairs  of  these  groups.  Each  pair  of  groups  gives  us 
an ordered set of points. For each set of points, we calculate
the lines in α-space and β-space to which they
correspond. Each image that contains all the corners appearing
in the group is used to calculate two points, one
in α-space and one in β-space, that describe the ordered
group.  We  then  fit  lines  to  these  points,  to  determine 
the  set  of  all  images  that  the  group  could  produce. 

We then compute which buckets of discretized α-space
and β-space these lines intersect, and make an entry in
each  bucket,  pointing  to  the  appropriate  group  of  model 
points. 

To  build  a  model  of  the  object’s  line  segments  we 
match  by  hand  corners  that  are  at  the  end  points  of 
line  segments  that  appear  in  a  number  of  images  of  the 
model.  Then,  for  every  triple  of  corners  used  as  a  basis 
for  one  of  the  groups  entered  into  the  lookup  table,  we 
form lines in the two affine spaces representing the possible
affine coordinates of the corners of all the model's
line  segments.  We  use  a  new  hash  table  to  store  these 
pairs  of  lines  for  easy  access. 

At  run  time,  we  take  a  new  picture  of  the  model,  along 
with  occluding  objects.  We  form  groups  from  the  cor¬ 
ners  of  this  picture,  just  as  we  did  for  the  images  used 
to  build  the  model.  These  groups  may  be  missing  some 
of  the  corners  that  appear  in  the  modeled  groups,  due 
to occlusion or due to failure in the corner finder. Next,
as  before,  we  form  all  pairs  of  these  groups.  We  have 
some  freedom  in  how  we  order  the  points  in  these  pairs 
of groups before using them for lookup. In building the
model, each point in a group was used as a starting point
for  that  group.  That  point,  and  the  next  two,  were  used 
as  a  basis  for  computing  the  affine  coordinates  of  the 
remaining  points.  Since  we  may  use  any  point  in  the  im¬ 
age  group  as  a  starting  point,  we  select  the  point  which 
gives  us  the  most  stable  set  of  basis  points.  Then,  we 
use  this  ordered  set  of  image  points  to  compute  two  rect- 
anguloids  in  the  affine  spaces,  as  described  above,  and 
find  all  matching  sets  of  model  points.  Since  we  can  only 
represent  a  finite  portion  of  the  affine  spaces,  it  is  possi¬ 
ble  that  the  rectanguloids  produced  by  a  group  will  fall 
outside  the  bounds  of  this  portion  of  affine  space.  We  ig¬ 
nore  such  groups,  and  in  fact,  groups  that  produce  large 
affine  coordinates  are  likely  to  be  unstable,  providing 
poor  candidates  for  matching. 

We  perform  this  indexing  for  each  pair  of  image 
groups.  Ordering  these  pairs  so  that  we  start  with  the 
pair  that  produces  the  fewest  matches,  we  then  perform 
verification  on  each  match  until  the  object  is  found.  We 
use  the  method  described  above  to  determine  the  pro¬ 
jection  of  the  model’s  line  segments  for  each  match.  We 
then  search  the  image  for  line  segments  that  are  near  the 
model’s  projected  line  segments,  and  that  have  roughly 
the same orientation, in order to determine the fraction
of the model which the image can explain. We stop when
a sufficiently good hypothesis is found.

Figure 3: Edges from two of the pictures used to build a
model of the phone. Circles indicate the location of automatically
found corners that were then selected by hand
to be in the model. They are numbered for reference.

Note  that  in  general,  when  determining  the  appear¬ 
ance  of  a  model  from  a  given  viewpoint,  we  should  elim¬ 
inate  lines  that  are  not  visible  from  that  viewpoint.  To 
avoid  the  need  for  this  and  simplify  our  verification  sys¬ 
tem,  we  have  taken  all  images  from  a  single  aspect  of  the 
object.  That  is,  we  restricted  our  viewpoint  to  about  a 
quarter  of  the  viewing  hemisphere,  in  which  all  the  same 
set  of  points  and  lines  were  visible. 

5.2  Experiments 

In  experimenting  with  the  above  recognition  system,  we 
have  used  the  following  values  for  various  parameters. 
We  allowed  image  error  of  five  pixels  in  indexing.  The 
index table represents all affine coordinates between -25
and  25.  Each  dimension  of  the  table  is  divided  into 
100  intervals.  We  make  the  buckets  of  uniform  size  for 
affine  coordinates  between  0  and  1,  and  then  increase 
the  bucket  size  linearly  for  coordinates  above  1  or  be¬ 
low  0.  We  do  this  because  error  has  a  greater  effect  on 
images  with  higher  affine  coordinates,  at  approximately 
this  rate.  A  projected  model  line  is  matched  to  an  image 
line  if  the  image  line  is  no  more  than  10  pixels  from  the 
model  line,  and  if  the  orientations  of  the  lines  differ  by 
no  more  than  A  hypothesis  is  accepted  if  it  accounts 
for  at  least  50%  of  the  model’s  lines. 
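
The acceptance test can be sketched as follows. Several details are assumptions made here because the text does not fix them: the distance between two lines is taken from the midpoint of the image segment to the model segment, the orientation threshold `max_angle` is a required argument because its value is not legible in the source, and the function names are hypothetical.

```python
import math


def point_segment_distance(p, a, b):
    """Euclidean distance from point p to the segment from a to b."""
    ax, ay = b[0] - a[0], b[1] - a[1]
    length2 = ax * ax + ay * ay
    if length2 == 0:
        return math.dist(p, a)
    t = ((p[0] - a[0]) * ax + (p[1] - a[1]) * ay) / length2
    t = max(0.0, min(1.0, t))  # clamp to the segment
    return math.dist(p, (a[0] + t * ax, a[1] + t * ay))


def orientation_difference(s1, s2):
    """Smallest angle between two undirected segments, in radians."""
    a1 = math.atan2(s1[1][1] - s1[0][1], s1[1][0] - s1[0][0])
    a2 = math.atan2(s2[1][1] - s2[0][1], s2[1][0] - s2[0][0])
    diff = abs(a1 - a2) % math.pi
    return min(diff, math.pi - diff)


def segments_match(model_seg, image_seg, max_angle, max_dist=10.0):
    """Assumed criterion: midpoint within max_dist pixels, similar orientation."""
    mid = ((image_seg[0][0] + image_seg[1][0]) / 2,
           (image_seg[0][1] + image_seg[1][1]) / 2)
    return (point_segment_distance(mid, *model_seg) <= max_dist
            and orientation_difference(model_seg, image_seg) <= max_angle)


def accept(model_segs, image_segs, max_angle, min_fraction=0.5):
    """Accept if at least min_fraction of the model's lines are explained."""
    explained = sum(
        any(segments_match(m, s, max_angle) for s in image_segs)
        for m in model_segs
    )
    return explained / len(model_segs) >= min_fraction
```

The defaults of 10 pixels and 50% follow the parameter values given above.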

We  have  performed  two  sets  of  experiments.  In  one, 
we  build  models  and  recognize  them,  as  described  above. 
In  the  second,  we  perform  indexing  using  randomly  gen¬ 
erated  models  and  images.  This  provides  a  means  of 
more  carefully  measuring  the  discriminatory  power  of 
our  system. 

Figure  3  shows  the  edges  found  in  two  pictures  of  a 
telephone,  and  15  of  the  corners  that  are  located  from 
these  edges.  The  corners  are  numbered  for  reference.  We 
form groups containing the following sets of corners: ((0
1 2 3 4) (1 2 3 13) (1 0 9 10) (14 15 16 18) (11 9 10 17)
(12 13 3 4) (11 17 19)), and for verification we use line
segments with the following corners as end points: ((0
1) (1 2) (2 3) (3 4) (0 9) (9 10) (9 11) (11 17) (10 17)
(3 13) (13 12) (12 4) (14 15) (15 16) (16 18) (18 14)).
Out  of  these  primitive  groups,  we  formed  pairs  of  model 
groups  for  entry  in  the  index  table,  as  described  above. 
Figure 4: On the top are two scenes containing the
phone. Underneath is the first hypothetical projection
of the model considered by the recognition system. Both
are correct. In the hypotheses, edges are shown as dotted
lines, projected line segments appear as lines, circles
represent the image corners in the match, and squares
show the location of the projected model corners.


There  were  an  average  of  472  table  entries  for  each  pair 
of  groups. 

Figure  4  shows  pictures  containing  the  telephone,  and 
the  first  hypothesis  of  the  recognition  system  as  to  the 
location  of  the  phone.  For  both  pictures,  the  first  hy¬ 
pothesis  is  correctly  accepted.  In  all,  the  system  was 
tested  on  three  scenes.  Fourteen  groups  of  corners  were 
formed  from  these  images.  Indexing  for  two  of  the  groups 
required  going  outside  the  bounds  of  the  lookup  table,  so 
these  groups  were  ignored.  Indexing  produced  a  correct 
match for all of the remaining twelve groups. Five
groups  contained  seven  corners  each,  and  seven  groups 
contained  six  corners  each.  The  groups  with  seven  cor¬ 
ners  produced  a  total  of  two  incorrect  matches  in  addi¬ 
tion  to  the  correct  matches.  Each  group  could  poten¬ 
tially  be  matched  to  any  of  1,386  groups  represented  in 
the  lookup  table  that  contained  seven  points.  So  index¬ 
ing  reduced  the  number  of  incorrect  matches  by  a  factor 
of  3,465.  The  groups  with  six  corners  produced  an  av¬ 
erage  of  16.7  incorrect  matches,  compared  to  the  2,931 
groups  in  the  table  they  could  match.  This  implies  an 
average  speedup  of  a  factor  of  244.  We  can  see  that  in¬ 
dexing  can  produce  tremendous  savings  in  time  in  this 
domain,  particularly  as  larger  groups  are  formed,  and 
when  we  may  choose  among  several  groups,  using  the 
most  distinctive  group  first. 
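
The reduction factors quoted above can be checked with a few lines of arithmetic. The function name below is introduced only for this check.

```python
def pooled_reduction(n_groups, table_size, total_incorrect):
    """Possible matches divided by incorrect matches, pooled over all groups."""
    return n_groups * table_size / total_incorrect


# Seven-corner groups: 5 groups, 1,386 table groups each, 2 incorrect matches.
seven = pooled_reduction(5, 1386, 2)

# Six-corner groups: ratio of the table size to the average incorrect matches.
six = 2931 / 16.7
```

The seven-corner figure reproduces the stated factor of 3,465. For six-corner groups the ratio of the two averages is about 175.5, not the reported 244. One possible reading, which the text does not confirm, is that 244 is the average of the per-group ratios; such an average is never smaller than the ratio of the averages.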

For  the  synthetic  experiments,  we  generated  random 
model  groups  by  selecting  points  at  random  in  a  cube, 
and  we  generated  random  image  groups  by  selecting 
points  in  a  square,  in  which  the  minimum  and  maxi¬ 
mum  distance  between  the  points  differed  by  no  more 
than  a  factor  of  10.  We  then  matched  different  order¬ 
ings  of  these  groups  using  our  indexing  system,  using 
the same parameters as above. This allowed us to estimate
the likelihood that an image group will match a
model group by chance, as we vary the size of the group,
and the amount of the error. For each set of values we
generated 85 model groups, and for each model group
we generated 100 image groups. The results are summarized
in figure 5. For error of five pixels, these results
are similar to our results with real images. They indicate
how dramatically speedups increase with group size.

Figure 5: Experiments with randomly generated models
and images. Points are divided into two groups, and
all orderings of these groups are considered. Error is
in number of pixels. We give the percentage of images
for which error would require them to access the index
table outside its boundaries. These images were ignored.
Speedup indicates the ratio of total possible matches to
matches produced. "X <" indicates that no matches
were found, out of X possible matches.
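
The generation of synthetic groups can be sketched with rejection sampling. The cube and square side lengths, the seed, the group size, and the function names are assumptions made for illustration; the factor-of-10 limit on the distance ratio is applied to image groups only, which is one reading of the description above.

```python
import math
import random
from itertools import combinations


def distance_ratio(points):
    """Largest pairwise distance divided by the smallest."""
    dists = [math.dist(p, q) for p, q in combinations(points, 2)]
    return math.inf if min(dists) == 0 else max(dists) / min(dists)


def random_group(n, dim, side, rng, max_ratio=None):
    """Sample n points uniformly in a dim-dimensional cube of the given side.

    If max_ratio is set, resample until the pairwise distance ratio is
    no larger than max_ratio.
    """
    while True:
        pts = [tuple(rng.uniform(0, side) for _ in range(dim)) for _ in range(n)]
        if max_ratio is None or distance_ratio(pts) <= max_ratio:
            return pts


rng = random.Random(0)
model_group = random_group(6, 3, 1.0, rng)                   # points in a cube
image_group = random_group(6, 2, 500.0, rng, max_ratio=10)   # points in a square
```

Repeating this for 85 model groups and 100 image groups per model group would reproduce the sample sizes used in the experiment.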

6  Previous  Work 

Our  work  is  based  on  a  general  approach  to  recognition 
that  links  grouping  and  indexing,  an  approach  developed 
by  [15],  and  later  taken  in:  [11],  [17],  [10],  [21],  [4],  and 
[5]. 

A  number  of  recent  indexing  systems  for  recognizing 
planar  objects,  or  restricted  sets  of  3D  objects,  from  arbi¬ 
trary  3D  views  have  been  based  on  invariant  functions  of 
the image [22], [13], [14], [7], [8], [20]. For example, as we
have  noted,  the  affine  coordinates  of  a  planar  model  are 
invariant  under  affine  transformations.  [13], [14]  use  this 
invariance  in  their  system.  Invariants  allow  these  sys¬ 
tems  to  represent  the  image  groups  that  a  model  group 
may  produce  using  a  single  point  in  index  space. 

However,  as  previously  mentioned,  [3],  [4],  and  [16] 
have  recently  shown  that  there  are  no  non-trivial  invari¬ 
ants  when  models  may  consist  of  arbitrary  collections  of 
point  features.  [16]  includes  further,  related  results. 

In  [4]  it  is  proven  that  in  this  domain  each  model  must 
be  represented  by  a  2D  surface  if  a  single  lookup  ta¬ 
ble  is  used.  In  that  paper,  and  in  [18],  [14],  and  [2], 
lookup tables are constructed that sample this 2D surface
by sampling the 2D viewing sphere. These systems
suffered from two problems. First, the use of sampling
instead of analytic methods raises some difficulties.
In [4] we found that to perform sufficient sampling required
excessive computation both at compile time and
at run time. And sampling can result in missing correct
matches ([2] discusses a way of bounding this error).
Second, it requires a good deal of space to represent
these 2D surfaces. For example, [4] required over
5,000  table  entries  to  represent  a  model  group  of  five 
points,  and  [18]  required  2,500  entries  to  represent  pairs 
of  vertices.  Our  current  system  requires  about  472  table 
entries  for  groups  containing  mostly  six  or  seven  points. 
This comparison may be misleading, however. For one
thing,  space  constraints  required  the  system  described 
in  [4]  to  discretize  the  index  space  less  finely  than  one 
would  like.  Each  dimension  of  that  table  was  divided 
into  40  parts,  compared  to  100  parts  in  the  current  sys¬ 
tem.  Also  in  general,  space  requirements  grow  as  group 
size  grows.  Finally,  building  the  lookup  table  with  sam¬ 
pling  undoubtedly  resulted  in  many  table  entries  being 
missed. 

Our  system  is  also  related  to  the  linear  combinations 
work  of  [19].  In  this  paper,  a  method  of  projection  equiv¬ 
alent  to  ours  is  used,  and  it  is  shown  that  any  image  of  an 
object  is  a  linear  combination  of  two  independent  images 
of  that  object,  which  is  also  easily  seen  from  our  result. 
The  focus  of  [19]  is  on  using  this  result  to  determine  the 
projection  of  the  model  from  a  new  viewpoint,  in  a  man¬ 
ner  similar  to  that  described  here.  However,  the  results 
of  [19]  are  not  directly  useful  for  an  indexing  system, 
because  the  high  dimensionality  of  the  linear  subspace 
described  would  require  excessive  storage  space  in  an  in¬ 
dex  table. 

7  Conclusions 

We  have  presented  a  general  method  for  representing 
3D  models  of  point  features  in  an  indexing  table  in  or¬ 
der  to  quickly  match  them  to  2D  images.  In  terms  of  the 
dimensionality  of  the  space  required,  this  method  is  op¬ 
timal.  We  also  present  analytic  methods  of  building  and 
accessing  the  table  which  ensure  that  all  correct  matches 
are  found.  Together,  we  use  these  results  to  build  a  fast 
and  robust  recognition  system. 

In  addition,  we  feel  that  this  work  is  valuable  because 
it  reduces  recognition  to  an  extremely  simple  geometric 
problem.  We  show  that  at  its  core,  the  problem  of  rec¬ 
ognizing  an  object  in  the  absence  of  error  is  equivalent 
to  determining  which  lines  a  point  falls  on.  When  er¬ 
ror  is  present,  recognition  is  equivalent  to  finding  which 
lines  intersect  a  volume.  We  use  this  view  of  recognition 
to  produce  a  novel  analysis  of  non-accidental  properties, 
and  to  answer  some  outstanding  questions  about  invari¬ 
ants. 
ABSTRACT

Most  current  object  recognition  systems  are 
based  on  a  3D  model  which  is  used  to  describe 
the  image  projection  of  an  object  over  all  view¬ 
points.  In  this  paper  we  introduce  a  new  tech¬ 
nique  which  can  predict  the  geometry  of  an  ob¬ 
ject  under  projective  transformation.  The  ob¬ 
ject  geometry  is  represented  by  a  set  of  corre¬ 
sponding  features  taken  from  two  views.  The 
projected  geometry  can  be  constructed  in  any 
third  view,  using  a  viewpoint  invariant  derived 
from  the  correspondences. 

1  Introduction 

The  central  focus  of  object  recognition  research 
over  the  last  decade  has  been  on  the  use  of  3D 
geometric  models  as  a  representation  of  objects. 
These  geometric  models  are  assumed  to  char¬ 
acterize  the  significant  features  of  the  object. 
The  main  idea  is  that  the  model  geometry  is 
independent  of  viewpoint  and  that  the  model 
can  be  used  to  predict  features  easily  extracted 
from  the  image.  In  current  practice,  the  model 
and  image  features  are  represented  by  points 
and  lines.  The  image  is  initially  processed  to 
extract  a  2D  geometric  description  of  the  en¬ 
tire  image  plane.  Then  conceptually,  the  3D 
model  is  projected  over  all  viewpoints  and  the 
projected  model  geometry  is  compared  with  the 
geometric  features  extracted  from  the  image. 


In  order  to  recognize  a  number  of  classes  of  ob¬ 
jects  it  is  necessary  to  develop  a  3D  geomet¬ 
ric  model  for  each  one.  In  some  systems,  the 
modeling  is  done  directly  with  a  standard  CAD 
modeling package. In other systems it is possible
to acquire model representations, with some
manual interaction, directly from multiple image
views of the object. These approaches are
quite time consuming and it is often difficult
to construct a valid three dimensional model so
that the topological connections between faces,
edges and vertices are all correct.

In  this  paper  we  focus  on  the  issue  of  model 
construction.  Instead  of  constructing  a  full 
3D  representation  of  the  object,  we  develop  a 
method  which  exploits  correspondences  between 
two  views  of  the  object.  As  we  will  show,  it 
is  sufficient  to  determine  eight  corresponding 
points  between  two  model  images  of  an  object 
and  then  the  projection  of  the  object  in  a  new 
third  view  can  be  determined  from  eight  addi¬ 
tional  correspondences.  The  technique  is  based 
on the general notion of viewpoint invariance [2,
3, 5] where geometric properties of an object are
derived which are invariant to projective transformation.

This approach is similar to the work of Koenderink
[6] who has described a technique for
transferring the shape of a three dimensional
configuration of points under an affine transformation.
He uses two views of the points to
define the 3D affine coordinates of the points.
The  correspondence  between  four  points  in  two 
images  is  sufficient  to  determine  an  affine  refer¬ 
ence  frame  for  all  of  the  remaining  points.  Once 
such  a  frame  is  established,  it  is  then  possible 
to  transfer  the  projection  of  the  set  of  points 
to  any  new  viewpoint  by  determining  four  new 
corresponding  points.  Our  techniques  use  the 
linear  fractional  or  perspective  transformation, 
rather  than  the  affine  transformation,  in  order 
to  model  the  camera  geometry  more  accurately. 

The central idea is that one can construct a pair
of 2D models for an object (one from each of
the model views) and then use these models for
registration and recognition in any third view of
the object.* We develop the idea by starting with
projections in one dimension and proceeding to
the two and three dimensional cases.

*There is no notion of occlusion in this process,
so that more than two model views will be required
in general to capture all of the features of the object.


2 Model Transfer

2.1 Projection on a Line

Consider a line with three given points,
$(p_1, p_2, p_3)$, and a general point, $x$. It is a well
known result in projective geometry that the
cross ratio, $C_r$, is preserved under projective
mappings of the line. For example, in Figure
1 the points are mapped by a central projection
onto another line. The projected points are denoted
$(\bar{p}_1, \bar{p}_2, \bar{p}_3)$ and $\bar{x}$. The cross ratio of the
four points on both lines is a scalar invariant,
i.e.,

$$C_r = \frac{(x - p_1)(p_2 - p_3)}{(x - p_2)(p_1 - p_3)} = \frac{(\bar{x} - \bar{p}_1)(\bar{p}_2 - \bar{p}_3)}{(\bar{x} - \bar{p}_2)(\bar{p}_1 - \bar{p}_3)} \qquad (1)$$

This equation relates the position of point $x$ with the
position of point $\bar{x}$, in terms of the point correspondences,
$[(p_1, \bar{p}_1), (p_2, \bar{p}_2), (p_3, \bar{p}_3)]$. Equation
1 can be rewritten as a bilinear transformation
relating $\bar{x}$ and $x$ as follows,

$$\bar{x} = \frac{\alpha x + \beta}{\gamma x + \delta} \qquad (2)$$

Thus three correspondences define a mapping
between two lines as a projective transformation.
Once the correspondences are known, any
number of points $x_i$ can be transformed by equation 2.

2.2 Model Transfer in Two Dimensions

The same process as just described for transforming
points on a projective line can be extended
to the case of mapping between projective
planes. When the points are represented
in homogeneous coordinates, an arbitrary projective
transformation between planes is a linear
form, $\bar{X} = TX$, where $X = (x, y, 1)^t$, and $T$ is a
3x3 matrix. We can define an invariant similar
to the cross ratio on the line as follows.

Let $(p, q)$ be the point whose referent $(\bar{p}, \bar{q})$ in
the new image is to be determined. The cross-ratio
of determinants is known to be a projective
invariant. Therefore:
$$C(1,2,3,4,p,q) \qquad (3)$$

$$C(1,2,3,4,\bar{p},\bar{q}) \qquad (4)$$

The invariant condition that

$$C(1,2,3,4,p,q) = C(1,2,3,4,\bar{p},\bar{q}) \qquad (5)$$

determines a linear relationship between the unknown
quantities $(\bar{p}, \bar{q})$ and all the other variables,
which are known.

By interchanging points 1 and 2 in this procedure,
a second equation is generated:

$$C(2,1,3,4,p,q) = C(2,1,3,4,\bar{p},\bar{q}). \qquad (6)$$

The intersection of these lines determines the
point $(\bar{p}, \bar{q})$, as illustrated in Figure 2. The
quantities $C(1,2,3,4,p,q)$ and $C(2,1,3,4,p,q)$
are projective coordinates (with respect to the
reference points 1, 2, 3, 4) for the point $(p, q)$.
This transfer technique exploits the invariance
of these projective coordinates.
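The two-line construction can be illustrated with a short sketch. The exact determinant ratio behind C did not survive in this text, so the code assumes one standard five-point invariant built from 3x3 determinants of homogeneous points; the function names (`det3`, `invariant`, `constraint_line`, `transfer_2d`) are hypothetical and chosen only for illustration.

```python
def det3(a, b, c):
    """Determinant of the 3x3 matrix whose columns are homogeneous points a, b, c."""
    return (a[0] * (b[1] * c[2] - b[2] * c[1])
            - a[1] * (b[0] * c[2] - b[2] * c[0])
            + a[2] * (b[0] * c[1] - b[1] * c[0]))


def cross(a, b):
    """Cross product of two 3-vectors."""
    return (a[1] * b[2] - a[2] * b[1],
            a[2] * b[0] - a[0] * b[2],
            a[0] * b[1] - a[1] * b[0])


def hom(pt):
    """Lift an image point (x, y) to homogeneous coordinates (x, y, 1)."""
    return (pt[0], pt[1], 1.0)


def invariant(x1, x2, x3, x4, x5):
    """Assumed five-point invariant: a cross-ratio of determinants."""
    return (det3(x4, x3, x1) * det3(x5, x2, x1)
            / (det3(x4, x2, x1) * det3(x5, x3, x1)))


def constraint_line(inv, y1, y2, y3, y4):
    """Line on which the unknown fifth point must lie in the new image.

    Since det3(y5, yj, yk) = y5 . (yj x yk), the condition
    inv * |y4 y2 y1| * |y5 y3 y1| - |y4 y3 y1| * |y5 y2 y1| = 0
    is linear in the unknown point y5.
    """
    a = inv * det3(y4, y2, y1)
    b = det3(y4, y3, y1)
    c31 = cross(y3, y1)
    c21 = cross(y2, y1)
    return tuple(a * u - b * v for u, v in zip(c31, c21))


def transfer_2d(ref, new_ref, point):
    """Transfer `point` using four reference correspondences ref -> new_ref."""
    x1, x2, x3, x4 = (hom(r) for r in ref)
    y1, y2, y3, y4 = (hom(r) for r in new_ref)
    x5 = hom(point)
    # First invariant, then the same invariant with points 1 and 2 interchanged.
    line_a = constraint_line(invariant(x1, x2, x3, x4, x5), y1, y2, y3, y4)
    line_b = constraint_line(invariant(x2, x1, x3, x4, x5), y2, y1, y3, y4)
    w = cross(line_a, line_b)  # homogeneous intersection of the two lines
    return (w[0] / w[2], w[1] / w[2])
```

With this choice of invariant, the first line passes through reference point 1 and the second through reference point 2 in the new image, so the construction degenerates when the transferred point lies on the line joining those two reference points.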




Figure 1: The projection of one line onto another. Four points define an invariant cross ratio.


Note that this method requires neither knowledge
of the reference objects $x_j$, $j = 1 \ldots 4$,
nor of the imaging system parameters. The
transformation is defined by four accurately determined
correspondences between reference image
points $(p_j, q_j)$, $j = 1 \ldots 4$, and their corresponding
points $(\bar{p}_j, \bar{q}_j)$, $j = 1 \ldots 4$, in the new
image.


3  Model  Transfer  From  Two 
Views  in  Three  Dimensions 

Suppose we have two reference images and wish
to transfer the corresponding reference image
point pairs, $(p, q)$, $(\bar{p}, \bar{q})$, to a new image point
$(w, s)$. The camera models for the two images
are not given, nor are there ground control points
available to estimate these models. Model transfer
can still be accomplished by a method involving
projective invariants derived for two views
with eight correspondences: $(p_j, q_j)$, $(\bar{p}_j, \bar{q}_j)$, $j = 1 \ldots 8$,
between the two reference images, and
$(w_j, s_j)$, $j = 1 \ldots 8$, on a new third image. The
following development is a natural extension of
the eight point camera motion result of Longuet-Higgins
[7]. He demonstrated that the relative
motion between two views of a calibrated camera
can be determined by linear methods from
eight point correspondences. Thus any ninth
point can be projected from the known camera
motion. The result here is a significant extension
in that no camera calibration is required.

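Figure 2: Transfer of a point in two dimensions using the five-point invariants. The two invariants
define a pair of linear constraints which define the point.


The projection equations of two perspective
cameras can be manipulated to the following
form. In the case of two images of the same
point, the general form of these equations is:

$$
\left[
\underbrace{\begin{pmatrix}
a_{11} & a_{12} & a_{13} & a_{14} \\
a_{21} & a_{22} & a_{23} & a_{24} \\
\bar{a}_{11} & \bar{a}_{12} & \bar{a}_{13} & \bar{a}_{14} \\
\bar{a}_{21} & \bar{a}_{22} & \bar{a}_{23} & \bar{a}_{24}
\end{pmatrix}}_{A}
-
\begin{pmatrix}
p & & & \\
 & q & & \\
 & & \bar{p} & \\
 & & & \bar{q}
\end{pmatrix}
\underbrace{\begin{pmatrix}
b_1 & b_2 & b_3 & 1 \\
b_1 & b_2 & b_3 & 1 \\
\bar{b}_1 & \bar{b}_2 & \bar{b}_3 & 1 \\
\bar{b}_1 & \bar{b}_2 & \bar{b}_3 & 1
\end{pmatrix}}_{B}
\right]
\begin{pmatrix} x \\ y \\ z \\ 1 \end{pmatrix}
= 0 \qquad (8)
$$

From the latter equation we infer that:

$$\det\left[ A - \mathrm{diag}(p, q, \bar{p}, \bar{q})\, B \right] = 0 \qquad (9)$$

Note that $A$ and $B$ contain parameters associated
with the imaging geometry of the two cameras
and $(p, q)$, $(\bar{p}, \bar{q})$ are corresponding points in
the left and right images. We observe that, expanding
the determinant in terms of the variables
$(p, q, \bar{p}, \bar{q})$ we have:

$$0 = \alpha_1 p q \bar{p} \bar{q} + \alpha_2 p q \bar{p} + \cdots + \alpha_{14} \bar{p} + \alpha_{15} \bar{q} + \alpha_{16} \qquad (10)$$

Examining the minor determinants, we observe
that $\alpha_1 \ldots \alpha_7 = 0$ since they involve repeated
instances of the same row-vector (in $\alpha_1$ for example,
the row-vector $b$ is repeated twice, as is
$\bar{b}$).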

Thus the vectors:

$$(\alpha_8, \alpha_9, \alpha_{10}, \alpha_{11}, \alpha_{12}, \alpha_{13}, \alpha_{14}, \alpha_{15}, \alpha_{16})$$

and

$$(p_j \bar{p}_j,\; p_j \bar{q}_j,\; q_j \bar{p}_j,\; q_j \bar{q}_j,\; p_j,\; q_j,\; \bar{p}_j,\; \bar{q}_j,\; 1)$$

are orthogonal for every $j$. Similarly a nine-component
vector $\beta$, orthogonal to the vector:

$$(p_j w_j,\; p_j s_j,\; q_j w_j,\; q_j s_j,\; p_j,\; q_j,\; w_j,\; s_j,\; 1)$$

and a third nine-component vector $\gamma$, orthogonal
to the vector:

$$(\bar{p}_j w_j,\; \bar{p}_j s_j,\; \bar{q}_j w_j,\; \bar{q}_j s_j,\; \bar{p}_j,\; \bar{q}_j,\; w_j,\; s_j,\; 1)$$

can be found. The components of these vectors
may be determined from the relations, given
eight corresponding points in each image:
and similar relations for $\beta_8 \ldots \beta_{16}$ and $\gamma_8 \ldots \gamma_{16}$
are determined from combinations of $(p_j, q_j)$
with $(w_j, s_j)$ and $(\bar{p}_j, \bar{q}_j)$ with $(w_j, s_j)$, respectively.

The $\alpha$, $\beta$ and $\gamma$ vectors are readily seen to be
proportional to minors of these determinants.
Once $\beta$ and $\gamma$ have been determined, we observe
that transfer can be accomplished, as follows:
Suppose that a corresponding point pair $(p, q)$
and $(\bar{p}, \bar{q})$ has been selected from the reference
images. We wish to determine the point $(w, s)$ in
the new image corresponding to these reference
points. We observe that the relation:

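$$0 = \beta_8 p w + \beta_9 p s + \beta_{10} q w + \beta_{11} q s + \beta_{12} p + \beta_{13} q + \beta_{14} w + \beta_{15} s + \beta_{16} \qquad (11)$$

when evaluated, equals:

$$(\beta_8 p + \beta_{10} q + \beta_{14})\, w + (\beta_9 p + \beta_{11} q + \beta_{15})\, s + (\beta_{12} p + \beta_{13} q + \beta_{16}) = 0, \qquad (12)$$

i.e. the $(w, s)$ solution lies on the line in the
$(w, s)$ image plane.

Similarly, using the second reference image we
observe the second relation:

$$0 = \gamma_8 \bar{p} w + \gamma_9 \bar{p} s + \gamma_{10} \bar{q} w + \gamma_{11} \bar{q} s + \gamma_{12} \bar{p} + \gamma_{13} \bar{q} + \gamma_{14} w + \gamma_{15} s + \gamma_{16} \qquad (13)$$

determines a second line:

$$(\gamma_8 \bar{p} + \gamma_{10} \bar{q} + \gamma_{14})\, w + (\gamma_9 \bar{p} + \gamma_{11} \bar{q} + \gamma_{15})\, s + (\gamma_{12} \bar{p} + \gamma_{13} \bar{q} + \gamma_{16}) = 0 \qquad (14)$$

in the $(w, s)$ image plane. The intersection of
these lines is the solution.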
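The transfer step itself reduces to intersecting two lines. The following sketch assumes the two nine-component transfer vectors are already known and are ordered like the monomial vectors above (components 8 through 16); the names `transfer_line`, `intersect_lines` and `transfer_point` are hypothetical.

```python
def transfer_line(vec, u, v):
    """Line a*w + b*s + c = 0 in the new image induced by reference point (u, v).

    `vec` holds the nine coefficients of the monomials
    (u*w, u*s, v*w, v*s, u, v, w, s, 1), in that order.
    """
    k8, k9, k10, k11, k12, k13, k14, k15, k16 = vec
    a = k8 * u + k10 * v + k14   # coefficient of w
    b = k9 * u + k11 * v + k15   # coefficient of s
    c = k12 * u + k13 * v + k16  # constant term
    return (a, b, c)


def intersect_lines(l1, l2):
    """Intersection of two lines given as (a, b, c) with a*w + b*s + c = 0."""
    a1, b1, c1 = l1
    a2, b2, c2 = l2
    d = a1 * b2 - a2 * b1
    if abs(d) < 1e-12:
        raise ValueError("transfer lines are (nearly) parallel")
    w = (b1 * c2 - b2 * c1) / d
    s = (c1 * a2 - c2 * a1) / d
    return (w, s)


def transfer_point(beta, gamma, pt1, pt2):
    """Transfer a corresponding pair pt1 = (p, q), pt2 = (p_bar, q_bar)."""
    return intersect_lines(transfer_line(beta, *pt1),
                           transfer_line(gamma, *pt2))
```

The parallel-line check is a practical guard added here; it flags configurations in which the two constraints do not determine a unique point.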

Note that additional correspondences can be
used to define the transfer vectors $\alpha$, $\beta$, $\gamma$. The
solution is still linear and formulated in terms of
the generalized inverse of a non-square matrix.
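A minimal sketch of such an overdetermined solution follows. It assumes the last component of the transfer vector is non-zero and fixes it to 1, which turns the orthogonality conditions into an inhomogeneous linear system solved through the normal equations (equivalent to applying the generalized inverse when the system has full column rank). That normalization, and the names `solve_linear` and `estimate_transfer_vector`, are assumptions made for illustration.

```python
def solve_linear(m, rhs):
    """Solve the square system m x = rhs by Gaussian elimination with partial pivoting."""
    n = len(rhs)
    aug = [list(row) + [r] for row, r in zip(m, rhs)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(aug[r][col]))
        if abs(aug[pivot][col]) < 1e-12:
            raise ValueError("degenerate configuration of correspondences")
        aug[col], aug[pivot] = aug[pivot], aug[col]
        for r in range(col + 1, n):
            f = aug[r][col] / aug[col][col]
            for c in range(col, n + 1):
                aug[r][c] -= f * aug[col][c]
    x = [0.0] * n
    for r in range(n - 1, -1, -1):
        acc = aug[r][n] - sum(aug[r][c] * x[c] for c in range(r + 1, n))
        x[r] = acc / aug[r][r]
    return x


def estimate_transfer_vector(ref_pts, new_pts):
    """Least-squares transfer vector from N >= 8 correspondences.

    ref_pts: points (u, v) in a reference image.
    new_pts: corresponding points (w, s) in the new image.
    Returns nine coefficients for (u*w, u*s, v*w, v*s, u, v, w, s, 1),
    with the last coefficient fixed to 1 (an assumed normalization).
    """
    if len(ref_pts) != len(new_pts) or len(ref_pts) < 8:
        raise ValueError("need at least eight correspondences")
    rows = [[u * w, u * s, v * w, v * s, u, v, w, s]
            for (u, v), (w, s) in zip(ref_pts, new_pts)]
    # Normal equations (A^T A) x = A^T b with b = -1 for every row.
    ata = [[sum(r[i] * r[j] for r in rows) for j in range(8)] for i in range(8)]
    atb = [-sum(r[i] for r in rows) for i in range(8)]
    return solve_linear(ata, atb) + [1.0]
```

A singular value decomposition of the full N x 9 system would avoid the normalization assumption and is generally better conditioned; the normal equations are used here only to keep the sketch dependency-free.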

4  Experiments 

To  demonstrate  the  validity  of  the  new  model 
transfer  approach,  we  have  carried  out  a  num¬ 
ber  of  experiments  on  aerial  photographs  of  mil¬ 
itary  and  urban  sites.  It  is  emphasized  that 
no  ground  control  points  or  three  dimensional 
models  were  available.  The  model  projections 
are  achieved  without  knowledge  of  the  camera 
model  for  any  of  the  images. 

4.1  Ideal  Data 

The  first  experiment  involves  an  ideal  geomet¬ 
ric  wireframe  model.  The  eight  correspondences 
are  selected  as  the  base  structure  in  the  upper 
right  and  lower  left  images  in  Figure  3.  The 
vertices  of  the  small  cube  are  then  transferred 
to  the  third  view  at  the  upper  left.  The  relative 


positions of the views are similar to two nearly
nadir  views  of  a  structure  resulting  in  the  for¬ 
mation  of  a  synthetic  oblique  view.  The  wire 
frame  model  is  expressed  solely  in  terms  of  the 
two  dimensional  vertex  locations  in  each  view. 

4.2  Lockheed  Site 

In  this  experiment,  we  demonstrate  an  impor¬ 
tant  application  of  model  transfer  to  the  prob¬ 
lem  of  change  detection.  We  have  two  views  of 
a  building  on  the  Lockheed  site  in  Sunnyvale, 
CA.  In  a  third  view,  taken  much  earlier,  the 
building  was  not  yet  in  existence.  We  use  manu¬ 
ally  selected  correspondences  from  the  buildings 
nearby  to  generate  the  transfer  vectors.  Then 
points  from  the  footprint  of  the  new  building 
are  transferred  to  the  old  image.  The  building 
location  at  the  time  of  the  original  image  was  an 
empty  field. 

In  Figure  4  we  show  the  two  original  images  and 
a  few  segments  of  the  footprint  of  the  building 
projected  onto  the  older  image.  The  two  out¬ 
lines  illustrate  the  effect  of  using  8,  10  and  12 
reference  points  for  the  transfer.  The  results  for 
10  and  12  points  lie  practically  on  top  of  each 
other. In most of the experiments, there is a
noticeable improvement in projection accuracy
in going from 8 to 10 points, but there is little further
change in going from 10 to 12 reference points.
Again, in this experiment, the reference points
were selected manually.

4.3  Segmentation-Guided  Transfer 

We  then  investigated  the  use  of  automatic  seg¬ 
mentation  techniques  to  derive  the  position  of 
reference  points  in  each  image.  These  experi¬ 
ments  were  carried  out  on  views  of  the  Schenec¬ 
tady  109th  Tactical  Airlift  Base. 

The images were segmented using a modified
form of the Canny [4] edge detector. The modified
version carries out a more thorough analysis
of the intensity near corners than the original
algorithm. Vertices are derived by breaking the
boundary edge chains at points of high local curvature
[1]. An example of the segmentation is
shown in Figure 5. The location of these vertices
is used as the position of reference points
in the model transfer calculations.
Figure 3: Model transfer on ideal data. The image at lower left is an oblique view of the small cube
which illustrates the model transfer process.


Figure  4:  Two  recent  views  of  the  Lockheed  site  are  used  to  define  the  location  of  a  building  in  an 
earlier  view  prior  to  the  building  construction.  The  lower  left  view  illustrates  the  transfer  for  8,  10 
and  12  reference  points;  the  upper  right  view  is  zoomed  out  to  provide  more  context. 
Figure 5: The segmentation of intensity boundaries in views of the Schenectady Air Base. The
segmentation  is  the  result  of  a  modified  Canny  edge  detector  followed  by  corner  detection,  based  on 
curvature. 


In  this  experiment,  the  goal  is  to  transfer  an 
edge  of  the  main  command  building  from  two 
given  model  views  into  a  third,  quite  oblique 
view.  In  an  actual  application,  a  complete 
model  of  the  building  can  be  constructed  and 
transferred.  Here  we  focus  on  one  edge  of 
the  building  to  illustrate  the  accuracy  of  the 
method. 

In  Figure  6  we  illustrate  the  worst  case  perfor¬ 
mance  of  the  segmentation  generated  vertices. 
In  this  view,  the  edge  of  the  building  was  in  a 
shadow,  so  one  boundary  was  not  recovered  by 
the  segmentation  algorithm.  Instead,  the  vertex 
was  associated  with  a  nearby  feature.  The  re¬ 
sulting  trihedral  junction  for  the  corner  of  the 
building  is  shown  in  the  top  right  image. 

Even  with  this  error,  the  model  transfer  was 
achieved  with  quite  reasonable  accuracy.  In  Fig¬ 
ure  7  the  transfer  of  one  edge  of  the  building  is 
shown  for  8,  10  and  12  reference  points.  Again 
there  is  a  significant  change  from  8  to  10  points 
with  little  change  in  going  to  12  points. 


5  Conclusions 

These  experiments  demonstrate  the  practical 
feasibility  of  the  proposed  model  transfer  tech¬ 
nique.  The  segmentation  experiments  were 
quite  encouraging  and  we  have  demonstrated 
that  it  will  be  feasible  to  automate  the  process. 
There  are  a  number  of  obvious  extensions  which 
we  are  now  pursuing. 

•  An  initial  model  transfer  is  used  to  guide 
the  segmentation  of  additional  features. 
In  this  way  we  could  have  used  a  more 
sensitive  edge  extraction  threshold  in  the 
shadow  region  of  the  building  in  the  last 
example. 

•  It  is  also  possible  to  derive  an  equivalent 
form  of  the  transfer  relation  for  lines  instead 
of  points.  Given  that  line  segments  are 
more  accurately  and  consistently  extracted 
by  automatic  segmentation  techniques,  the 
performance  of  the  technique  should  be  im¬ 
proved. 

• Use the invariant relation to automatically
locate corresponding features. It is not necessary
to consider the features as isolated,
unrelated points. In many cases, the footprint
of a building or other structure can be
extracted in a partially connected topology
which provides some grouping of the features
to reduce the combinatorial cost. The
consistency of feature correspondences can
be determined by evaluating the invariant
conditions.


Figure 6: The corner of a building is located using the vertices defined by automatic segmentation.
Note that in the upper right image, one edge of the building is in shadow and the associated vertex
locations are not correctly extracted.


Figure 7: The transfer based on segmentation derived features. The edge of the motor pool is
transferred from two reference views into a third. Each pair of images is used to map onto the
remaining image. The results are shown for 8, 10, and 12 reference points.
Visual measurements of modelled 3D landmarks provide
strong constraints on the location and orientation
of  a  mobile  robot.  To  make  the  landmark-based  robot 
navigation  approach  widely  applicable,  it  is  necessary  to 
be  able  to  automatically  build  the  landmark  models.  A 
substantial  amount  of  effort  has  been  invested  by  com¬ 
puter  vision  researchers  over  the  past  ten  years  on  devel¬ 
oping  robust  methods  for  computing  3D  structure  from 
a  sequence  of  2D  images.  However,  robust  computation 
of  3D  structure,  with  respect  to  even  small  amounts  of 
input  image  noise,  has  remained  an  open  problem.  The 
approach  adopted  in  this  paper  is  one  of  model  extension 
and  refinement.  A  partial  model  of  the  environment  is 
assumed  to  exist  and  this  model  is  extended  over  a  se¬ 
quence  of  frames.  As  will  be  shown  in  the  experiments, 
the  prior  knowledge  of  the  small  partial  model  greatly 
enhances  the  robustness  of  the  3D  structure  computa¬ 
tions.  The  initial  3D  model  may  have  errors  and  these 
are  also  refined  over  the  sequence  of  frames. 

1  INTRODUCTION 

An  important  problem  in  vision  is  to  automatically  build 
3D  models  of  objects  and  scenes.  In  a  previous  paper 
[10],  least-squares  and  robust  methods  were  presented 
for  determining  the  location  and  orientation  of  a  robot 
from  visual  measurements  of  modeled  3D  landmarks. 
However,  building  the  3D  landmark  models  is  a  time 
consuming  and  tedious  affair.  For  the  landmark-based 
navigation  methods  to  be  widely  applicable,  automatic 
methods  have  to  be  developed  to  build  and  enhance  the 
3D  models.  Ideally,  the  robot  would  continuously  build 
and  update  its  world  model  as  it  explores  the  environ¬ 
ment.  This  paper  presents  techniques  to  determine  the 
3D  location  of  image  features  from  a  sequence  of  2D  im¬ 
age  frames  taken  by  a  camera  mounted  on  the  robot.  It 
is  assumed  that  a  prior  partial  model  is  available.  The 
goal  is  to  have  the  robot  extend  and  refine  this  model  as 
it  explores  the  world. 

Extensive  research  has  been  done  in  computer  vision 
to  develop  robust  algorithms  for  extracting  3D  informa¬ 
tion  from  a  sequence  of  2D  images.  Of  the  many  different 

*This research was supported by the following Defense
Advanced  Research  Projects  Agency  grants  DAAE07-91-C- 
R03S,  DACA76-89-C-0017  and  National  Science  Foundation 
grant  CDA-8922572. 


visual  cues  for  extracting  3D  information,  the  two  most 
extensively  researched  are  stereo  and  motion.  The  ba¬ 
sic  principle  exploited  in  both  cues  is  triangulation  (see 
Figure  1).  New  points  are  located  by  triangulating  the 
projection  rays  from  corresponding  points  in  two  or  more 
frames. 
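As a rough illustration of the triangulation principle, the sketch below locates a point from two projection rays by taking the midpoint of the shortest segment joining them. The midpoint rule is one common choice and is an assumption here, not necessarily the estimator used in this paper; `triangulate_midpoint` and its arguments (camera centres `c1`, `c2` and ray directions `d1`, `d2` in a common world frame) are hypothetical names.

```python
def triangulate_midpoint(c1, d1, c2, d2):
    """Midpoint of the shortest segment between rays c1 + t1*d1 and c2 + t2*d2."""
    def dot(a, b):
        return sum(x * y for x, y in zip(a, b))

    r = [a - b for a, b in zip(c1, c2)]  # vector from second centre to first
    a = dot(d1, d1)
    b = dot(d1, d2)
    c = dot(d2, d2)
    d = dot(d1, r)
    e = dot(d2, r)
    den = a * c - b * b
    if abs(den) < 1e-12:
        raise ValueError("rays are (nearly) parallel; no stable triangulation")
    t1 = (b * e - c * d) / den  # parameter of the closest point on ray 1
    t2 = (a * e - b * d) / den  # parameter of the closest point on ray 2
    p1 = [ci + t1 * di for ci, di in zip(c1, d1)]
    p2 = [ci + t2 * di for ci, di in zip(c2, d2)]
    return [(u + v) / 2.0 for u, v in zip(p1, p2)]
```

With noise-free rays the two closest points coincide with the 3D point; with noisy rays the distance between them gives a simple indication of how consistent the two measurements are.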

In  applications  involving  stereo,  two  cameras  sepa¬ 
rated  by  a  baseline  are  used  to  do  the  triangulation. 
The  two  cameras  are  fixed  with  respect  to  each  other 
and  therefore  the  relative  orientation  is  determined  dur¬ 
ing  a  prior  calibration  stage.  Thus,  the  main  problem 
and  focus  of  stereo  research  has  been  to  establish  corre¬ 
spondences  [12]. 

In  two-frame  motion  analysis  both  the  correspon¬ 
dences  and  the  relative  orientation  between  the  two  cam¬ 
era  frames  are  unknown.  Research  in  motion  analysis  has 
classically  been  divided  into  two  steps.  In  the  first  step 
inter-frame  image  displacements  of  image  pixels  and/or 
higher level tokens are computed. The second step, also
known  as  “Structure  from  Motion”  or  “Relative  Orien¬ 
tation”,  is  the  interpretation  of  these  displacements  (or 
correspondences  between  image  tokens)  into  3D  struc¬ 
ture  and  relative  orientation  (rotation  and  translation) 
between  frames  [1,  9]. 

However, due to noise in the measurement process,
results for both stereo and motion analysis from using
just two frames are not very robust [1, 8]. To improve
the robustness of the results, the traditional stereo and
structure  from  motion  techniques  have  been  extended  to 
deal  with  multi-frame  image  sequences  [3,  5,  13,  14,  16], 
under  the  assumption  that  temporal  integration  would 
lead  to  more  robust  results. 

The  multi-frame  research  can  be  categorised  into  two 
broad  classes  or  strategies.  The  first  class  assumes  that 
a  model  of  3D  inter-frame  motion  is  known,  rather  than 
assuming  independent  motion  parameters  between  con¬ 
secutive  frames.  Broida  [5]  assumes  constant  velocity 
motion  and  estimates  the  3D  location  of  a  set  of  points 
tracked  over  a  monocular  image  sequence.  Recently, 
Chandrasekhar et al. [6] have extended Broida’s technique
to deal with data sets where the 3D location of
a few points is known. The objective function which
Broida and Chandrasekhar et al. minimise has the motion
model parameters and the unknown structure locations
as unknowns. Thus the dimension of the objective
function grows with the number of unknown points. An
even more basic limitation of this approach lies in the
model of motion being adopted and its suitability to the
motion being observed.

The  second  class  of  techniques  does  not  assume  any 
model  of  motion.  The  rigid  structure  of  the  world  is 
carried  forward  by  the  depth  estimates  from  frame  to 
frame.  These  techniques  are  sequential  in  nature  and 
typically use Kalman Filtering to compute the depth
estimates [3, 7, 13, 14, 16].

Both Ayache et al. [3] and Zhang et al. [16] build
world models using multi-frame stereo sequences. Zhang
et al. [16] track 3D line segments over a sequence of
stereo image frames and use a Kalman Filter to integrate
the results for a final 3D estimate of the 3D line segment.
To  do  the  temporal  integration,  the  absolute  orientation 
between  successive  stereo-pair  coordinate  frames  is  de¬ 
termined. 

Oliensis  and  Thomas  [13]  use  Horn’s  relative  orien¬ 
tation  algorithm  [9]  to  solve  for  the  motion  parameters 
between  consecutive  image  frames  in  a  monocular  im¬ 
age  sequence.  With  each  image  pair,  new  measurements 
are  made  for  depth  values  of  features  and  these  are  in¬ 
tegrated  with  previous  estimates  in  the  Kalman  Filter 
framework. The new observation Oliensis and Thomas
[13] make is that the depth estimates of different feature
points are correlated since the same noisy motion parameters
are used to compute the depth. Because of
this correlation, they estimate the depth parameters of
all points simultaneously. This gives them fairly good
depth estimates for camera motions having some $T_z$ (i.e.
translation along the optical axis) component. The cost,
however, is that for estimating the depths of “m” points,
a covariance matrix of size (3m x 3m) must be inverted
with each new frame.

Sawhney et al. [14] also use Kalman Filtering to estimate
the depths of “shallow structures” over a monocular
sequence of multiple image frames (shallow structures
are those whose extent in depth is small compared
to their average depth from the camera). The algorithm,
however, cannot handle non-shallow structures. The image
motion of shallow structures can be described by an
affine transform. Based on the affine trackability of an
object, they are able to segment out different shallow
structures in the scene and hence can potentially handle
multiple moving objects. In an experiment reported
in the results section of this paper, an initial model is
built using the 3D points lying on some of the shallow
structures recovered by their algorithm. Using this initial
model, the 3D location of other points in the scene
is estimated by the techniques developed in this paper.
Thus with a combination of techniques presented in this
paper and Sawhney et al.’s [14] technique for 3D recovery
of shallow structures, a fairly robust general motion
technique may be constructed.

1.1  Our  Approach 

The approach adopted here is to first begin with a partial model
(possibly noisy) and to then extend and refine it by viewing the
object over a sequence of frames. Both modeled and unmodeled
features of the object are tracked over the image sequence by
using an optic flow based line tracking algorithm [2, 15].
Correspondences are obtained between the modeled 3D features and
their image projections. Using the flow of image tokens and the
poses of the object computed from model-image feature
correspondences for a sequence of image frames, new points are
located by triangulation (see Figure 1). The triangulation process
is also used to make new 3D measurements of the initial model
points. These measurements are then fused with the previous
estimates to refine the set of initial model points. The approach
adopted here is basically induced stereo. Tracking image features
over a large sequence effectively leads to a large baseline for
stereo and improves the robustness of the 3D reconstructions. Note
that this approach does not require any models of inter-frame
motion.

The key assumption made is that a partial model is available at
the beginning of the process. Due to the availability of the
partial model, new points are located in a stable world coordinate
system. The poses computed for each frame are independent of the
other frames, so each frame provides an independent measurement to
the whole process*. This avoids the cascading problems from which
most sequential multi-frame “structure from motion” techniques
suffer; those problems arise because noisy prior estimates in the
previous frame’s coordinate system are integrated with new
estimates in the current frame’s coordinate system.

The estimation of the new 3D points is done using both batch and
quasi-batch or sequential methods. Triangulation requires at least
two frames and therefore the minimum batch size is two. Results
from batch to batch are integrated by the standard Kalman Filter
covariance based updating equations. Results are presented for
three real data sequences where new 3D points are located with
average errors less than 1.7 %. These results are far superior to
those obtained by the traditional structure from motion techniques
employed in computer vision. This supports the earlier stated
premise that prior knowledge of a partial model greatly extends
the robustness of the structure estimates.

The errors in the initial partial model are assumed to be either
gross errors or Gaussian noise. If gross errors are present in the
3D model, these would be detected as outliers by the robust pose
recovery techniques developed in our earlier paper [10] and would
not be used for the final step of least-squares fitting to the
remaining non-outlier data. Note that outliers can also arise due
to incorrect correspondences. However, if a modeled landmark
appears as an outlier over a large number of frames, then it is
probably due to a gross error in the 3D model and it could
eventually be removed from the 3D model database. Thus for the
remainder of this paper, the noise in the input 3D model is
assumed to be Gaussian. Section 2 extends the least-squares
algorithms for pose determination (presented in [10]) to handle
Gaussian noise both in the 3D model and image measurements.
Section 3 presents the mathematics for locating new points and
refining old points using the computed poses and their respective
variances. Finally, Section 4 presents and analyses results from
real data experiments. Some concluding remarks are presented in
Section 5.

*Note, this would not be true if there was significant noise in
the initial partial model.

2  Pose  Determination 

In  an  earlier  paper  [10]  least-squares  techniques  for  pose 
determination  were  developed.  These  techniques  are  op¬ 
timal  with  respect  to  gaussian  noise  in  the  input  image 
measurements.  In  this  section,  the  least-squares  tech¬ 
niques  are  extended  to  handle  gaussian  noise  in  the  3D 
model.  The  techniques  presented  in  this  section  assume 
point correspondences but are easily modified for line
correspondences.

The  rigid  body  transformation  from  the  world  coor¬ 
dinate  system  to  the  camera  coordinate  system  can  be 
represented  as  a  rotation  (R)  followed  by  a  translation 
(T).  The  point  p  in  world  coordinates  gets  mapped  to 
the  point  pc  in  camera  coordinates: 

$$\vec{p}_c = R(\vec{p}) + \vec{T} \qquad (1)$$

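Using equation (1) and assuming perspective projection, the pose
constraint equations for the i’th point $\vec{p}_i$ in a set of
“m” points can be written in the following manner:

$$\frac{1}{p_{czi}} \vec{C}_{xi} \cdot (R\vec{p}_i + \vec{T}) = 0 \qquad (2)$$

$$\frac{1}{p_{czi}} \vec{C}_{yi} \cdot (R\vec{p}_i + \vec{T}) = 0 \qquad (3)$$

where

$$\vec{C}_{xi} = (s_x, 0, -I_{xi}) \qquad (4)$$

$$\vec{C}_{yi} = (0, s_y, -I_{yi}) \qquad (5)$$

$$p_{czi} = (R\vec{p}_i + \vec{T})_z \qquad (6)$$

$(I_{xi}, I_{yi})$ is the image projection of the point and
$(s_x, s_y)$ is the focal length in pixels along each axis.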
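
As a small illustration of constraints (2) and (3), the following sketch evaluates both residuals for a single point. It assumes NumPy; the function names `project` and `pose_constraints` and the numeric pose, point and focal length are hypothetical values chosen only for the example.

```python
import numpy as np

def project(R, T, p, focal_xy):
    """Perspective projection of world point p under pose (R, T)."""
    p_c = R @ p + T                                  # equation (1)
    return np.array([focal_xy[0] * p_c[0] / p_c[2],
                     focal_xy[1] * p_c[1] / p_c[2]])

def pose_constraints(R, T, p, image_xy, focal_xy):
    """Left-hand sides of constraints (2) and (3) for one point."""
    p_c = R @ p + T                                  # equation (1)
    C_x = np.array([focal_xy[0], 0.0, -image_xy[0]])  # equation (4)
    C_y = np.array([0.0, focal_xy[1], -image_xy[1]])  # equation (5)
    depth = p_c[2]                                   # equation (6)
    return np.array([C_x @ p_c, C_y @ p_c]) / depth

# Hypothetical pose, model point (mm) and focal length (pixels).
R = np.eye(3)
T = np.array([0.0, 0.0, 600.0])
p = np.array([30.0, -20.0, 50.0])
focal = (800.0, 800.0)

xy = project(R, T, p, focal)
exact = pose_constraints(R, T, p, xy, focal)
shifted = pose_constraints(R, T, p, xy + np.array([1.0, 0.0]), focal)
```

With a noise-free image point both residuals vanish; shifting the measured x coordinate by one pixel changes the first residual by one pixel in magnitude, which shows that dividing by the depth expresses the constraint error in image units.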

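Since both the image measurements and the 3D model locations are
assumed to be noisy, it will not be possible to satisfy the above
constraint equations exactly. Let the measurement error in pixels
of image point locations be given by $(\Delta X, \Delta Y)$ and
the error in the 3D model points be given by $\Delta\vec{p}$.
Given a current estimate $R$, $\vec{T}$, the constraint equations
(2, 3) are linearised about the estimate:

$$\frac{1}{p_{czi}} (\vec{C}_{xi} \cdot \Delta\vec{T} + \delta\vec{\omega} \cdot \vec{b}_i) = -\frac{1}{p_{czi}} \vec{C}_{xi} \cdot \vec{p}_{ci} + \eta_x \qquad (7)$$

$$\frac{1}{p_{czi}} (\vec{C}_{yi} \cdot \Delta\vec{T} + \delta\vec{\omega} \cdot \vec{b}_i) = -\frac{1}{p_{czi}} \vec{C}_{yi} \cdot \vec{p}_{ci} + \eta_y \qquad (8)$$

where $\eta_x$ and $\eta_y$ are the noise terms in the two
equations. $\eta_x$ and $\eta_y$ are functions of both the model
noise $\Delta\vec{p}$ and the image noise $\Delta X$, $\Delta Y$:

$$\eta_x = \Delta X + \frac{1}{p_{czi}} \vec{C}_{xi} \cdot (R(\Delta\vec{p}_i)) \qquad (9)$$

$$\eta_y = \Delta Y + \frac{1}{p_{czi}} \vec{C}_{yi} \cdot (R(\Delta\vec{p}_i)) \qquad (10)$$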

Therefore for the i’th point, two such equations (7 and 8) can be
written and for a set of “m” points, a total of “2m” equations is
obtained. This system of “2m” equations is similar to the linear
system of equations (18) described in the Appendix. This linear
system of equations relates the pose increments
$\delta\vec{\omega}$ (rotation) and $\Delta\vec{T}$ (translation)
to the measurement errors computed using the current pose
estimate. At each iteration in the minimisation process, the
linear system of equations is solved to find the best increment
vector. This increment is added to the current pose estimate and
the process is repeated until there is convergence.

In the above system of equations, $(\eta_x, \eta_y)$ represents
the measurement noise. If the correct estimate of pose were known,
$\eta_x$ and $\eta_y$ would be equal to the sum of the measurement
error of the image point location and the projection of the error
in the model point along the image x-axis and y-axis respectively.
The measurement of the image point location is assumed to be
corrupted with zero-mean independent Gaussian noise. In our case,
for lack of any other knowledge, it is assumed that the noise in
the measurements is independent across all points and has the same
variance for every point. The 3D model points are also assumed to
be corrupted by zero-mean independent Gaussian noise. Because the
same model error $\Delta\vec{p}_i$ enters both $\eta_x$ and
$\eta_y$, the noise in the two equations for every point in the
“2m” system of linear equations is correlated. Thus the covariance
matrix “V” corresponding to the noise in the linear system of
equations (18) in the Appendix is a band matrix in which the
non-zero entries are (2 x 2) matrices about the diagonal. The
output covariance matrix for the pose rotation and translation
parameters is given by equation (20) evaluated at the final pose
estimate.
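
One way to see where the (2 x 2) blocks come from is to propagate the two noise sources through equations (9) and (10). The sketch below does this under the added assumption that image noise and model noise are independent of each other; the names `point_noise_block`, `sigma_image` and `cov_model`, and all numeric values, are illustrative and not taken from the paper.

```python
import numpy as np

def point_noise_block(R, T, p, image_xy, focal_xy, sigma_image, cov_model):
    """2x2 covariance of (eta_x, eta_y) for one point, from (9) and (10).

    sigma_image: standard deviation of the image noise in pixels.
    cov_model:   3x3 covariance of the model point error.
    """
    p_c = R @ p + T
    C_x = np.array([focal_xy[0], 0.0, -image_xy[0]])
    C_y = np.array([0.0, focal_xy[1], -image_xy[1]])
    # Rows of J map a model error delta_p to its contribution to eta_x, eta_y.
    J = np.vstack([C_x @ R, C_y @ R]) / p_c[2]
    return sigma_image ** 2 * np.eye(2) + J @ cov_model @ J.T

# Hypothetical set-up: point on the optical axis at 600 mm, imaged off-centre.
R = np.eye(3)
T = np.array([0.0, 0.0, 600.0])
p = np.zeros(3)
focal = (600.0, 600.0)

no_model_noise = point_noise_block(R, T, p, (60.0, 60.0), focal, 1.0,
                                   np.zeros((3, 3)))
with_model_noise = point_noise_block(R, T, p, (60.0, 60.0), focal, 1.0,
                                     np.eye(3))
```

With zero model noise the block reduces to the image variance times the identity. Once model noise is present the off-diagonal entry becomes non-zero, which is the correlation between the two equations of a point described above.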

Using the formula for the best linear unbiased estimate described
in equation (19) in the Appendix, the formula for the pose
increment at any iteration is derived. If the model noise was zero
and the noise in the image measurements were assumed to be the
same for all points, then the input covariance matrix would be an
identity matrix scaled by the variance of the image noise.

3  Induced  Stereo 

In this section, we present techniques for computing 3D estimates
of new points in the world coordinate system from their tracked
image locations over a multi-frame sequence. The mathematics for
both extending the model and refining the initial modeled points
is presented. Computed with the estimate of each new model point
is an estimate of the covariance of its error. These covariances
are functions of the input image measurement covariances and the
initial 3D model point covariances.

Image features (both new features and modeled image features
appearing in the images) are tracked over a sequence of frames
using the computed optic flow between pairs of successive frames
[15]. Typically corners (defined by the intersection of two image
lines) are tracked although any image feature which can be
reliably tracked may be used. The initial matching of image
features to the partial model for the first frame may be done by a
matching process such as in [4]. Combining the results of the
initial matching and the feature tracking, correspondences between
image features and the partial model for each frame are
established. Using these correspondences, pose estimation is done
for each frame using the method presented in the previous section.

Figure 1: Model Extension and Refinement.

The  image  projection  ray  for  an  image  point  in  a 
particular frame is defined as the ray originating from
that  frame’s  optic  center  and  passing  through  the  image 
point.  Given  the  pose  estimates  for  each  frame,  the  vec¬ 
tors  corresponding  to  these  projection  rays  in  the  world 
coordinate  system  can  be  obtained.  The  3D  estimate 
of  the  point  is  the  pseudo-intersection  of  all  the  image 
projection  rays  for  a  tracked  image  point.  In  order  to 
combine  3D  measurements  from  a  sequence  of  frames,  a 
stable  coordinate  frame  should  be  used;  a  nice  property 
of  the  system  described  here  is  that  the  pose  estimation 
process  provides  the  world  coordinate  frame  as  this  sta¬ 
ble  coordinate  frame.  Independent  measurements  can 
be  made  relating  the  coordinate  system  of  each  frame  in 
the  sequence  to  the  world  coordinate  frame. 

Points are located by the pseudo-intersection process
in  two  steps.  In  the  first  step,  a  3D  error  function  is  min¬ 
imised  to  find  an  initial  estimate  of  the  point’s  location. 
This  step,  however,  does  not  yield  the  optimal  estimate 
since  the  various  error  terms  are  not  weighted  by  the 
input  covariances.  In  the  second  step,  an  image-based 
error  function  is  optimised  in  which  the  error  terms  are 
inversely  weighted  by  a  combination  of  the  input  covari¬ 
ances  of  the  pose  estimate  and  the  image  measurements. 

Let $\hat{r}_i$ be the unit vector corresponding to the image
projection ray for an image point in the i’th frame. The pose
estimation for this frame is given by the rotation $R_i$ and
translation $\vec{T}_i$ (see equation (1)). Since the image
projection rays do not intersect at a unique point*, the 3D
pseudo-intersection point $\vec{p}$ is obtained by minimising an
error function E:

$$E = \sum_{i=1}^{n} \left\| \hat{r}_i \times \left( R_i(\vec{p}) + \vec{T}_i \right) \right\|^2 \qquad (11)$$

Therefore the 3D error function E (used in the first step) is the
sum of squares of the perpendicular distances from the
pseudo-intersection point $\vec{p}$ to the image projection rays.
Differentiating E with respect to the unknown variable $\vec{p}$
leads to a set of linear equations, which are then solved to give
the initial estimate for $\vec{p}$.

*Due to noise both in image measurements and pose estimates.
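
The first step can be sketched as a small linear solve. The code below is an illustration rather than the authors' implementation: it assumes NumPy, expresses each ray in world coordinates using the frame's pose, and sets the gradient of the summed squared perpendicular distances to zero. The function name and the two-camera test geometry are hypothetical.

```python
import numpy as np

def pseudo_intersection(rotations, translations, rays_cam):
    """Point minimising the summed squared distance to all projection rays.

    rotations[i], translations[i]: world-to-camera pose of frame i, as in
    equation (1). rays_cam[i]: ray direction in frame i's camera coordinates.
    """
    M = np.zeros((3, 3))
    v = np.zeros(3)
    for R, T, r in zip(rotations, translations, rays_cam):
        d = R.T @ (r / np.linalg.norm(r))   # unit ray direction in world frame
        c = -R.T @ T                        # optic center in world frame
        P = np.eye(3) - np.outer(d, d)      # removes the component along the ray
        M += P
        v += P @ c
    # Zero gradient of E gives (sum of P_i) p = sum of (P_i c_i).
    return np.linalg.solve(M, v)

# Two hypothetical frames: one at the world origin, one shifted 100 mm along x.
true_point = np.array([20.0, 10.0, 500.0])
rotations = [np.eye(3), np.eye(3)]
translations = [np.zeros(3), np.array([-100.0, 0.0, 0.0])]
rays = [R @ true_point + T for R, T in zip(rotations, translations)]

estimate = pseudo_intersection(rotations, translations, rays)
```

With noise-free rays the solve returns the true point. At least two non-parallel rays are needed, otherwise the 3 x 3 system is singular.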

In  the  second  step,  the  pose  constraint  equations  (2, 
3)  are  used  to  formulate  image-based  error  equations  for 
the  X  and  Y  projections  of  the  model  points. 


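$$\frac{1}{p_{cz}} \vec{C}_{xi} \cdot R_i(\vec{p}) = -\frac{1}{p_{cz}} \vec{C}_{xi} \cdot \vec{T}_i + \zeta_x \qquad (12)$$

$$\frac{1}{p_{cz}} \vec{C}_{yi} \cdot R_i(\vec{p}) = -\frac{1}{p_{cz}} \vec{C}_{yi} \cdot \vec{T}_i + \zeta_y \qquad (13)$$

where $\zeta_x$ and $\zeta_y$ are the noise terms in the two
equations. $\zeta_x$ and $\zeta_y$ are functions of both the noise
in pose, $\Delta\vec{T}_i$ and $\delta\vec{\omega}_i$, and the
image noise $(\Delta X, \Delta Y)$:

$$\zeta_x = \Delta X + \frac{1}{p_{cz}} \vec{C}_{xi} \cdot \Delta\vec{T}_i + \frac{1}{p_{cz}} \delta\vec{\omega}_i \cdot \vec{b}_i \qquad (14)$$

$$\zeta_y = \Delta Y + \frac{1}{p_{cz}} \vec{C}_{yi} \cdot \Delta\vec{T}_i + \frac{1}{p_{cz}} \delta\vec{\omega}_i \cdot \vec{b}_i \qquad (15)$$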

In this case the 3D model point $\vec{p}$ is the unknown variable.
The denominator $p_{cz}$ in the equations (12 and 13) corresponds
to the depth of the point and is a function of the unknown
variable $\vec{p}$. Therefore for each frame over which the point
is tracked, two non-linear constraint equations (12 and 13) are
obtained*. An iterative procedure is employed to solve the system
of non-linear equations. At each iteration, the denominator
$p_{cz}$ is held constant using the previous estimate of $\vec{p}$
and the resulting linear system of equations is solved using
equation (19) (see Appendix). The iterative procedure is repeated
until there is convergence. In practice, we have found one
iteration is sufficient for robust results. The input covariance
matrix V required in equation (19) is obtained from the
expressions derived above for the noise terms $\zeta_x$,
$\zeta_y$. The output covariance of the 3D point estimate is given
by equation (20) in the Appendix.

*A minimum of two frames is needed to solve the system of
equations.

In the batch method, information from all frames is used
simultaneously to estimate the 3D locations of tracked image
points. However, it may be desired to sequentially update the
location of new points after every pair (or a larger set) of
frames. In the sequential or quasi-batch mode, equations (12 and
13) are again used to estimate the 3D location of image points
tracked over the current set of frames. However, these new
estimates must be fused with the previous estimates to obtain the
current optimal estimate. Associated with each estimate is a
covariance matrix representing the uncertainty in the estimate.
These covariance matrices are used to fuse the two estimates and
provide a new uncertainty matrix using the standard Kalman
Filtering equations.

Let the estimate of the point’s 3D location and its covariance at
frame “$t_1$” be $\vec{p}(t_1)$ and $\Lambda_p(t_1)$ respectively.
A new 3D location measurement $\vec{Q}$ with uncertainty
(covariance matrix $\Lambda_Q$) is computed from a batch of “n”
image frames. The fused location estimate $\vec{p}(t_n)$ and
updated covariance matrix $\Lambda_p(t_n)$ at frame “$t_n$” are
given by:

$$\vec{p}(t_n) = \Lambda_p(t_n) \left( \Lambda_p(t_1)^{-1} \vec{p}(t_1) + \Lambda_Q^{-1} \vec{Q} \right) \qquad (16)$$

$$\Lambda_p(t_n) = \left( \Lambda_p(t_1)^{-1} + \Lambda_Q^{-1} \right)^{-1} \qquad (17)$$

This same method is used for model refinement. Initial model
points have associated with them their input covariance matrices.
When the model is tracked over a new batch of frames, 3D
measurements can also be made for the model points by the above
pseudo-intersection procedure. These new measurements are fused
with the old estimate using the above equations.
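
The fusion in equations (16) and (17) is short enough to write out directly. The following sketch assumes NumPy; the function name `fuse_estimates` and the example locations and covariances are illustrative only.

```python
import numpy as np

def fuse_estimates(p_prev, cov_prev, q_new, cov_new):
    """Combine a prior 3D estimate with a new measurement, as in (16), (17)."""
    info_prev = np.linalg.inv(cov_prev)
    info_new = np.linalg.inv(cov_new)
    cov_fused = np.linalg.inv(info_prev + info_new)                 # (17)
    p_fused = cov_fused @ (info_prev @ p_prev + info_new @ q_new)   # (16)
    return p_fused, cov_fused

# Hypothetical numbers (mm): two equally uncertain estimates of one point,
# with a larger variance along the depth direction.
p_prev = np.array([0.0, 0.0, 600.0])
q_new = np.array([2.0, 0.0, 610.0])
cov = np.diag([4.0, 4.0, 16.0])

p_fused, cov_fused = fuse_estimates(p_prev, cov, q_new, cov)
```

When both inputs carry the same covariance, the fused location is their mean and the covariance halves. With unequal covariances the less uncertain estimate dominates, which is why points seen over many batches settle progressively.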

3.1 Model Extension and Refinement Algorithm

The algorithm for model extension and refinement using a current
batch size of “n” (n ≥ 2) frames can be summarised as follows:

Step 1 Given a partial 3D model and an image, establish
correspondences between model points and image points using a
matching technique such as in [4].

Step 2 Track image points over the batch of “n” frames using the
computed optic flow between successive pairs of images [15].

Step 3 Using the correspondences established above between model
points and image points, compute the pose for each image frame
using the method described in Section 2.

Step 4 Estimate the 3D location of both new points and initial
model points in world coordinates using the two-step approach
developed in Section 3 and the feature correspondences established
in Step 2 for the current batch of “n” frames.

Step 5 Fuse initial estimates of both the new points and the model
points with any previous estimates using equations (16, 17).

4  Experimental  Results 

This  section  presents  experimental  results  of  applying 
the  model  extension  and  refinement  algorithms  to  two 
real  data  image  sequences.  Figures  2  and  4  show  exam¬ 
ple  images  from  the  BOX  and  A211  sequences  respec¬ 
tively.  In  all  experiments  the  image  center  was  assumed 
to be at the center of the image frame and the effective
focal length was calculated from the manufacturer’s specification
sheets. In another paper [11], we have shown
that  errors  in  the  image  center  do  not  significantly  affect 
the  location  of  new  points  in  a  world  coordinate  system. 

4.1  Box  Sequence 

The BOX sequence was generated by rotating the box (in Fig. 2)
about its central vertical axis while the camera was kept
stationary. Consecutive images in the sequence were taken after a
rotation of approximately 3.6 degrees. In the first frame, the
camera was about 650 mm distant from the top front corner of the
box. The location of 30 points (marked in Fig. 2 by circles and
crosses) in a world coordinate system was measured to an accuracy
of approximately 1 mm along each axis. The depth of the points (in
the first frame’s coordinate system) used in our experiment varied
from 575 mm to 700 mm. The thirty points were tracked over the set
of 8 frames.

The fifteen points marked by crosses in Figure 2 were used as the
initial model for pose estimation [10] in each frame. Various
experiments were performed with different amounts of synthetic
uniform noise added to the measured 3D locations of the cross
points. Using the computed poses, 3D estimates of the remaining 15
points (marked by circles in Figure 2) were computed. In addition,
the initial model of 15 (cross marked) points was refined. The
algorithm described in Section 3.1 was run in a batch mode over
all 8 frames to perform these experiments; the results are
reported in Table 1. The first column of Table 1 gives the range
of noise added to the initial model points. Thus a 10 mm entry in
the first column means uniform noise in the range of +/- 10 mm was
added to each of the 3D coordinates of the model points. The
average error* of the 15 initial model points for each experiment
(prior to any refinement) is given in the second column of
Table 1. The third column in the table shows the results of the
model refinement process; it gives the average output error of the
15 (now refined) initial model points. The fourth column in the
table shows the results of the model extension process; it gives
the average output error of the 15 new (circle) points.

*The average error is the root mean square (RMS) value of the 3D
location error of all points.

As can be seen from the first row in Table 1, the average error
for model extension when there is no noise in the initial model is
1.38 mm. The maximum error was 2.6 mm and the minimum error was
0.44 mm. The average percentage error was 0.25 %. The percentage
error is calculated by dividing the absolute 3D error by the depth
of the point from the origin of the camera in the first image’s
coordinate frame. As the noise in the initial model increases, the
errors in model extension and refinement also increase. However,
except for the first two cases in Table 1, the average output
error for both model extension and refinement was significantly
lower than the average input error of the initial model points.

The  model  extension  and  refinement  algorithm  was 
also  run  in  a  sequential  mode,  where  new  3D  locations 
were  computed  after  every  new  pair  of  frames  and  the  re¬ 
sults  were  fused  with  previous  estimates.  Figure  3  shows 
the  results  of  such  an  experiment.  For  this  experiment, 
the range of input noise was 5 mm and the average error
of  the  initial  model  points  was  4.49  mm  (corresponding 
to  the  fifth  row  in  Table  1).  The  average  output  error 
in  location  of  both  the  initial  model  points  and  the  new 
(circle)  3D  points  is  plotted  for  every  image  frame  in  the 
sequence.  As  can  be  seen  in  the  figure,  the  3D  error  in 
both  the  initial  model  points  and  the  unknown  points 
monotonically  decreases  across  all  frames.  The  average 
error  of  the  new  points  is  reduced  from  6.5  mm  after  the 
first  pair  of  frames  to  about  3.7  mm  at  the  end.  The 
average  error  of  the  initial  points  is  reduced  from  4.49 
mm  to  about  2.8  mm. 


Figure 2: Box Image. The points marked by crosses were used to
compute the 3D pose for each frame. Using these poses, the 3D
location of the numbered points marked by circles is computed.


Figure 3: Box Sequence. Plot of average error over the frame
sequence for the new points (Model Extension) and for the initial
model points (Model Refinement).

4.2  A211  sequence 

The A211 sequence was generated by taking images from a camera
mounted on a mobile robot. The robot was translated roughly along
the optical axis of the camera and 10 image frames were taken, one
after every 0.38 feet. Thus the total translation of the camera
was 3.42 feet. Figure 4 shows the first frame in the image
sequence. Objects in the scene ranged from 8 feet to 20 feet away
in the first image frame.

The initial model in this experiment was built using Sawhney’s
[14] algorithm for segmenting and locating shallow structures*.
Seven points (the points marked by crosses in Figure 4) lying on
shallow structures recovered by this algorithm were used as the
initial model points. The 3D model locations were constructed by
extending the image projection rays in the first image’s
coordinate frame of the seven points to the depth computed by
Sawhney’s algorithm. Thus, the model coordinate frame is the same
as the first image’s coordinate frame.

*Shallow structures are those whose extent in depth is small
compared to their average depth from the camera.

Table 1: Computed average output 3D location errors for model
extension process with noisy input model points for the Box
Sequence of 8 frames. Input Noise to model is synthetic uniform
noise.

  Range         Average       Average Output Noise
  Input Noise   Input Noise   Initial Points   New Points
  (mm)          (mm)          (mm)             (mm)

   0             0.00          0.00             1.38
   1             1.02          1.01             1.69
   2             1.95          1.52             1.92
   3             3.06          2.00             2.23
   5             4.49          3.00             3.78
   7             6.96          3.32             3.84
  10            10.25          4.16             6.31
  20            17.29         10.32            16.23


Figure 4: A211 Image. The points marked by crosses were used to
compute the 3D pose for each frame. Using these poses, the 3D
location of the numbered points marked by circles is computed.

The model extension and refinement algorithm was run in a
sequential mode. Table 2 shows the result of locating the 13 new
points (circled and numbered from 8 to 20 in Figure 4) and
refining the seven initial model points. The ground truth
available for the experiment was only the depths (as opposed to 3D
location) of the points in the first image’s coordinate frame.
Thus the results shown in Table 2 compare the measured depth value
(ground truth) with the recovered depth value. Column 2 in the
table shows the measured depth of the point in the first image
coordinate frame. Columns 3 and 4 show the input error and
percentage error in depth (before model refinement and extension)
respectively. For the new points (Nos. 8 to 20) these two columns
are blank, since no prior estimate is assumed for them. Columns 5
and 6 show the output error and percentage error in depth (after
model refinement and extension) respectively. The percentage error
in depth is computed with respect to the depth in the first
image’s coordinate frame.

The average input error in depths of the seven model points was
0.4 feet (1.85 % error). At the end of the ten frames, the average
error of the 7 initial points was 0.37 feet (1.76 %). The thirteen
new points were located to an average accuracy of 0.4 feet
(1.63 %). In this experiment there was only slight improvement for
the model refinement process; however the model extension process
was fairly accurate in locating new points. If the initial model
given to the model extension process is noise free, then the
average error in recovering the thirteen new points is 0.2 feet
(0.94 %).

Table 2: Absolute and Percentage 3D location errors for points in
A211 sequence (see Fig. 4).

                   INPUT                  OUTPUT
  Pt.    Depth     Abs. Err.   % Err.     Abs. Err.   % Err.
  No.    (ft)      (ft)                   (ft)

  Initial Points
   1     13.4      0.24        1.80 %     0.24        1.78 %
   2     14.6      0.19        1.31 %     0.20        1.34 %
   3     19.0      0.74        3.88 %     0.66        3.46 %
   4     19.0      0.16        0.86 %     0.11        0.60 %
   5     20.4      0.13        0.62 %     0.17        0.86 %
   6     20.4      0.39        1.90 %     0.32        1.60 %
   7     20.4      0.49        2.38 %     0.46        2.25 %

  New Points
   8     13.4      -           -          0.11        0.79 %
   9     13.4      -           -          0.00        0.01 %
  10     14.6      -           -          0.53        3.65 %
  11     19.0      -           -          0.73        3.86 %
  12     19.0      -           -          0.54        2.82 %
  13     19.0      -           -          0.11        0.59 %
  14     19.0      -           -          0.07        0.34 %
  15     20.4      -           -          0.23        1.13 %
  16     20.4      -           -          0.27        1.32 %
  17     20.4      -           -          0.12        0.57 %
  18     20.4      -           -          0.34        1.65 %
  19     20.4      -           -          0.62        3.02 %
  20     20.4      -           -          0.59        2.92 %
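
As a quick arithmetic check on Table 2, the sketch below recomputes two summary errors from the tabulated absolute depth errors. It assumes that the averages quoted for this sequence follow the same root mean square convention stated in the footnote to the Box experiment; the text does not say so explicitly for the A211 sequence, and the helper name `rms` is hypothetical.

```python
import math

def rms(values):
    """Root mean square of a list of errors."""
    return math.sqrt(sum(v * v for v in values) / len(values))

# Absolute depth errors in feet, copied from Table 2.
input_errors_initial = [0.24, 0.19, 0.74, 0.16, 0.13, 0.39, 0.49]
output_errors_new = [0.11, 0.00, 0.53, 0.73, 0.54, 0.11, 0.07,
                     0.23, 0.27, 0.12, 0.34, 0.62, 0.59]

rms_input = rms(input_errors_initial)
rms_new = rms(output_errors_new)
```

Under that assumption, the input errors of the seven model points and the output errors of the thirteen new points both come to roughly 0.4 feet, in line with the figures quoted in the text.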

The  robust  recovery  of  the  location  of  new  3D  points 
depends  on  the  camera  motion.  Optimal  angles  for  tri¬ 
angulation  are  achieved  when  there  is  significant  trans¬ 
lation  parallel  to  the  image  plane.  In  the  A211  sequence, 
the translation of the camera is mostly along the optical
axis. Thus, the FOE (focus of expansion) lies within the
image  plane.  Points  close  to  the  FOE  have  hardly  any 
disparity  and  their  depths  cannot  be  reliably  estimated. 
In  the  BOX  experiment,  the  higher  accuracy  with  which 
3D  parameters  of  the  new  points  were  computed  is  due 
primarily  to  the  fact  that  a  large  component  of  the  mo¬ 
tion  over  the  sequence  is  approximately  parallel  to  the 
image  plane.  Such  motion  is  best  for  accurate  triangu¬ 
lation. 
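
The effect of the motion direction on disparity can be illustrated with a pinhole-camera calculation. The sketch below is an added example, not part of the experiments: it assumes a static point, an FOE at the image centre, and a hypothetical focal length of 500 pixels, and it reuses the 3.42 feet total translation and a 20 feet depth from the A211 description.

```python
def forward_disparity(x_pixels, depth, forward_step):
    """Image shift of a point at image offset x_pixels from the FOE when
    the camera advances forward_step along its optical axis."""
    return x_pixels * depth / (depth - forward_step) - x_pixels

def sideways_disparity(focal_pixels, depth, side_step):
    """Image shift when the camera translates side_step parallel to the
    image plane; it does not depend on where the point is imaged."""
    return focal_pixels * side_step / depth

depth_ft = 20.0
step_ft = 3.42
focal_px = 500.0    # assumed value, for illustration only

at_foe = forward_disparity(0.0, depth_ft, step_ft)
near_foe = forward_disparity(10.0, depth_ft, step_ft)
sideways = sideways_disparity(focal_px, depth_ft, step_ft)
```

A point at the FOE does not move at all under forward motion, and a point 10 pixels away moves by only about 2 pixels, whereas the same translation taken parallel to the image plane would shift every point at that depth by about 85 pixels under the assumed focal length.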

5  Conclusions 

The techniques presented in this paper are preliminary efforts for
model extension and refinement of point data. The experimental
results show that knowledge of a few points can greatly increase
the accuracy of 3D recovery in comparison to traditional
algorithms from motion and stereo analysis. However, the accuracy
of the model extension process depends on the initial accuracy of
the model points. To make the system less sensitive to the initial
accuracy of the model points, one possible solution would be to
couple methods of motion analysis with those of pose recovery.

If the initial model points have a large amount of noise, then the
poses determined for any batch of frames will be highly
correlated. In this case, the 3D location estimates of new points
will be correlated both across all points and also across all
frames. To fully account for this correlation, covariance matrices
whose size is proportional to the number of points times the
number of frames will have to be inverted. In our case, it is
assumed that the initial points do not have significant noise and
hence the cross-correlations can be ignored. But for larger
amounts of noise, it may not be possible to ignore these effects.
These cross-terms are exactly what Oliensis and Thomas [13]
incorporate in their motion analysis paper.

Finally,  the  terms  model  extension  and  refinement  are 
slightly  abused  in  this  paper.  Model  extension  and  re¬ 
finement  are  not  limited  to  just  locating  new  points  in 
the  scene.  Ultimately,  it  is  desired  to  build  3D  surface 
and  volumetric  models  and  integrate  the  new  3D  mea¬ 
surements  with  the  existing  higher  order  models;  this 
has  been  left  for  future  work. 

Appendix 

Some facts from linear system estimation theory are reviewed. An
unknown parameter vector $\vec{x}$ with “p” elements is related to
a set of “n” noisy observations $\vec{y}$ by the following
equation:

$$A\vec{x} = \vec{y} + \vec{\eta} \qquad (18)$$

where $\vec{\eta}$ is zero-mean Gaussian noise with covariance
matrix V. Assume that this set of equations is an over-constrained
system. Then the Best Linear Unbiased Estimate (BLUE) of the
unknown vector $\vec{x}$ is given by:

$$\hat{x} = (A^T V^{-1} A)^{-1} A^T V^{-1} \vec{y} \qquad (19)$$

The covariance matrix “P” of the output parameters is given by:

$$P = (A^T V^{-1} A)^{-1} \qquad (20)$$
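
Equations (19) and (20) translate directly into a few lines of linear algebra. The sketch below assumes NumPy; the function name `blue_estimate` and the two-observation example are illustrative only.

```python
import numpy as np

def blue_estimate(A, y, V):
    """Best linear unbiased estimate and its covariance, per (19) and (20)."""
    V_inv = np.linalg.inv(V)
    P = np.linalg.inv(A.T @ V_inv @ A)    # equation (20)
    x_hat = P @ A.T @ V_inv @ y           # equation (19)
    return x_hat, P

# Two direct observations of one scalar, the second three times as noisy.
A = np.array([[1.0], [1.0]])
y = np.array([1.0, 3.0])
V = np.diag([1.0, 3.0])

x_hat, P = blue_estimate(A, y, V)
```

In this example the estimate is 1.5 rather than the unweighted mean 2.0, because the noisier observation receives a third of the weight, and the output variance of 0.75 is smaller than that of either observation alone.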
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Abstract 

A  method  of  evaluating  hypotheses  about  the 
pose  of  a  known  object  in  an  image,  posterior 
marginal  pose  estimation,  is  described.  A  de¬ 
tailed  metrical  object  model  is  assumed.  A 
probabilistic  model  of  image  features  is  com¬ 
bined  with  a  simple  prior  on  both  the  pose  and 
the  feature  interpretations  to  yield  a  pose  ob¬ 
jective  function.  The  parameters  that  appear 
in  the  probabilistic  models  can  be  derived  from 
images in the application domain. By extremising
the objective function, an estimate of the
pose  of  the  object  in  the  image  results. 

Within this framework, good models of feature
uncertainty  allow  for  robustness  despite  inaccu¬ 
racy  in  feature  detection.  In  addition,  the  rel¬ 
ative  likelihood  of  features  arising  from  either 
the  object  or  the  background  can  be  evaluated 
in  a  rational  way.  It  captures  important  aspects 
of  recognition:  the  amount  of  the  image  that  is 
explained  in  terms  of  the  model,  as  well  as  the 
metrical  consistency  of  the  hypothesis.  It  also 
allows  these  two  aspects  to  be  traded  off  in  a 
rational  way  based  on  domain  statistics. 

The  objective  function  takes  a  simple  form 
when  feature  deviations  are  modeled  by  nor¬ 
mal  densities  and  the  projection  model  is  linear. 
Several linear projection and feature models are
discussed. 

A  preliminary  evaluation  of  the  posterior 
marginal  pose  estimation  objective  function  in 
a  domain  of  synthetic  range  discontinuity  fea¬ 
tures  is  described.  In  that  domain  the  objective 
function  has  a  prominent  sharp  peak  near  the 
correct pose. Some local maxima are also apparent.
Relation to other work and possible extensions are
discussed.

telligence Laboratory of the Massachusetts Institute of Technology
of the Department of Defense under Army contract number
tract N00014-85-K-0134. Summer support was provided by
Group 53 at the Massachusetts Institute of Technology, Lincoln
Laboratory.

1  Introduction 

Vision  problems  are  sometimes  usefully  posed  as  opti¬ 
mization  problems  where  solutions  to  an  objective  func¬ 
tion  are  sought.  Such  an  approach  is  used  by  Barnard 
in stereo matching [1], Blake and Zisserman [2] in image
restoration  and  Beveridge,  Weiss  and  Riseman  [3]  in  line 
segment  based  model  matching.  One  advantage  of  an  ob¬ 
jective  function  approach  is  that  algorithm  design  can  be 
separated  from  the  specification  of  the  computation  to  be 
performed.  Useful  objective  functions  may  be  designed 
based  on  ad-hoc  considerations.  In  such  cases,  plausi¬ 
ble  forms  for  components  of  the  objective  function  are 
often  summed  using  trade-off  parameters.  These  trade¬ 
off  parameters  are  determined  either  empirically  or  by 
guess. 

Statistical estimation has proven useful as a theoretical
framework for deriving objective functions for some
areas  of  vision.  The  work  of  Yuille,  Geiger  and  Bulthoff 
on  stereo  [4]  is  one  example,  while  the  work  of  Geman 
and  Geman  on  image  restoration  [5]  is  another.  One 
advantage  of  deriving  objective  functions  from  statisti¬ 
cal  theories  is  that  the  forms  of  the  objective  function 
components  are  clearly  related  to  specific  probabilistic 
models.  Another  advantage  is  that  the  trade-off  param¬ 
eters  can  then  be  derived  from  measurable  statistics  of 
the  domain. 

In  a  previous  paper  on  MAP  model  matching  [6]  the 
model  matching  problem  was  posed  as  an  optimisation 
problem  resulting  from  a  statistical  theory.  There  the 
solution  to  the  matching  problem  was  the  set  of  parame¬ 
ters  of  the  occurrence  of  some  known  object  in  an  image. 
These  parameters  were  the  position  and  orientation,  or 
pose,  of  the  object,  as  well  as  the  correspondences  be¬ 
tween  model  and  image  features.  The  method  was  shown 
to  provide  effective  evaluations  of  match  and  pose  hy¬ 
potheses.  Tree  search  was  explored  as  a  mechanism  for 
the  optimization  problem.  This  route  has  proved  to  be 
computationally  burdensome. 

The  research  reported  here  is  aimed  at  easing  the  com¬ 
putational  cost  of  the  search.  The  idea  is  to  generate  can¬ 
didate  starting  configurations  for  the  search,  as  a  way  of 
limiting  the  total  amount  of  the  search  space  that  is  ex¬ 
plored.  The  method  involves  looking  first  for  candidate 
poses to seed the tree search over match and pose. This
sort of approach has been used in previous recognition
systems; see the work of Grimson [7] for an example.

In the next section the basics of the MAP model
matching theory are reviewed; the following section then
extends the development to posterior marginal pose
estimation.

2  MAP  Model  Matching 

In  this  section  an  objective  function  for  model  matching 
will  be  derived  using  a  MAP  criterion. 

Briefly, probability densities of image features,
conditioned on the parameters of match and pose (“the
parameters”), are combined with prior densities on the
parameters using Bayes’ rule. The result is a posterior
probability density on the parameters, given an observed
image. An estimate of the parameters is then
formulated by choosing them so as to maximise their a
posteriori probability. (Hence the term MAP. See Beck
and Arnold’s textbook [8] for a discussion of MAP
estimation.) MAP estimators are especially practical when
used with normal probability densities.

In  this  section  the  domain  is  matching  among  gener¬ 
alised  v-dimensional  point  features.  (Section  4  discusses 
several  such  models.)  Specific  probability  densities  are 
assumed  here  in  the  interest  of  simplicity  and  concrete¬ 
ness;  image  features  are  assumed  to  be  mutually  inde¬ 
pendent;  and  matched  image  features  are  assumed  to  be 
normally  distributed  about  their  predicted  positions  in 
the  image. 

Additionally,  unmatched  (background)  features  are 
assumed  to  be  uniformly  distributed  in  the  image.  These 
densities  are  combined  with  a  simple  prior  on  the  pa¬ 
rameters.  A  linear  projection  model  is  introduced  and  a 
simple  MAP  estimator  results. 

Let the image that is to be analysed be represented
by a set of v-dimensional point features

$$Y = \{Y_1, Y_2, \ldots, Y_n\}, \qquad Y_i \in R^v.$$

The model to be matched is also described by a set of
point features; these are represented by real matrices:

$$M = \{M_1, M_2, \ldots, M_m\}.$$

Section 4 describes examples of this type of representation.

The parameters to be estimated in matching are the
correspondences between image and model features, and
the pose of the model instance in the image. The
correspondences are described by an interpretation vector

$$\Gamma = [\Gamma_1, \Gamma_2, \ldots, \Gamma_n], \qquad \Gamma_i \in M \cup \{\perp\}.$$

Here $\Gamma_i = M_j$ means that image feature i corresponds
to model feature j, and $\Gamma_i = \perp$ means that image feature
i is due to the background.

The pose of the model instance in the image, $\beta$, is
a real vector. An associated projection function P is
defined:

$$P(M_i, \beta) \in R^v.$$

P maps model features into the v-dimensional image
according to the model’s pose.


Based on the above, the probability density function
on image features may be written

$$p(Y_i \mid \Gamma, \beta) = \begin{cases} \dfrac{1}{W_1 W_2 \cdots W_v} & \text{if } \Gamma_i = \perp \\ N_{\psi_i}(Y_i - P(\Gamma_i, \beta)) & \text{otherwise} \end{cases} \qquad (1)$$

where $N_{\psi_i}(x) = (2\pi)^{-v/2} |\psi_i|^{-1/2} \exp(-\tfrac{1}{2} x^T \psi_i^{-1} x)$. Here
$\psi_i$ is the covariance matrix associated with image
feature i. Thus image features arising from the background
are uniformly distributed over the space of the image
(the width of the image space along dimension i is given
by $W_i$), and matched image features are normally
distributed about their predicted locations in the image. In
some applications $\psi$ could be a constant, an assumption
that the feature statistics are stationary in the image.
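The two cases of Equation 1 can be sketched as follows for 2-D features (v = 2). This is only an illustration: the function names, the use of None to mark a background feature, and the restriction to 2x2 covariances are assumptions made here for brevity.

```python
import math

def normal_density_2d(x, psi):
    """N_psi(x) for a 2-D deviation x and a 2x2 covariance matrix psi."""
    (a, b), (c, d) = psi
    det = a * d - b * c
    # Quadratic form x^T psi^-1 x, using the closed-form 2x2 inverse.
    q = (d * x[0] * x[0] - (b + c) * x[0] * x[1] + a * x[1] * x[1]) / det
    return math.exp(-0.5 * q) / (2.0 * math.pi * math.sqrt(det))

def feature_density(y, predicted, psi, widths):
    """Density of image feature y as in Equation 1.

    predicted is None when the feature is assigned to the background,
    otherwise it is the projected model feature P(Gamma_i, beta).
    """
    if predicted is None:
        # Background features are uniform over the image space.
        return 1.0 / (widths[0] * widths[1])
    deviation = (y[0] - predicted[0], y[1] - predicted[1])
    return normal_density_2d(deviation, psi)
```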

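Assuming independent features we have

$$p(Y \mid \Gamma, \beta) = \prod_i p(Y_i \mid \Gamma, \beta) = \prod_{i:\,\Gamma_i = \perp} \frac{1}{W_1 W_2 \cdots W_v} \prod_{i:\,\Gamma_i \neq \perp} N_{\psi_i}(Y_i - P(\Gamma_i, \beta)).$$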


Next, a simple prior probability density function is
defined on the pose and interpretation. The probability
that an image feature belongs to the background is B; the
remaining probability is uniformly distributed for
correspondences to the m model features.

$$p(\Gamma_i) = \begin{cases} B & \text{if } \Gamma_i = \perp \\ \dfrac{1 - B}{m} & \text{otherwise} \end{cases} \qquad (2)$$

Prior information on the pose is assumed to be
supplied as a normal density,

$$p(\beta) = N_{\psi_\beta}(\beta - \beta_0)$$

where $N_{\psi_\beta}(x) = (2\pi)^{-z/2} |\psi_\beta|^{-1/2} \exp(-\tfrac{1}{2} x^T \psi_\beta^{-1} x)$. Here
$\psi_\beta$ is the covariance of the pose estimate and the
dimensionality of $\beta$ is denoted by z. With this choice for the
form of the pose prior the system is closed in the sense
that the resulting pose estimate will also be normal. This
is convenient for coarse-fine, as discussed in Section 8. If
little is known about the pose a priori, the prior may be
made quite broad. This is expected to often be the case.

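Then assuming independence of the correspondences
and the pose (before the image is seen), the composite
prior is

$$p(\Gamma, \beta) = N_{\psi_\beta}(\beta - \beta_0) \prod_{i:\,\Gamma_i = \perp} B \prod_{i:\,\Gamma_i \neq \perp} \frac{1 - B}{m}.$$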


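Now using Bayes’ rule, the a posteriori probability
density of $\Gamma$ and $\beta$ may be calculated:

$$p(\Gamma, \beta \mid Y) = \frac{p(Y \mid \Gamma, \beta)\, p(\Gamma, \beta)}{C}$$

where C is a normalising constant independent of $\Gamma$ and
$\beta$.

The MAP strategy is used to obtain estimates of the
correspondences and pose by maximising the posterior
density with respect to $\Gamma$ and $\beta$, as follows

$$\hat{\Gamma}, \hat{\beta} = \arg\max_{\Gamma, \beta}\, p(\Gamma, \beta \mid Y).$$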
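To make the MAP criterion concrete, the sketch below evaluates the logarithm of p(Y | Γ, β) p(Γ) for one candidate interpretation, in a deliberately reduced setting: 1-D point features, a pose consisting of a single translation, a common standard deviation sigma, and a flat pose prior. These simplifications, the function name, and the use of None for the background symbol are assumptions made for illustration only.

```python
import math

def log_joint(image, model, interp, shift, sigma, width, B):
    """ln[p(Y | Gamma, beta) p(Gamma)] for 1-D features and a translation pose.

    interp[i] is None if image feature i is assigned to the background,
    otherwise the index j of the matched model feature.
    The pose prior and the constant ln C are omitted.
    """
    m = len(model)
    total = 0.0
    for y, j in zip(image, interp):
        if j is None:
            # Background: prior B times the uniform density 1 / width.
            total += math.log(B / width)
        else:
            # Matched: prior (1 - B) / m times a normal density about the
            # predicted position; the log density is computed directly
            # to avoid underflow for large deviations.
            dev = y - (model[j] + shift)
            log_density = -0.5 * (dev / sigma) ** 2 - math.log(sigma * math.sqrt(2.0 * math.pi))
            total += math.log((1.0 - B) / m) + log_density
    return total
```

Maximising this quantity over both the interpretation and the pose is the search that the tree-search approach of [6] performs.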
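3 Posterior Marginal Pose Estimation

The following method was motivated by the observation
that, in tree searches of the objective function
of MAP model matching, hypotheses having “poor”
matches scored poorly in the objective function. The
implication was that summing posterior probability over
all matches (at a specific pose) might provide a good
pose evaluator. This has proven to be the case in the
experiment described in Section 6. Additional motivation
was provided by the work by Yuille, Geiger and Bulthoff
on stereo [4]. They discussed computing disparities in a
statistical theory of stereo where a marginal is computed
over matches.

The essence of posterior marginal pose estimation is to
choose the pose that maximises the posterior probability
density of the pose, given an image:

$$\hat{\beta} = \arg\max_{\beta}\, p(\beta \mid Y).$$

The posterior probability density of the pose is computed
from the joint posterior probability on pose and match,
by taking the marginal over the possible matches:

$$p(\beta \mid Y) = \sum_{\Gamma} p(\Gamma, \beta \mid Y).$$

Using Bayes’ rule, as in Section 2, the posterior marginal
may be written as

$$p(\beta \mid Y) = \sum_{\Gamma} \frac{p(Y \mid \Gamma, \beta)\, p(\Gamma, \beta)}{C}.$$

Next, the models for the joint probability densities
of image features, Y, and the priors for $\Gamma$ and $\beta$, as
described in Section 2, are used to express the posterior
marginal of $\beta$ in terms of the component densities:

$$p(\beta \mid Y) = \frac{1}{C} \sum_{\Gamma_1} \sum_{\Gamma_2} \cdots \sum_{\Gamma_n} \prod_i p(Y_i \mid \Gamma_i, \beta) \prod_i p(\Gamma_i)\, p(\beta).$$

Breaking one factor out of the product gives

$$p(\beta \mid Y) = \frac{p(\beta)}{C} \sum_{\Gamma_1} \cdots \sum_{\Gamma_n} \prod_{i=1}^{n-1} \left[ p(Y_i \mid \Gamma_i, \beta)\, p(\Gamma_i) \right] p(Y_n \mid \Gamma_n, \beta)\, p(\Gamma_n),$$

or

$$p(\beta \mid Y) = \frac{p(\beta)}{C} \sum_{\Gamma_1} \cdots \sum_{\Gamma_{n-1}} \prod_{i=1}^{n-1} \left[ p(Y_i \mid \Gamma_i, \beta)\, p(\Gamma_i) \right] \sum_{\Gamma_n} p(Y_n \mid \Gamma_n, \beta)\, p(\Gamma_n).$$

Continuing in similar fashion yields

$$p(\beta \mid Y) = \frac{p(\beta)}{C} \prod_i \left[ \sum_{\Gamma_i} p(Y_i \mid \Gamma_i, \beta)\, p(\Gamma_i) \right].$$

Splitting the $\Gamma_i$ sum into its cases,

$$p(\beta \mid Y) = \frac{p(\beta)}{C} \prod_i \left[ p(Y_i \mid \Gamma_i = \perp, \beta)\, p(\Gamma_i = \perp) + \sum_{M_j} p(Y_i \mid \Gamma_i = M_j, \beta)\, p(\Gamma_i = M_j) \right].$$

Now substituting the densities assumed in the model of
Section 2 in Equations 1 and 2 yields

$$p(\beta \mid Y) = \frac{p(\beta)}{C} \prod_i \left[ \frac{B}{W_1 W_2 \cdots W_v} + \frac{1 - B}{m} \sum_j N_{\psi_i}(Y_i - P(M_j, \beta)) \right].$$

Re-arrangement gives

$$p(\beta \mid Y) = \left( \frac{B}{W_1 W_2 \cdots W_v} \right)^n \frac{p(\beta)}{C} \prod_i \left[ 1 + \frac{W_1 W_2 \cdots W_v}{m} \frac{1 - B}{B} \sum_j N_{\psi_i}(Y_i - P(M_j, \beta)) \right].$$

Next the objective function for posterior marginal pose
estimation is defined as the scaled logarithm of the
posterior marginal probability of the pose, as follows,

$$L(\beta) \equiv \ln \left[ \frac{p(\beta \mid Y)\, C}{\left( \frac{B}{W_1 W_2 \cdots W_v} \right)^n} \right]$$

or, finally,

$$L(\beta) = \ln p(\beta) + \sum_i \ln \left[ 1 + \frac{W_1 W_2 \cdots W_v}{m} \frac{1 - B}{B} \sum_j N_{\psi_i}(Y_i - P(M_j, \beta)) \right]. \qquad (3)$$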
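The objective function of Equation 3 can be evaluated directly, with cost proportional to the number of image features times the number of model features. The following is a minimal sketch for 2-D point features; the isotropic covariance sigma^2 I, the caller-supplied `project` function, the default flat log prior, and all names are assumptions made for illustration.

```python
import math

def pmpe_objective(image, model, pose, project, sigma, widths, B,
                   log_prior=lambda pose: 0.0):
    """L(beta) of Equation 3 for 2-D features with covariance sigma^2 I.

    project(model_feature, pose) returns the predicted (x, y) image position.
    """
    m = len(model)
    # (W1 W2 / m) * ((1 - B) / B): the trade-off factor in Equation 3.
    scale = widths[0] * widths[1] / m * (1.0 - B) / B
    norm = 1.0 / (2.0 * math.pi * sigma * sigma)
    predicted = [project(mj, pose) for mj in model]
    total = log_prior(pose)
    for yx, yy in image:
        density_sum = 0.0
        for px, py in predicted:
            d2 = (yx - px) ** 2 + (yy - py) ** 2
            density_sum += norm * math.exp(-0.5 * d2 / (sigma * sigma))
        # log1p keeps precision when a feature is far from every prediction.
        total += math.log1p(scale * density_sum)
    return total
```

Each image feature contributes a term near zero when it is far from every projected model feature, and a large positive term when it lies close to one, so clutter does not penalise a pose.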


4  Linear  Projection  and  Feature 
Models 

As  noted  above,  the  objective  function  assumes  a  simple 
form  when  the  projection  model  is  linear  in  the  parame¬ 
ters  of  the  projection.  Two  different  feature  and  projec¬ 
tion  models  that  meet  this  criterion  are  described  here. 
They  share  a  model  of  projection,  rotation,  and  scal¬ 
ing  in  the  2-D  plane.  The  features  are  local  and  could 
represent  fragments  of  extended  features  such  as  curves. 
Methods  for  3-D  are  discussed. 
4.1 2-D Point Feature Model

A linear projection model takes the following form:

$$\eta_i = P(M_i, \beta) = M_i \beta. \qquad (4)$$

The pose of the object is represented by $\beta$, the model
feature i by $M_i$, and $\eta_i$ is the projection of the model
feature into the image by pose $\beta$.

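One 2-D point feature and projection model is defined
by

$$\eta_i = \begin{bmatrix} p'_{ix} \\ p'_{iy} \end{bmatrix}, \qquad M_i = \begin{bmatrix} p_{ix} & -p_{iy} & 1 & 0 \\ p_{iy} & p_{ix} & 0 & 1 \end{bmatrix}, \qquad \text{and} \qquad \beta = \begin{bmatrix} \mu \\ \nu \\ t_x \\ t_y \end{bmatrix}.$$

The coordinates of model point i are $p_{ix}$ and $p_{iy}$. The
coordinates of the model point i, projected into the
image by pose $\beta$, are $p'_{ix}$ and $p'_{iy}$. This transformation is
equivalent to a translation by t, rotation by $\theta$, and
scaling by s where

$$s = \sqrt{\mu^2 + \nu^2}, \qquad \theta = \arctan\left(\frac{\nu}{\mu}\right).$$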

This model was used by Faugeras and Ayache in their
vision system HYPER [9]. This representation has an
un-symmetrical way of representing the two classes of
features, which seems odd due to their essential
equivalence; however, the trick facilitates the linear formulation
of projection given in Equation 4.
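The linear form of the 2-D point feature model can be checked numerically. The sketch below builds the 2 x 4 feature matrix and a pose vector from a scale, rotation angle, and translation, then applies the projection of Equation 4. The ordering of the pose vector as (mu, nu, tx, ty) with mu = s cos(theta) and nu = s sin(theta), and the function names, are assumptions consistent with the relations for s and theta given above.

```python
import math

def point_feature_matrix(px, py):
    """M_i for the 2-D point feature model (2 rows, 4 columns)."""
    return [[px, -py, 1.0, 0.0],
            [py, px, 0.0, 1.0]]

def pose_vector(scale, theta, tx, ty):
    """beta = [mu, nu, tx, ty] with mu = s cos(theta), nu = s sin(theta)."""
    return [scale * math.cos(theta), scale * math.sin(theta), tx, ty]

def project(M, beta):
    """Linear projection eta_i = M_i beta."""
    return [sum(m * b for m, b in zip(row, beta)) for row in M]
```

For example, the model point (1, 0) under a scale of 2, a rotation of 90 degrees, and a translation of (3, 4) projects to approximately (3, 6), which agrees with rotating, scaling, and then translating the point.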


4.2  2-D  Oriented-Range  Feature  Model 

A 2-D projection and feature model that incorporates
local information about the coordinates, normal, and
range at a point along a curve of range discontinuity
is defined by

$$\eta_i = \begin{bmatrix} p'_{ix} \\ p'_{iy} \\ c'_{ix} \\ c'_{iy} \end{bmatrix}, \qquad M_i = \begin{bmatrix} p_{ix} & -p_{iy} & 1 & 0 \\ p_{iy} & p_{ix} & 0 & 1 \\ c_{ix} & -c_{iy} & 0 & 0 \\ c_{iy} & c_{ix} & 0 & 0 \end{bmatrix}.$$

The point coordinates and $\beta$ are as above. $c_{ix}$ and $c_{iy}$
are a vector whose direction is perpendicular to the range
discontinuity and pointing away from the discontinuity,
while the length of the vector is the inverse of the range
at the discontinuity. The counterparts in the image are
given by $c'_{ix}$ and $c'_{iy}$. The aggregate feature translates,
rotates and scales correctly.
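A short numerical check illustrates why the aggregate feature behaves correctly: the point part is rotated, scaled, and translated, while the normal part is rotated and scaled but not translated. The sketch assumes the pose ordering (mu, nu, tx, ty) and uses illustrative function names.

```python
import math

def oriented_range_matrix(px, py, cx, cy):
    """M_i for the oriented-range feature model (4 rows, 4 columns)."""
    return [[px, -py, 1.0, 0.0],
            [py, px, 0.0, 1.0],
            [cx, -cy, 0.0, 0.0],   # zeros: the normal vector is not translated
            [cy, cx, 0.0, 0.0]]

def project_feature(M, beta):
    """Linear projection eta_i = M_i beta."""
    return [sum(m * b for m, b in zip(row, beta)) for row in M]

def similarity_pose(scale, theta, tx, ty):
    """beta = [mu, nu, tx, ty] for a given scale, rotation, and translation."""
    return [scale * math.cos(theta), scale * math.sin(theta), tx, ty]
```

Doubling the scale doubles the length of the projected normal vector, that is, it halves the implied range, which is the expected behaviour for an object that appears twice as large.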


4.3  Linear  3-D  Projection  Models 

The  technique  used  in  the  above  two  methods  amounts 
to  making  linear  combinations  of  the  model  with  a  copy 
rotated  90  degrees  in  the  plane. 

In their paper, “Recognition by Linear Combination
of Models” [10], Ullman and Basri describe a scheme
for synthesising views under 3D orthography with
rotation and scale that has a linear parameterisation.
Their method may prove useful within the framework
described here.


5  Feature  Deviation  Models 

In  this  section,  feature  model  stationarity  and  normal 
feature  deviation  models  are  discussed. 

In  the  formulation  presented  in  Sections  2  and  3,  the 
image  feature  covariance  depends  only  on  the  image  fea¬ 
ture  index.  A  simple  extension  would  also  index  the 
covariance  on  the  corresponding  model  feature  in  the  in¬ 
terpretation.  This  becomes  a  useful  method  when  the 
image  counterparts  of  some  model  features  are  known  to 
fluctuate  more  than  others.  In  the  experiments  a  form 
of  stationarity  was  used.  This  is  described  next. 

5.1  Oriented  Stationary  Statistics 

The image feature statistics appear in the objective
function of Equation 3 as the covariance matrices $\psi_i$.
Although such flexibility can be useful, substantial
simplification results by assuming the features’ statistics are
stationary in the image. In its strict form this
assumption may be too limiting, however. This section outlines
a compromise approach that was used in the
implementation described in Section 6.

This  method  involves  attaching  a  coordinate  system  to 
each  image  feature.  The  coordinate  system  has  its  origin 
at  the  point  location  of  the  feature,  and  is  oriented  with 
respect  to  the  direction  of  the  underlying  curve  at  the 
feature  point.  When  (stationary)  statistics  on  feature 
deviations  are  measured,  they  are  taken  relative  to  these 
coordinate  systems. 

When  a  similar  image  is  presented  for  recognition,  the 
constant  feature  covariance  is  specialised  by  rotating  it 
to  orient  it  with  respect  to  each  image  feature. 
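The specialisation step amounts to rotating one constant covariance matrix into the frame of each image feature. A minimal sketch, assuming 2-D covariances and an orientation angle per feature measured in radians (the function name and the angle convention are illustrative):

```python
import math

def rotate_covariance(psi, angle):
    """Return R psi R^T for a 2x2 covariance psi and a rotation by angle."""
    c, s = math.cos(angle), math.sin(angle)
    R = [[c, -s], [s, c]]
    # tmp = R psi
    tmp = [[sum(R[i][k] * psi[k][j] for k in range(2)) for j in range(2)]
           for i in range(2)]
    # result = tmp R^T
    return [[sum(tmp[i][k] * R[j][k] for k in range(2)) for j in range(2)]
            for i in range(2)]
```

For instance, a covariance that is elongated along the curve direction in the feature frame, such as [[4, 0], [0, 1]], becomes [[1, 0], [0, 4]] for a feature whose curve direction is rotated by 90 degrees.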

5.2  Normal  Model  for  Feature  Deviation 

In  the  experimental  domain  of  the  previous  work  de¬ 
scribing  MAP  model  matching  [6]  empirical  evidence  was 
found  that  normal  densities  are  a  reasonable  model  for 
feature  fluctuations.  (The  domain  was  matching  among 
edge  fragments  from  video  imagery.)  Although  the  do¬ 
main  of  the  experiments  reported  here  is  different,  it  is 
believed  that  normal  densities  are  a  reasonable  model 
here  as  well. 

6 Experimental Results

A  preliminary  experiment  investigating  the  utility  of  the 
posterior  marginal  pose  estimation  approach  is  described 
in  this  section. 

The objective function of Equation 3 was sampled in
a  domain  of  synthetic  range  imagery. 

6.1 Preparation of Features

The  features  used  in  the  experiment  were  oriented-range 
features,  as  described  in  Section  4.2.  Two  sets  of  features 
were prepared, the “model features” and the “image
features”.

The model features were derived from a synthetic
range image of a truck that was created using a modified
version of the ray tracing program associated with
the BRL CAD Package [11]. First, range discontinuities
were located in the range image by thresholding
neighboring pixels, yielding range discontinuity curves. These
curves were then segmented into approximately 20-pixel-long
segments via a process of line segment approximation.
The line segments (each representing a fragment
of a range discontinuity curve) were then converted into
oriented-range features in the following manner. The x
and y coordinates of the feature were obtained from the
mean of the pixel coordinates. The normal vector to
the pixels was obtained via least-squares line fitting. The
range to the feature was estimated by taking the mean
of the pixel ranges on the near side of the discontinuity.
This information was packaged into an oriented-range
feature, as described in Section 4.2. The model features
are shown in the first image of Figure 1. Each line
segment represents one oriented-range feature; the ticks on
the segments indicate the near side of the range
discontinuity. There are 113 such model features.

A set of “noisy features” was generated in the following
manner. The synthetic range image described above
was corrupted with simulated laser radar sensor noise,
using a sensor noise model that is described by Shapiro,
Reinhold, and Park [12]. The resulting simulated radar
range image was “restored” via a statistical restoration
method of Wells and Menon [13]. The noisy features
were extracted from the restored image in the manner
described above for the model features. The noisy
features appear in the second image of Figure 1. There are
62 noisy features. Some features have been lost due to
the corruption and restoration of the range image. The
set of “image features” was prepared from the noisy
features by transforming the features according to a test
pose, randomly deleting half of the features, and adding
sufficient randomly generated features so that 1/8 of the
features are due to the model. The 248 image features
appear in the third image of Figure 1.

6.2 Sampling The Objective Function

The objective function of Equation 3 was sampled along
four straight lines passing through the location in pose
space of the test pose. Oriented stationary statistics
were used, as described in Section 5.1. The stationary
feature covariance was estimated from a “hand match”
done with a mouse between the “model features” and
the “noisy features.” The background rate parameter B
was set to 7/8.

Samples taken along a line parallel to the X axis are
shown in Figure 2. The first graph shows samples taken
along a 100 pixel length (the image is 256 pixels square).
The second graph of Figure 2 shows samples taken along
a 10 pixel length, and the third graph shows samples taken
along a 1 pixel length. The X coordinate of the test pose
is 55.5; the third graph shows the peak of the objective
function to be in error by about one twentieth pixel.

Samples taken along a line parallel to the $\mu$ axis are
shown in Figure 3.

Each of the above graphs represents 50 equally spaced
samples. The samples are joined with straight line
segments for clarity. Sampling was also done parallel to the
y and $\nu$ axes with similar results.

The sampling described in this section shows that in
the experimental domain the objective function has a
prominent sharp peak near the correct location. Some
local maxima are also apparent.

7 Related Work

Relationships between MAP model matching and previous
work were described previously [6], including work
by Ayache and Faugeras [9], Goad [14], Lowe [15], Cass
[16], the work of Beveridge, Weiss and Riseman [3], Hanson
and Fua [17] [18], and Yuille, Geiger and Bulthoff [4].

There  is  a  strong  similarity  between  posterior  marginal 
pose  estimation  and  Hough  transform  (HT)  methods. 
Roughly  speaking,  HT  methods  evaluate  parameters  by 
accumulating  votes  in  a  discrete  parameter  space,  based 
on observed features. (See the survey paper by Illingworth
and Kittler [19] for a discussion of Hough
methods.)

In  a  recognition  application,  as  described  here,  the  HT 
method  would  evaluate  a  discrete  pose  by  counting  the 
number of feature pairings that are exactly consistent
somewhere  within  the  cell  of  pose  space.  As  stated,  the 
HT  method  has  difficulties  with  noisy  features.  This 
is  usually  addressed  by  counting  feature  pairings  that 
are  exactly  consistent  somewhere  nearby  the  cell  in  pose 
space. 
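For contrast with the objective function of Equation 3, the vote counting just described can be sketched in a much reduced setting: 1-D point features and a pose consisting of a single translation. Every pairing of an image feature with a model feature votes for the pose cell containing the translation it implies. The 1-D setting, the cell size parameter, and the function name are assumptions made purely for illustration and are not a method from the paper.

```python
import math
from collections import Counter

def hough_translation_votes(image, model, cell):
    """Count feature pairings consistent with each translation cell (1-D)."""
    votes = Counter()
    for y in image:
        for mj in model:
            # The pairing (y, mj) is exactly consistent with translation y - mj.
            votes[math.floor((y - mj) / cell)] += 1
    return votes
```

Each pairing contributes a hard vote of one to a single cell, whereas Equation 3 lets each image feature contribute a smooth amount that depends on the normal feature deviation model.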

The  performance  of  the  HT  in  the  presence  of  noise  is 
debatable, as discussed in [20], p. 220. Perhaps this is
due  to  a  poor  implicit  noise  model. 

Posterior  marginal  pose  estimation  evaluates  a  pose  by 
accumulating  the  logarithm  of  posterior  marginal  prob¬ 
ability  of  the  pose  over  image  features. 

The connection between HT methods and parameter
evaluation via the logarithm of posterior probability has
been described by Stephens [21]. Stephens proposes to
call the posterior probability of parameters given image
observations “The Probabilistic Hough Transform”.
He provided an example of estimating line parameters
where image feature point probability densities were
described as having uniform and normal components. The
work described here differs in the underlying probabilistic
model used. The present work uses the posterior
probability of a pose, given an image, computed as the
marginal over all possible matches to a set of model
features.


This  paper  has  described  a  preliminary  evaluation  of  the 
utility  of  the  posterior  marginal  pose  estimation  objec¬ 
tive  function.  In  the  domain  of  the  experiments  the  ob¬ 
jective  function  is  seen  to  have  a  sharp  peak  in  the  proper 
place,  and  some  local  maxima  are  also  apparent.  An  im¬ 
portant  area  of  further  research  is  finding  effective  opti¬ 
misation  methods  for  the  objective  function,  especially 
methods  amenable  to  parallel  implementation.  Multiple 
scale  search  methods  may  prove  effective  here. 

Applying  the  technique  to  problems  of  3-D  from  2-D 
is  an  important  application  area.  One  approach  to  the 
3-D  problem  is  the  multiple-view  method.  Here,  several 
instances  would  be  run,  each  assuming  a  different  view. 
The  result  from  the  instance  achieving  the  best  score 
would  be  chosen.  This  method  avoids  the  problem  of 
recognition-time computation of feature visibility. The
linear projection models discussed in Section 4 can be
used for specific views in 3-D orthography with scale.
3-D orthography with scale is often a good approximation
of perspective projection. As an alternative to multiple
views, the linear combination of models method
presented by Ullman and Basri [10] could be used. A
hybrid approach that might work well within a coarse-fine
scheme would use the multiple view method at low
resolution and linear interpolation between near views at
high resolution. Another possible method would involve
the conventional (non-linear) formulation of perspective
projection.

Figure 1: Model Features, Noisy Features, and Image Features

Figure 2: Objective Function Samples Along X-Oriented Line Through Test Pose, Lengths: 100 Pixels, 10 Pixels, 1 Pixel

More accurate objective functions would likely result
from partial relaxation of the independence assumptions.
One compromise approach would be Markov-style modeling
of the correspondence probabilities. The probability
of a feature matching some model element would
be conditioned on the match state of its nearest neighbor
features in the image. This would provide a better
model of occlusion, since occlusion is correlated locally.
Markov-style modeling of feature deviations is another
likely improvement.


9  Summary 

A method of estimating the pose of an object in an image,
posterior marginal pose estimation, has been described.
The resulting objective function was seen to
have a simple form when normal feature deviation models
and linear projection models are used. Several linear
feature and projection models were described. Experimental
results were shown indicating that in the experimental
domain of synthetic range discontinuity features,
the objective function has a prominent sharp peak near
the correct pose. Some local maxima were apparent.
Relation to other work and possible extensions were
discussed.
Abstract

In  this  paper  we  find  the  pose  of  an  object 
from  a  single  image  when  the  relative  geom¬ 
etry  of  four  or  more  noncoplanar  visible  fea¬ 
ture  points  is  known.  We  first  describe  an  algo¬ 
rithm,  POS  (Pose  from  Orthography  and  Scal¬ 
ing),  that  solves  for  the  rotation  matrix  and 
the  translation  vector  of  the  object  by  a  lin¬ 
ear  algebra  technique  under  the  scaled  ortho¬ 
graphic  projection  approximation.  We  then  de¬ 
scribe  a  second  algorithm,  POSIT  (POS  with 
ITerations),  that  uses  the  pose  found  by  POS 
to  remove  the  perspective  distortions  from  the 
image,  then  applies  POS  to  the  corrected  im¬ 
age  instead  of  the  original  image.  POSIT  con¬ 
verges  to  accurate  pose  measurements  after  a 
few  cycles  of  image  corrections  and  POS  com¬ 
putations,  even  in  conditions  where  perspective 
distortions  are  large.  POSIT  can  be  used  with 
many  feature  points  at  once  for  added  insensi¬ 
tivity  to  measurement  errors  and  image  noise. 
POSIT  can  be  implemented  in  25  lines  or  less 
in  Mathematica;  the  code  is  provided  in  an  Ap¬ 
pendix. 

1  Introduction 

Computation  of  the  position  and  orientation  of  an  ob¬ 
ject  (object  pose)  using  images  of  feature  points  when  the 
geometric  configuration  of  the  features  on  the  object  is 
known  (a  model)  has  important  applications,  such  as  cal¬ 
ibration,  cartography,  tracking  and  object  recognition. 
Fischler and Bolles [3] have coined the term
Perspective-n-Point problem (PnP) for this type of problem with n
feature points.

Researchers  have  formulated  closed  form  solutions 
when  feature  points  are  considered  in  coplanar  and  non¬ 
coplanar  configurations  [1,  3,  5,  6].  However  these  so¬ 
lutions  may  give  a  false  sense  of  safety  in  practice.  For 
example  the  P4P  problem  with  four  coplanar  points  has 
a  single  solution.  However,  if  the  four  points  are  not 
close  to  the  camera,  a  pose  that  is  the  mirror  image  of 
the  found  pose  with  respect  to  a  plane  parallel  to  the  im¬ 
age  plane  will  project  into  almost  the  same  image.  Thus, 
with  a  small  amount  of  added  random  error,  the  exact 
analytical  solution  will  flip  to  either  pose,  and  will  have  a 


good  chance  of  ending  with  the  wrong  pose.  An  analysis 
which  applies  a  scaled  orthographic  projection  approx¬ 
imation  is  less  deceiving  in  these  situations  because  it 
will  clearly  show  the  two  solutions. 

Pose  computations  which  make  use  of  numbers  of  fea¬ 
ture  points  larger  than  can  be  dealt  with  in  closed  form 
solutions  are  bound  to  be  more  robust  because  the  mea¬ 
surement  errors  and  image  noise  average  out  between  the 
feature  points  and  because  the  pose  information  content 
becomes  highly  redundant.  Notable  among  these  com¬ 
putations  are  the  methods  proposed  by  Tsai  [9]  and  by 
Yuan [11] (these papers also provide good critical reviews
of  photogrammetric  calibration  techniques).  The  ana¬ 
lytical  expressions  in  these  methods  are  independent  of 
the  number  of  feature  points  selected  on  the  object.  The 
methods  proposed  by  Tsai  are  especially  useful  when  the 
focal  length  of  the  camera,  the  lens  distortion  and  the 
image  center  are  not  known.  When  these  parameters 
have  already  been  calibrated,  the  method  proposed  by 
Yuan  is  sufficient. 

The  method  we  describe  here  is  related  to  Yuan’s 
method  in  the  sense  that  it  is  also  based  on  linear  al¬ 
gebra  techniques  and  is  used  with  noncoplanar  points, 
but  our  approach  to  solving  the  problem  is  very  differ¬ 
ent.  We  proceed  in  two  phases. 

•  In  the  first  phase  we  work  with  a  scaled  orthographic 
projection  approximation  (SOP).  Finding  the  rota¬ 
tion  matrix  and  translation  vector  with  this  assump¬ 
tion  is  dramatically  simpler  than  with  the  “true” 
perspective  projection  (TPP)  used  by  Yuan.  We 
call  this  algorithm  “POS”  (Pose  from  Orthography 
and  Scaling).  Our  equations  are  similar  to  those 
given  by  Tomasi  [8]  in  Section  7.1.2.  of  his  thesis 
on  Shape  and  Motion  from  Image  Streams  (when 
he  uses  the  shape  found  from  previous  images);  one 
difference is that we also include the scaling of the
projection and recover the three translation components
of the pose. Ullman and Basri [10] use
the  same  equations  as  we  do  (including  scaling)  but 
instead  of  applying  them  to  computing  the  object 
pose,  they  use  them  to  recognize  new  images  of  an 
object  using  stored  images. 

•  In  a  second  phase  we  iteratively  refine  the  pose  us¬ 
ing  a  somewhat  unusual  approach:  Since  the  POS 
algorithm  requires  an  SOP  image  instead  of  a  TPP 
image to produce an accurate pose, we try to synthesize
an SOP image from the TPP image. The
approximate  pose  from  the  first  phase  or  from  the 
previous  step  allows  us  to  “correct”  the  perspective 
image  into  an  SOP  image.  We  call  this  iterative  al¬ 
gorithm  “POSIT”  (POS  with  ITerations).  Four  or 
five  iterations  are  typically  required  to  converge  to 
an  accurate  pose.  In  this  paper  we  characterize  the 
performance  of  the  algorithm  by  applying  the  algo¬ 
rithm  in  a  large  set  of  simulated  situations  with  in¬ 
creasing  amounts  of  random  image  perturbation  [4]. 
In  all  these  situations  the  algorithm  appears  to  re¬ 
main  stable  and  to  degrade  gracefully  as  image  noise 
is  increased. 

2  Notations 

In  Figure  1,  we  show  the  classic  pinhole  camera  model, 
with  its  center  of  projection  O,  its  image  plane  G  at  a 
distance f (the focal length) from O, its axes Ox and
Oy  pointing  along  the  rows  and  columns  of  the  camera 
sensor,  and  its  third  axis  Oz  pointing  along  the  optical 
axis.  The  unit  vectors  for  these  three  axes  are  called  i,  j 
and  k. 


Figure 1: Perspective projection and scaled orthographic
projection for an object point M_i and a reference point
M_0.

An object with feature points M_0, M_1, ..., M_i, ..., M_n
is positioned in the field of view of the camera. The coordinate
frame of reference for the object is centered at M_0
and is (M_0u, M_0v, M_0w). We call M_0 the reference point
for the object. Only the object points M_0 and M_i are
shown in Figure 1. The shape of the object is assumed
to be known; therefore the coordinates (U_i, V_i, W_i) of the
point M_i in the object coordinate frame of reference are
known.

3  Scaled  Orthographic  Projection  and 
Perspective  Projection 

3.1  Analytical  Definition 

Scaled  orthographic  projection  (SOP)  is  an  approxima¬ 
tion  to  “true”  perspective  projection  (TPP).  For  a  given 
object in front of the camera, one assumes that the
depths Z_i of different points M_i of the object with camera
coordinates (X_i, Y_i, Z_i) are not very different from
one another, and can all be set to the depth Z_0 of the
reference point M_0 of the object (see Figure 1). In SOP,
the image of a point M_i of an object is a point p_i of the
image plane G which has coordinates

$$x'_i = f X_i / Z_0, \qquad y'_i = f Y_i / Z_0,$$

while for TPP an image point m_i would be obtained
instead of p_i, with coordinates

$$x_i = f X_i / Z_i, \qquad y_i = f Y_i / Z_i.$$

The ratio s = f/Z_0 is the scaling factor of the SOP. The
reference point M_0 has the same image m_0 with coordinates
x_0 and y_0 in SOP and TPP. The image coordinates
of the SOP projection p_i can also be written as

$$x'_i = f X_0 / Z_0 + f (X_i - X_0) / Z_0 = x_0 + s (X_i - X_0)$$

$$y'_i = y_0 + s (Y_i - Y_0) \qquad (1)$$
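
The two projection models and Equation (1) can be checked numerically. The following is a minimal sketch, not code from the paper: the function names `tpp` and `sop` and the sample coordinates are illustrative assumptions, and points are given directly in camera coordinates (X, Y, Z).

```python
def tpp(point, f):
    """True perspective projection of a camera-frame point (X, Y, Z)."""
    X, Y, Z = point
    return (f * X / Z, f * Y / Z)


def sop(point, reference, f):
    """Scaled orthographic projection: the depth of every point is
    replaced by the depth Z0 of the reference point."""
    X, Y, _ = point
    Z0 = reference[2]
    return (f * X / Z0, f * Y / Z0)


# Hypothetical sample values (focal length in pixels, distances in cm).
f = 760.0
M0 = (0.0, 0.0, 40.0)   # reference point on the optical axis
Mi = (10.0, 5.0, 50.0)  # another object point, 10 cm deeper

s = f / M0[2]                # scaling factor of the SOP
x0, y0 = tpp(M0, f)          # M0 has the same image in SOP and TPP
xp, yp = sop(Mi, M0, f)      # SOP image point p_i
xt, yt = tpp(Mi, f)          # TPP image point m_i

# Equation (1): p_i is m_0 plus the scaled offset of M_i from M_0.
assert abs(xp - (x0 + s * (Mi[0] - M0[0]))) < 1e-9
assert abs(yp - (y0 + s * (Mi[1] - M0[1]))) < 1e-9
```

With these sample values the SOP image point lies farther from the image center than the TPP image point, because the deeper point M_i is treated as if it were at the depth of M_0.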

3.2  Geometric  Construction  of  SOP 

The geometric construction for obtaining the TPP image
point m_i of M_i and the SOP image point p_i of M_i
is shown in Figure 1. Classically, the TPP image point
m_i is the intersection of the line of sight of M_i with the
image plane G. In SOP, we draw a plane K through
M_0 parallel to the image plane G. This plane is at a
distance Z_0 from the center of projection O. The point
M_i is projected on K at P_i by an orthographic projection.
Then P_i is projected on the image plane G at p_i
by a perspective projection. The vector m_0p_i is parallel
to M_0P_i and is scaled down from M_0P_i by the scaling
factor s = f/Z_0. Equation (1) simply expresses the proportionality
between these two vectors.

4  Object  Pose 

4.1  Problem  Definition 

In a pose computation problem, we look at the inverse
process: the images m_0 and m_i are given and we try
to find where M_0 and M_i are required to be to create
the images m_0 and m_i. In model-based pose computation,
the relative geometric configuration of the points
M_0 and M_i is known. For example, the coordinates
(U_i, V_i, W_i) of M_i are given in the coordinate frame of
reference (M_0u, M_0v, M_0w) of the object centered at the
reference point M_0. The goal is then to find the rotation
matrix R and translation vector T of this object in the
camera coordinate system.

The rotation matrix R for the object is the matrix
whose rows are the coordinates of the unit vectors i, j, k
of the camera coordinate system expressed in the object
coordinate system (M_0u, M_0v, M_0w). The rotation
matrix can be written as

$$R = \begin{bmatrix} i_u & i_v & i_w \\ j_u & j_v & j_w \\ k_u & k_v & k_w \end{bmatrix}$$


4.2  Approximate  Pose  from  SOP  (POS) 

First we find an approximate pose by assuming that the
TPP image points m_i can be approximated by the SOP
image points p_i (Figure 1). Our goal is to recover the
coordinates of the three unit vectors i, j, k in the object
coordinate system using the SOP approximation, since
these coordinates are the elements of the rotation matrix.
The translation vector for the object is the vector OM_0.
Once we find the scaling factor of the SOP, this vector
OM_0 is simply a scaled up version of the image vector
Om_0. We call this pose calculation method POS (Pose
from Orthography and Scaling).

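We modify the two expressions of Equation (1), replacing
the SOP coordinates x'_i and y'_i by the measured image
coordinates x_i and y_i under the approximation above. The
coordinates X_i - X_0 and Y_i - Y_0 of the vector M_0M_i can
be expressed as dot products of M_0M_i with unit vectors
i and j:

$$x_i - x_0 = s\,\mathbf{i} \cdot \mathbf{M_0M_i}, \qquad y_i - y_0 = s\,\mathbf{j} \cdot \mathbf{M_0M_i}$$

We define I and J as scaled down versions of the unit
vectors i and j

$$\mathbf{I} = s\,\mathbf{i}, \qquad \mathbf{J} = s\,\mathbf{j} \qquad (2)$$

which yields

$$x_i - x_0 = \mathbf{I} \cdot \mathbf{M_0M_i}, \qquad y_i - y_0 = \mathbf{J} \cdot \mathbf{M_0M_i} \qquad (3)$$

The unknown scaling factor s no longer appears explicitly
in Equation (3). It will be recovered as the norm of
I or J once these vectors have been computed.

We express the dot products of Equation (3) in terms
of vector coordinates in the object coordinate system:

$$[U_i \; V_i \; W_i]\,[I_u \; I_v \; I_w]^T = x_i - x_0,$$

$$[U_i \; V_i \; W_i]\,[J_u \; J_v \; J_w]^T = y_i - y_0 \qquad (4)$$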

These are linear equations where the unknowns are the
coordinates of vector I and vector J. The other parameters
are known: x_i, y_i, x_0, y_0 are the known coordinates
of m_i and m_0 (images of M_i and M_0) in the camera coordinate
system, and U_i, V_i, W_i are the known coordinates
of the point M_i in the object coordinate system.

Writing Equation (4) for the n object points M_1, M_2, ...,
M_i, ..., M_n other than the reference point M_0, and for
their images, we generate linear systems for the coordinates
of the unknown vectors I and J:

$$A\,\mathbf{I} = \mathbf{x}, \qquad A\,\mathbf{J} = \mathbf{y} \qquad (5)$$

where A is the matrix of the coordinates of the object
points M_i in the object coordinate system and x and y
are the vectors of the x and y coordinates of the image
points m_i offset by the coordinates of the image point
m_0.

Generally, if we have at least three visible points other
than M_0, and all these points are noncoplanar, matrix
A has rank 3, and the solutions of the linear systems in
the least square sense are given by

$$\mathbf{I} = B\,\mathbf{x}, \qquad \mathbf{J} = B\,\mathbf{y}$$

where B is the pseudoinverse of the matrix A.

We  call  B  the  object  matrix.  Knowing  the  geomet¬ 
ric  distribution  of  feature  points  Mi,  we  can  precompute 
this  pseudoinverse  matrix  B,  for  example  by  decompos¬ 
ing  matrix  A  by  Singular  Value  Decomposition  (SVD) 
[7].  This  decomposition  has  the  advantage  of  giving  a 
clear  diagnosis  about  the  rank  and  condition  of  matrix 
A.  In  [2],  we  discuss  the  geometric  situations  for  which 
problems  arise. 

Once we have obtained least square solutions for I
and  J,  the  unit  vectors  i  and  j  are  simply  obtained  by 
normalizing  I  and  J.  As  mentioned  earlier,  the  three 
elements  of  the  first  row  of  the  rotation  matrix  of  the 
object  are  then  the  three  coordinates  of  vector  i  obtained 
in  this  fashion.  The  three  elements  of  the  second  row  of 
the  rotation  matrix  are  the  three  coordinates  of  vector 
j.  The  elements  of  the  third  row  are  the  coordinates  of 
vector  k  of  the  z-axis  of  the  camera  coordinate  system 
and  are  obtained  by  taking  the  cross-product  of  vectors 
i  and  j. 

Now the translation vector of the object can be obtained.
It is vector OM_0 between the center of projection,
O, and M_0, the origin of the object coordinate
system. This vector, OM_0, is aligned with vector Om_0
and is equal to Z_0 Om_0 / f, i.e. Om_0 / s. The scaling factor
s is obtained by taking the norm of vector I or vector
J. The POS method uses at least one more point than is
strictly  necessary  to  find  the  object  pose.  At  least  four 
noncoplanar  points  including  Mo  are  required  for  this 
method,  whereas  three  points  are  in  principle  enough  if 
the  constraints  that  i  and  j  be  of  equal  length  and  or¬ 
thogonal  are  applied  (see  [2]  for  a  simple  pose  solution 
for  three  points).  Since  we  do  not  use  these  constraints 
in  POS,  we  can  verify  a  posteriori  how  close  the  vectors 
i  and  j  provided  by  POS  are  to  being  orthogonal  and  of 
equal  length.  Alternatively,  we  can  verify  these  proper¬ 
ties  with  the  vectors  I  and  J  which  are  proportional  to 
i  and  j  with  the  same  scaling  factor  s.  We  construct  a 
goodness  measure  G,  for  example  as 

$$G = |\mathbf{I} \cdot \mathbf{J}| + |\mathbf{I} \cdot \mathbf{I} - \mathbf{J} \cdot \mathbf{J}|$$

The  goodness  measure  G  becomes  large  when  the  results 
are  poor  and  can  be  used  for  quickly  testing  the  quality 
of  the  computed  pose  or  for  detecting  wrong  correspon¬ 
dences  between  image  points  and  object  points. 
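
The POS steps of this section can be sketched in a few lines of standard-library Python. This is an illustrative sketch under stated assumptions, not the authors' implementation: the helper names are hypothetical, the least-squares solution is obtained from the normal equations A^T A x = A^T b (swapped in for the SVD-based pseudoinverse mentioned above, and less robust for ill-conditioned A), the scaling factor is taken as the mean of the norms of I and J (the text only says "the norm of I or J"), and image coordinates are assumed to be measured from the image center.

```python
import math


def dot(a, b):
    return sum(p * q for p, q in zip(a, b))


def cross(a, b):
    return (a[1] * b[2] - a[2] * b[1],
            a[2] * b[0] - a[0] * b[2],
            a[0] * b[1] - a[1] * b[0])


def det3(m):
    return dot(m[0], cross(m[1], m[2]))


def least_squares(A, b):
    """Least-squares solution of A x = b (3 unknowns) via A^T A x = A^T b."""
    At = list(zip(*A))
    AtA = [[dot(r, c) for c in At] for r in At]
    Atb = [dot(r, b) for r in At]
    d = det3(AtA)  # zero when the points are coplanar (rank of A below 3)
    x = []
    for k in range(3):
        # Cramer's rule: replace column k of AtA by the right-hand side.
        mk = [[Atb[r] if c == k else AtA[r][c] for c in range(3)]
              for r in range(3)]
        x.append(det3(mk) / d)
    return x


def pos(object_points, image_points, f):
    """Approximate pose from SOP. object_points[0] is the reference point
    M_0 and image_points[0] is its image m_0.
    Returns (rows i, j, k of R), translation T, goodness measure G."""
    m0 = object_points[0]
    x0, y0 = image_points[0]
    A = [[p[c] - m0[c] for c in range(3)] for p in object_points[1:]]
    xs = [x - x0 for x, _ in image_points[1:]]
    ys = [y - y0 for _, y in image_points[1:]]
    I = least_squares(A, xs)
    J = least_squares(A, ys)
    goodness = abs(dot(I, J)) + abs(dot(I, I) - dot(J, J))
    norm_i, norm_j = math.sqrt(dot(I, I)), math.sqrt(dot(J, J))
    s = 0.5 * (norm_i + norm_j)  # assumption: average of the two norms
    i = tuple(c / norm_i for c in I)
    j = tuple(c / norm_j for c in J)
    k = cross(i, j)
    T = (x0 / s, y0 / s, f / s)  # OM_0 = Om_0 / s with Om_0 = (x0, y0, f)
    return (i, j, k), T, goodness
```

When the input is an exact SOP image, the recovered I and J are orthogonal and of equal length, so G is zero up to rounding; with a perspective image or with wrong correspondences G grows, which is the quick quality test described above.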

5  From  Approximate  Pose  to  Exact
Pose:  The  POSIT  Algorithm 

5.1  Basic  Idea 

The  POS  method  provides  a  computationally  inexpen¬ 
sive  method  for  directly  obtaining  the  translation  and 
rotation  of  an  object;  the  accuracy  of  POS  may  be  suf¬ 
ficient  for  tracking  the  motions  of  an  object  in  space,  or 
finding  initial  estimates  for  iterative  methods.  In  this 
section,  we  present  one  such  iterative  algorithm,  POSIT 
(POS with Iterations), which uses POS at each iteration
step. Fewer than five iterations are typically sufficient.
The  basic  idea  of  the  iteration  toward  the  exact  solution 
is  the  following: 
If we could build an SOP image of the object feature
points  from  a  TPP  image,  we  could  apply  the  POS  algo¬ 
rithm  to  this  SOP  image  and  we  would  obtain  an  exact 
object  pose. 

Computing  an  exact  SOP  image  requires  knowing  the 
exact  pose  of  the  object.  However,  once  we  have  ap¬ 
plied  POS  to  the  actual  image,  we  have  an  approximate 
depth  for  each  feature  point,  and  we  position  the  feature 
points  on  the  lines  of  sight  at  these  depths.  Then  we 
can  compute  an  SOP  image.  At  the  next  step  we  apply 
POS  to  the  SOP  image  to  find  an  improved  SOP  image. 
Repeating  these  steps  we  converge  toward  an  accurate 
SOP  image  and  an  exact  pose. 

5.2  Finding  an  SOP  image  from  a  TPP  image 

A TPP image point m_i (Figure 1) has coordinates

$$x_i = f X_i / Z_i, \qquad y_i = f Y_i / Z_i$$

and we wish to obtain the corresponding SOP image
point p_i with coordinates

$$x'_i = f X_i / Z_0, \qquad y'_i = f Y_i / Z_0$$

Therefore the coordinates of the SOP point p_i are proportional
to the coordinates of the TPP point m_i with
coefficient Z_i/Z_0. In other words, with C the image center
(the origin of the image coordinates), the SOP vector Cp_i is
aligned with the TPP vector Cm_i and the proportionality
factor is Z_i/Z_0:

$$\mathbf{Cp_i} = \frac{Z_i}{Z_0}\,\mathbf{Cm_i} \qquad (6)$$

5.3  Finding  an  SOP  image  in  POSIT 

After applying POS once using the TPP image, we do
not have the actual Z_i for each point M_i, but we can
compute the approximation

$$Z_i = Z_0 + \mathbf{M_0M_i} \cdot \mathbf{k} \qquad (7)$$

where k is the unit vector along the optical axis Oz. Expressed
in the object coordinate system, k is the third
row of the rotation matrix, and M_0M_i is a known vector
with coordinates (U_i, V_i, W_i). Thus Equation (7) becomes

$$Z_i = Z_0 + k_u U_i + k_v V_i + k_w W_i \qquad (8)$$

and the SOP point p_i is obtained by combining Equations
(6) and (8):

$$\mathbf{Cp_i} = \left(1 + \frac{s}{f}\,(k_u U_i + k_v V_i + k_w W_i)\right) \mathbf{Cm_i} \qquad (9)$$

where we have replaced 1/Z_0 by s/f, the ratio of the
scaling factor found in POS to the camera focal length.
Expression (9) provides at each step of the POSIT algorithm
the approximated positions of the SOP image
points p_i in relation to the image points m_i using the
third row of the computed rotation matrix.
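
The correction of Equation (9) is a per-point rescaling of the image vector. A minimal sketch follows; the function name `correct_to_sop` and its argument layout are hypothetical, and image coordinates are assumed to be expressed relative to the image center unless a `center` is supplied.

```python
def correct_to_sop(image_points, object_points, k, s, f, center=(0.0, 0.0)):
    """Apply Equation (9): Cp_i = (1 + (s/f) * (k . M_0M_i)) * Cm_i.

    image_points  -- TPP image points m_i, with m_0 first
    object_points -- object points M_i in the object frame, with M_0 first
    k             -- third row of the current rotation matrix estimate
    s, f          -- current SOP scaling factor and camera focal length
    """
    cx, cy = center
    m0 = object_points[0]
    sop_points = []
    for (x, y), p in zip(image_points, object_points):
        # k . M_0M_i is the estimated depth offset Z_i - Z_0 of the point.
        depth_offset = sum(kc * (pc - oc) for kc, pc, oc in zip(k, p, m0))
        factor = 1.0 + (s / f) * depth_offset  # estimate of Z_i / Z_0
        sop_points.append((cx + factor * (x - cx), cy + factor * (y - cy)))
    return sop_points
```

For the reference point the depth offset is zero, so its image is left unchanged, consistent with M_0 having the same image in SOP and TPP.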


Figure 2: TPP images (left) and SOP images (right)
of the cube poses computed at successive steps by the
POSIT algorithm. Panel labels: image_1 = TPP(POS(image_0)),
sop_1 = SOP(POS(image_0)), image_2 = TPP(POS(sop_1)).


6  Illustration  of  the  Iteration  Process 
in  POSIT 

To  illustrate  the  iteration  process  of  POSIT,  we  apply 
the  method  to  synthetic  data.  The  object  is  a  cube  of  size 
10  cm.  The  cube  is  assumed  transparent,  and  the  points 
of  interest  are  the  corners  of  the  cube.  We  use  a  full 
cube  in  this  experiment  without  simulating  hidden  faces 
because  it  is  interesting  to  see  the  converging  projections 
of  the  parallel  edges  in  the  TPP  image  being  transformed 
into  parallel  projections  in  the  SOP  image  (in  fact  it  is 
not  difficult  to  do  actual  experiments  with  eight  corners 
of  a  cube,  using  light  emitting  diodes  mounted  in  a  cubic 
arrangement  on  a  transparent  frame).  The  image  size  is 
assumed  to  be  512  x  512  pixels,  and  the  focal  length 
is  760  pixels,  providing  an  angular  field  of  view  of  37 
degrees  with  the  assumed  image  size.  The  corners  of  the 
cubes  are  at  a  distance  between  three  and  four  times  the 
cube  size  from  the  center  of  projection  of  the  camera. 
The  projection  on  top  of  Figure  2  is  the  given  image  for 
the  cube.  We  can  write  this  operation  as 

image_0 = TPP(cube).

The  projections  of  the  cube  edges  are  shown,  although 
they  are  not  used  by  the  algorithm.  The  enclosing  square 
is  the  boundary  of  the  512  x  512  pixel  image  area.  Be¬ 
cause  the  distance-to-size  ratio  for  the  cube  is  small,  the 
cube  image  shows  strong  convergence  of  image  edges  for 
the  cube  edges  almost  parallel  to  the  camera  optical  axis. 
With  such  an  image  the  POS  algorithm  does  not  give  a 
good  approximation  of  the  rotation  and  translation  of 
the  cube.  One  can  have  an  idea  of  the  success  of  the 
POS  algorithm  by  computing  a  TPP  image  of  the  cube 
at  the  found  pose.  The  three  projections  of  the  left  col¬ 
umn  in  Figure  2  are  such  TPP  projections  at  successive 
iterations.  The  transformation  which  gave  the  top  left 
projection  can  be  written  as 

image_1 = TPP(POS(image_0))

where image_0 is the given image and POS() represents
the inverse perspective operation found by the POS algorithm.
As can be seen from this projection, the world
points found by POS were not very good. Note that
POSIT does not compute image_1. Instead, POSIT computes
an SOP image using only the actual image corners
and  the  depths  it  computed  for  the  corners.  This  image 
is  shown  on  the  top  right  in  Figure  2.  Equation  (9)  is 
used.  This  operation  is  written  as 

sop_1 = SOP(POS(image_0))

Notice  that  in  the  resulting  projection  the  SOP  images  of 
parallel  edges  of  the  cube  are  not  yet  quite  parallel.  The 
reason  is  that  the  found  corner  depths  are  still  approxi¬ 
mations.  The  next  iteration  of  POSIT  uses  this  SOP  im¬ 
age,  and  at  this  step  it  finds  a  shape  closer  to  the  cube, 
as  illustrated  by  the  fact  that  its  perspective  projection 
image_2 (again, this image is not actually computed by
the algorithm) is very close to the original image. The
operation for finding image_2 is

image_2 = TPP(POS(sop_1))


The  next  SOP  image  found  by  POSIT  has  parallel  edges. 
This  image  is 

sop_2 = SOP(POS(sop_1))

The  next  iteration,  illustrated  at  the  bottom  of  Figure  2, 
brings  only  minor  improvements. 

7  Protocol  of  Performance 
Characterization 

In  this  section,  we  try  to  follow  the  recommendations 
of  Haralick  for  performance  evaluation  in  computer  vi¬ 
sion  [4].  We  compute  the  orientation  and  position  errors 
of the POS and POSIT algorithms for two objects at ten
distances  from  the  camera  with  40  random  orientations 
for  each  distance.  Synthetic  images  are  created  with 
three  levels  of  image  noise  for  each  combination,  and  the 
poses  of  the  object  computed  by  POS  and  POSIT  from 
these  images  are  compared  with  the  actual  poses. 

7.1  Objects 
The  two  objects  are 

1.  A  configuration  of  four  points  from  a  10  cm  cube, 
one  corner  taken  as  reference  point,  and  its  three  ad¬ 
jacent  corners.  The  four  points  form  a  tetrahedron 
(Figure  3,  top). 

2.  The  eight  corners  of  a  10  cm  cube.  The  reference 
point  is  one  of  the  corners  (Figure  3,  bottom). 

The  size  of  both  objects  is  taken  to  be  10  cm  and  is 
used  to  measure  the  displacement  of  the  objects  with  re¬ 
spect  to  the  camera.  Note  however  that  the  maximum 
distance  between  two  points  in  the  tetrahedron  configu¬ 
ration  is  also  10  cm,  whereas  the  maximum  distance  for 
the  cube  is  17.3  cm,  between  two  diagonal  corners.  Thus 
the  cube  image  may  be  larger  than  the  tetrahedron  in 
some  orientations  at  the  same  distance  from  the  camera, 
and  be  more  distorted  by  perspective. 

7.2  Object  Positions 

The  reference  points  of  the  objects  are  positioned  on  the 
optical  axis  at  ten  distances  from  the  center  of  projec¬ 
tion,  from  40  cm  to  400  cm,  i.e.  from  four  times  to  40 
times  the  size  of  the  objects  (Figure  3).  These  distance- 
to-size  ratios  are  used  as  the  horizontal  coordinates  on 
all  the  orientation  and  position  error  plots. 

Around  each  of  these  reference  point  positions,  the 
objects  are  rotated  into  40  random  orientations.  The 
rotation matrices defining these 40 orientations are computed
from three Euler angles chosen by a random number
generator in the range (0, 2π).
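
The generation of random orientations can be sketched as follows. The text does not state which Euler convention was used, so the Z-Y-X composition below (R = Rz(a) Ry(b) Rx(c)) is an assumption, as are the function names and the fixed seed.

```python
import math
import random


def euler_to_matrix(a, b, c):
    """Rotation matrix R = Rz(a) * Ry(b) * Rx(c), angles in radians.
    The Z-Y-X convention is an assumption, not taken from the paper."""
    ca, sa = math.cos(a), math.sin(a)
    cb, sb = math.cos(b), math.sin(b)
    cc, sc = math.cos(c), math.sin(c)
    return [
        [ca * cb, ca * sb * sc - sa * cc, ca * sb * cc + sa * sc],
        [sa * cb, sa * sb * sc + ca * cc, sa * sb * cc - ca * sc],
        [-sb, cb * sc, cb * cc],
    ]


def random_orientations(count, seed=0):
    """Draw `count` rotation matrices from three Euler angles, each
    chosen uniformly in (0, 2*pi)."""
    rng = random.Random(seed)
    return [euler_to_matrix(*(rng.uniform(0.0, 2.0 * math.pi) for _ in range(3)))
            for _ in range(count)]


orientations = random_orientations(40)  # 40 orientations per distance
```

Note that drawing Euler angles uniformly does not sample rotations uniformly over the rotation group; it reproduces the protocol described here rather than an unbiased sampling of orientations.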

7.3  Image  Generation 

We  obtain  images  by  perspective  projection  with  a  focal 
length  of  760  pixels.  Here  we  do  not  clip  the  image,  in 
order  to  allow  for  large  offsets  of  the  images.  When  the 
reference  point  of  the  cube  is  40  cm  from  the  image  plane 
on  the  optical  axis  and  when  the  cube  is  completely  on 
one  side  of  the  optical  axis,  the  point  at  the  diagonal 
of  the  cube  may  be  30  cm  from  the  image  plane  and 
have an image 355 pixels from the image center. Only
a wide-angle camera with a total angular field of more
than 50 degrees would be able to see the whole cube in
this position.

Figure 3: Object and parameter definitions for rotation
and translation error estimations. Top: Tetrahedron.
Bottom: Cube.
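
The field-of-view figures quoted in the text follow from the pinhole relation between an image offset and the focal length. A short sketch (the function name is illustrative):

```python
import math


def total_field_of_view_deg(half_extent_px, focal_px):
    """Total angular field (degrees) needed to image a point located
    half_extent_px from the image center, for a focal length in pixels."""
    return math.degrees(2.0 * math.atan(half_extent_px / focal_px))


# A 512-pixel-wide image with a 760-pixel focal length (half width 256).
fov_image = total_field_of_view_deg(256.0, 760.0)

# An image point 355 pixels from the image center, same focal length.
fov_corner = total_field_of_view_deg(355.0, 760.0)
```

The first value is close to the 37 degrees quoted for the 512 x 512 image, and the second is slightly above 50 degrees, in line with the statement about the wide-angle camera.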

We  specify  three  levels  of  random  perturbation  and 
noise  in  the  image.  At  noise  level  1,  the  real  numbers 
computed  for  the  coordinates  of  the  perspective  projec¬ 
tions  are  rounded  to  integer  pixel  positions.  At  noise 
level  2,  the  integer  positions  of  the  lowest  level  are  dis¬ 
turbed  by  vertical  and  horizontal  perturbations  of  ±  1 
pixel  around  the  integer  positions.  These  are  created  by 
a  uniform  random  number  generator.  At  noise  level  3, 
the amplitude of the perturbations is ± 2 pixels. When
the  objects  are  at  400  cm  from  the  camera,  the  image 
may  be  as  small  as  20  pixels,  and  a  perturbation  of  two 
pixels  on  each  side  of  the  image  produces  a  20%  pertur¬ 
bation  in  image  size. 
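
The three noise levels can be sketched as below. The text says the perturbations come from a uniform random number generator but not whether they are continuous or integer-valued; the continuous uniform draw in [-amplitude, +amplitude] is an assumption, as is the function name.

```python
import random


def add_image_noise(points, level, rng):
    """Return perturbed image points for noise level 1, 2 or 3.

    Level 1: coordinates rounded to integer pixel positions.
    Level 2: level 1 plus uniform perturbations of up to 1 pixel.
    Level 3: level 1 plus uniform perturbations of up to 2 pixels.
    """
    amplitude = {1: 0.0, 2: 1.0, 3: 2.0}[level]
    noisy = []
    for x, y in points:
        qx, qy = round(x), round(y)  # quantization to the pixel grid
        if amplitude > 0.0:
            qx += rng.uniform(-amplitude, amplitude)
            qy += rng.uniform(-amplitude, amplitude)
        noisy.append((qx, qy))
    return noisy


rng = random.Random(0)  # fixed seed so that an experiment can be repeated
noisy_points = add_image_noise([(10.4, -3.6), (151.7, 76.2)], 3, rng)
```

Python's built-in `round` rounds exact halves to the nearest even integer, which differs from round-half-up only for coordinates ending in exactly .5.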

7.4  Orientation  and  Position  Errors 

For  each  of  the  images,  the  orientation  and  position  of 
the  object  are  computed  by  the  POS  algorithm,  then 
by the POSIT algorithm for four more iterations (a total
of five steps including the initial POS step). These
orientations and positions are compared to the actual
orientation  and  position  of  the  object  used  to  obtain  the 
image.  We  compute  the  axis  of  the  rotation  required  to 
align  the  coordinate  system  of  the  object  in  its  actual 
orientation  with  the  coordinate  system  of  the  object  in 
its  computed  orientation.  The  orientation  error  is  de¬ 
fined  as  the  rotation  angle  in  degrees  around  this  axis 
required  to  achieve  this  alignment.  Details  of  this  com¬ 
putation  are  given  in  [2].  The  position  error  is  defined 
as  the  norm  of  the  translation  vector  required  to  align 
the  computed  reference  point  position  with  the  actual 
reference  point,  divided  by  the  distance  of  the  actual 
reference  point  position  from  the  camera.  Thus  the  po¬ 
sition  error  is  a  relative  error,  whereas  the  orientation 
error  is  a  measure  in  degrees. 
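
The two error measures can be sketched as follows. The details of the authors' computation are in [2] and are not reproduced here; this sketch uses the standard axis-angle relation, in which the angle of the relative rotation D = R_computed R_actual^T satisfies cos(angle) = (trace(D) - 1) / 2, and it assumes both matrices are proper rotations. Function names are illustrative.

```python
import math


def orientation_error_deg(R_actual, R_computed):
    """Angle (degrees) of the single rotation aligning the actual
    orientation with the computed one."""
    # trace(R_computed * R_actual^T) is the sum of element-wise products.
    trace = sum(R_computed[r][c] * R_actual[r][c]
                for r in range(3) for c in range(3))
    cos_angle = max(-1.0, min(1.0, (trace - 1.0) / 2.0))  # guard rounding
    return math.degrees(math.acos(cos_angle))


def relative_position_error(T_actual, T_computed):
    """Norm of the translation between computed and actual reference
    point positions, divided by the actual distance from the camera."""
    diff = math.sqrt(sum((a - c) ** 2 for a, c in zip(T_actual, T_computed)))
    distance = math.sqrt(sum(a * a for a in T_actual))
    return diff / distance
```

For example, a computed reference point at (0, 3, 104) cm for an actual position of (0, 0, 100) cm gives a relative position error of 0.05, i.e. 5%.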

7.5  Combining  the  Results  of  Multiple 
Experiments 

As  mentioned  above,  for  each  distance-to-size  ratio, 
many  rotations  are  considered.  We  compute  the  average 
and  standard  deviation  of  the  orientation  and  position 
errors  over  all  these  rotations  and  plot  the  averages  with 
their  standard  deviation  error  bars  as  a  function  of  the 
distance-to-size  ratios.  Each  plot  shows  the  results  both 
for  POS,  and  for  POSIT  after  five  iterations.  The  plots 
for the orientation error are shown in Figure 4, and the
plots  for  the  position  errors  are  shown  in  Figure  5.  In 
each  of  these  two  figures,  the  plots  in  the  left  column 
are  for  the  tetrahedron,  and  the  plots  in  the  right  col¬ 
umn  are  for  the  cube.  The  top  diagrams  are  for  the 
lowest  image  noise  level,  the  middle  diagrams  for  the 
medium  noise  level,  and  the  bottom  diagrams  for  the 
highest  noise  level. 

8  Analysis  of  the  Pose  Error  Diagrams 

8.1  Comparison  between  POS  and  POSIT

POSIT  provides  dramatic  improvements  over  POS  when 
the  objects  are  very  close  to  the  camera,  and  almost  no 
improvements  when  the  objects  are  far  from  the  camera. 
When  the  objects  are  close  to  the  camera,  the  so-called 
perspective  distortions  are  large,  and  the  approximation 
that the image is an SOP is poor; therefore the performance
of POS is poor. For the object reference points
at the shortest distance-to-size ratio (4), errors in orientation
evaluation are in the 10° region, and errors in
position evaluation are in the 10% region. When the objects
are very far, there is almost no difference between
SOP and TPP; thus POS gives the best possible results,
and POSIT cannot improve upon them. POS gives its
best performance for distances around 30 times the object
size for low image noise, and around 20 times for
high image noise, with orientation errors in the 3° region
and position errors in the 3% region. At very short to
medium range and low to medium noise, POSIT gives
poses with less than 2° rotation errors and less than 2%
position errors.

8.2  Variation  of  Noise  Effects  with  Range 

At  short  range,  the  images  of  the  objects  are  large  and 
perturbations of a few pixels are small compared to the
total size of the image. At long range the image occupies
only around 20 pixels and perturbations of a few pixels
may change the relative distribution of the image points.
Therefore in the diagrams with larger noise levels, the
pose errors of the algorithms increase as the distance
ratios increase from 30 to 40.

Figure 4: Angular orientation errors at various distances
for a tetrahedron (left) and for a cube (right) at three
image noise levels (quantization, ±1 pixel, ±2 pixels).

8.3  Comparison  between  Cube  and 
Tetrahedron 

The  long  error  bars  at  short  range  for  POS  are  due  to 
the  fact  that  the  apparent  image  size  can  be  very  differ¬ 
ent  depending  on  the  orientation.  For  example,  the  cube 
looks  like  an  object  of  size  10  cm  when  a  face  is  paral¬ 
lel  to  the  image  plane,  but  one  dimension  is  70%  larger 
when  a  cube  diagonal  is  parallel  to  the  image  plane.  In 
this  last  configuration,  the  reference  point  projects  at  the 
image  center  whereas  the  opposite  corner  is  offcentered 
by  more  than  323  pixels,  with  a  large  resulting  perspec¬ 
tive  distortion.  The  tetrahedron  does  not  have  as  large 
apparent  size  changes,  which  explains  the  fact  that  at 
short  viewing  distance  the  average  error  and  standard 
deviation  produced  by  POS  are  smaller  for  this  shape 
than  for  the  cube.  This  is  more  an  artifact  of  the  prob¬ 
lem  of  defining  object  size  with  a  single  number  than  a 
specific  advantage  of  the  tetrahedron  over  the  cube. 

Figure 5: Relative position errors at various distances
for  a  tetrahedron  (left)  and  for  a  cube  (right)  at  three 
image  noise  levels  (quantization,  ±1  pixel,  ±2  pixels). 

At  high  noise  level  and  long  range,  the  performance 
with  the  cube  becomes  almost  twice  as  good  as  with 
the  tetrahedron  for  POS  and  POSIT,  because  the  least 
square  method  averages  out  the  random  errors  on  the 
points,  and  the  averaging  improves  when  more  points 
are  made  available  to  the  method. 

9  Convergence  Analysis 

With  the  distance-to-size  ratios  used  in  the  rotation  and 
translation  error  evaluations  above,  the  POSIT  algo¬ 
rithm  would  converge  in  four  or  five  iterations.  The  con¬ 
vergence  test  in  the  POSIT  algorithm  consists  of  quan¬ 
tizing  (in  pixels)  the  coordinates  of  the  image  points  in 
the  SOP  images  obtained  at  successive  steps,  and  ter¬ 
minating  when  two  successive  SOP  images  are  identical 
(see  Appendix  A). 
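
The complete iteration, including the quantization-based convergence test, can be sketched as follows. This is a sketch under stated assumptions rather than the code of Appendix A: helper names are hypothetical, the least-squares step uses the normal equations instead of a precomputed SVD pseudoinverse, the scaling factor is the mean of the norms of I and J, and image coordinates are measured from the image center.

```python
import math


def dot(a, b):
    return sum(p * q for p, q in zip(a, b))


def cross(a, b):
    return (a[1] * b[2] - a[2] * b[1],
            a[2] * b[0] - a[0] * b[2],
            a[0] * b[1] - a[1] * b[0])


def det3(m):
    return dot(m[0], cross(m[1], m[2]))


def solve_normal(A, b):
    """Least-squares solution of A x = b via A^T A x = A^T b (Cramer's rule)."""
    At = list(zip(*A))
    AtA = [[dot(r, c) for c in At] for r in At]
    Atb = [dot(r, b) for r in At]
    d = det3(AtA)
    return [det3([[Atb[r] if c == k else AtA[r][c] for c in range(3)]
                  for r in range(3)]) / d for k in range(3)]


def pos(A, image, f):
    """One POS step: returns rotation rows (i, j, k), scale s, translation T."""
    x0, y0 = image[0]
    I = solve_normal(A, [x - x0 for x, _ in image[1:]])
    J = solve_normal(A, [y - y0 for _, y in image[1:]])
    ni, nj = math.sqrt(dot(I, I)), math.sqrt(dot(J, J))
    s = 0.5 * (ni + nj)  # assumption: average of the two norms
    i, j = [c / ni for c in I], [c / nj for c in J]
    return (i, j, cross(i, j)), s, (x0 / s, y0 / s, f / s)


def posit(object_points, image_points, f, max_steps=100):
    """Iterate POS on successively corrected SOP images.
    Returns the rotation rows, the translation and the number of steps."""
    m0 = object_points[0]
    vectors = [[p[c] - m0[c] for c in range(3)] for p in object_points]
    A = vectors[1:]
    sop, previous = list(image_points), None
    for step in range(1, max_steps + 1):
        R, s, T = pos(A, sop, f)
        # Equation (9), always applied to the original TPP image points.
        sop = [((1.0 + (s / f) * dot(R[2], v)) * x,
                (1.0 + (s / f) * dot(R[2], v)) * y)
               for v, (x, y) in zip(vectors, image_points)]
        # Convergence test: quantize to pixels and compare with the last step.
        quantized = [(round(x), round(y)) for x, y in sop]
        if quantized == previous:
            break
        previous = quantized
    return R, T, step
```

A call such as `posit(model, image, 760.0)` takes the object points with the reference point first and the matching image points in the same order. The `max_steps` bound is a safeguard added for this sketch, for configurations in which successive quantized SOP images never become identical.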

Calculations with 1D images of a 2D world show that
the quantities determining the algorithm convergence are
ratios of image coordinates over focal length. When all
the ratios are smaller than 1, the algorithm converges.
Therefore with a camera with a 90 degree total field of
view, the algorithm would converge with all possible image
points. When these ratios are larger than 1 for all
the coordinates, the algorithm diverges. Thus with an
object with all its image points at the periphery of the
field of a 110 degree camera the algorithm would diverge.
In mixed situations with small and large ratios, mixed
results are obtained as to whether the small ratios win
convergence over the large ratios or not. Therefore it is
still possible to obtain convergence with image points at
the periphery of a very wide-angle field if some of the
image points are close to the center of the field.

Figure 6: Number of iterations for POSIT as a function
of distance to camera (horizontal axis: distance to camera /
object size). Top: Definition of distance and
object size. Middle: Convergence analysis at very close
ranges. Convergence occurs if the 10 cm cube is more
than 1.2 cm away from the camera. Bottom: Number of
iterations for a wider range of distances.

Simulations  appear  to  show  similar  properties  with  2D 
images  in  a  3D  world.  In  these  simulations,  a  cube  is 
displaced  along  the  camera  optical  axis  (Figure  6).  One 
face is kept parallel to the image plane, because at the
shorter ranges being considered, the cube cannot be rotated
much without intersecting the image plane. The
distance  used  to  calculate  the  distance-to-object  size  ra¬ 
tio  in  the  plots  is  the  distance  from  the  center  of  pro¬ 
jection  to  the  cube.  Noise  of  ±  2  pixels  is  added  to  the 
perspective  projection.  For  a  cube  of  10  cm,  four  iter¬ 
ations  are  required  for  convergence  until  the  cube  is  30 
cm  from  the  center  of  projection.  The  number  gradu¬ 
ally  climbs  to  eight  iterations  as  the  cube  reaches  10  cm 
from  the  center  of  projection,  and  20  iterations  for  5  cm. 
Then  the  number  increases  sharply  to  100  iterations  for 
a  distance  of  2.8  cm  from  the  center  of  projection.  In 
reference to our prior 1D observations, at this position
the  images  of  the  close  corners  are  more  than  two  focal 
lengths  away  from  the  image  center,  but  the  images  of 
the  far  corners  are  only  half  a  focal  length  away  from  the 
image  center  and  probably  contribute  to  preserving  the 
convergence. 

Up  to  this  point  the  convergence  is  monotonic.  At 
still  closer  ranges  the  mode  of  convergence  changes  to 
a  nonmonotonic  mode,  in  which  SOP  images  of  succes¬ 
sive  steps  seem  subjected  to  somewhat  random  varia¬ 
tions  from  step  to  step  until  they  hit  close  to  the  final 
result  and  converge  rapidly.  The  number  of  iterations 
ranges from 20 to 60 in this mode, i.e. fewer than for
the worst monotonic case, with very different results for
small  variations  of  object  distance.  We  label  this  mode 
“chaotic  convergence”  in  Figure  6.  Finally,  when  the 
cube  gets  closer  than  1.2  cm  from  the  center  of  pro¬ 
jection,  the  differences  between  images  increase  rapidly 
and  the  algorithm  clearly  diverges.  Note,  however,  that 
in  order  to  see  the  close  corners  of  the  cube  at  this  range 
a  camera  would  require  a  total  field  of  more  than  150 
degrees,  i.e.  a  focal  length  of  less  than  1.5  mm  for  a 
10  mm  CCD  chip,  an  improbable  configuration.  This 
preliminary  convergence  analysis  and  the  experiments 
of  the  previous  section  show  that  the  POSIT  algorithm 
converges  in  a  few  iterations  for  most  camera  and  object 
configurations. 
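The field-of-view figures quoted above can be checked with the pinhole relation: a point at lateral offset h and depth d is seen at an angle atan(h/d) from the optical axis. The following is a minimal sketch, assuming the cube is centered on the optical axis so that the mid-edges of its near face lie 5 cm off-axis; the function name is illustrative and does not come from the paper.

```python
import math

def full_field_of_view_deg(half_extent, distance):
    """Full angle (degrees) needed to see a point `half_extent` off-axis at depth `distance`."""
    return 2.0 * math.degrees(math.atan2(half_extent, distance))

# Near face of a 10 cm cube, 1.2 cm from the center of projection
# (assumption: cube centered on the optical axis, so the offset is 5 cm).
fov_cube = full_field_of_view_deg(5.0, 1.2)

# A 10 mm CCD chip (half-width 5 mm) behind a lens of 1.5 mm focal length.
fov_ccd = full_field_of_view_deg(5.0, 1.5)

print(round(fov_cube), round(fov_ccd))
```

Under these assumptions the cube needs a field slightly above 150 degrees, while a 1.5 mm lens on a 10 mm chip gives slightly less, which is consistent with the statement that the focal length would have to be shorter than 1.5 mm.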

10  Conclusions 

We  have  presented  an  algorithm,  POSIT,  that  can  com¬ 
pute  the  pose  of  an  object  using  several  noncoplanar  fea¬ 
ture  points  of  the  object  at  a  low  computational  cost. 
The  algorithm  consists  of  computing  an  approximate 
pose  by  a  method  (POS)  based  on  scaled  orthographic 
projection and a least-squares technique, and replacing
the  original  image  with  an  image  obtained  by  scaled 
orthographic  projection  (SOP)  using  the  approximate 
depths  of  the  feature  points  given  by  this  pose.  Then 
POS  can  be  applied  to  this  image  to  produce  a  more  ac¬ 
curate  object  pose.  This  SOP-POS  cycle  is  repeated  un¬ 
til  the  improvements  become  smaller  than  the  precision 
required  by  the  application.  Convergence  tests  show  that 
the  algorithm  can  be  made  to  diverge  only  when  all  the 
image  points  are  taken  at  the  edges  of  a  very  wide  field 
of  view.  Very  few  iteration  steps  are  typically  required 
to  converge  to  an  accurate  pose.  We  have  characterized 
the  performance  of  the  POS  and  POSIT  algorithms  by 
applying  them  to  a  large  set  of  simulated  situations  with 
increasing amounts of random image perturbation. The
POSIT algorithm appears to remain stable and to degrade
gracefully with increasing image noise levels. Applications
of the POSIT algorithm include camera calibration,
real-time target tracking and object recognition.

Appendix A: A Mathematica program implementing POS and POSIT

(* Compute the pose of an object given a list of 2D image points, *)
(* a list of corresponding 3D object points, *)
(* and the pseudoinverse matrix for the list of object points. *)
(* The output has two elements: the pose computed by POS *)
(* and the pose computed by POSIT *)
(* once the SOP image points don't move from one step to the next. *)
(* The first point of the object point list is taken as reference point M0. *)
(* Copyright Daniel DeMenthon 1991 *)

POSIT[imagePoints_, objectPoints_, objectMatrix_, focalLength_] := Module[
  {objectVectors, imageVectors, IVect, JVect, ISquare, JSquare, IJ, imageDifference,
   row1, row2, row3, scale1, scale2, scale, oldSOPImagePoints, SOPImagePoints,
   translation, rotation, firstPose, count = 0, converged = False},
  objectVectors = (# - objectPoints[[1]])& /@ objectPoints;
  oldSOPImagePoints = imagePoints;

  (* loop until difference between 2 SOP images is less than one pixel *)
  While[! converged,
    If[count == 0,
      (* we get image vectors from image of reference point for POS: *)
      imageVectors = (# - imagePoints[[1]])& /@ imagePoints,
      (* else count > 0, we compute a SOP image first for POSIT: *)
      SOPImagePoints = imagePoints (1 + (objectVectors.row3)/translation[[3]]);
      imageDifference = Apply[Plus, Abs[Round[Flatten[SOPImagePoints]] -
                                        Round[Flatten[oldSOPImagePoints]]]];
      oldSOPImagePoints = SOPImagePoints;
      imageVectors = (# - SOPImagePoints[[1]])& /@ SOPImagePoints
    ]; (* end else count > 0 *)

    {IVect, JVect} = Transpose[objectMatrix . imageVectors];
    ISquare = IVect.IVect; JSquare = JVect.JVect; IJ = IVect.JVect;
    {scale1, scale2} = Sqrt[{ISquare, JSquare}];
    {row1, row2} = {IVect/scale1, JVect/scale2};
    row3 = RotateLeft[row1] RotateRight[row2] - RotateLeft[row2] RotateRight[row1];
    rotation = {row1, row2, row3};
    scale = (scale1 + scale2)/2.0; (* scaling factor in SOP *)
    translation = Append[imagePoints[[1]], focalLength]/scale;

    If[count == 0, firstPose = {rotation, translation}];
    converged = (count > 0) && (imageDifference < 1);
    count++
  ]; (* End While *)

  Return[{firstPose, {rotation, translation}}]]
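For readers without Mathematica, the same POS/POSIT loop can be sketched in plain Python. This is an illustrative port under stated assumptions, not the author's code: the helper `pseudo_inverse`, the `max_steps` guard, and the synthetic test pose `t_true` are additions made here, and the pseudoinverse is computed from the normal equations, which assumes the object points are noncoplanar.

```python
import math

def pseudo_inverse(A):
    """Pseudoinverse (A^T A)^-1 A^T of an n x 3 matrix with full column rank."""
    N = [[sum(r[i] * r[j] for r in A) for j in range(3)] for i in range(3)]
    (a, b, c), (d, e, f), (g, h, k) = N
    det = a * (e * k - f * h) - b * (d * k - f * g) + c * (d * h - e * g)
    inv = [[(e * k - f * h) / det, (c * h - b * k) / det, (b * f - c * e) / det],
           [(f * g - d * k) / det, (a * k - c * g) / det, (c * d - a * f) / det],
           [(d * h - e * g) / det, (b * g - a * h) / det, (a * e - b * d) / det]]
    return [[sum(inv[i][j] * r[j] for j in range(3)) for r in A] for i in range(3)]

def posit(image_points, object_points, focal_length, max_steps=100):
    """Return (POS pose, POSIT pose); a pose is (rotation rows, translation)."""
    n = len(object_points)
    m0 = object_points[0]                      # reference point M0
    obj_vecs = [[p[i] - m0[i] for i in range(3)] for p in object_points]
    B = pseudo_inverse(obj_vecs)               # 3 x n object matrix
    x0, y0 = image_points[0]
    old_sop = [tuple(p) for p in image_points]
    first_pose = pose = None
    for count in range(max_steps):
        if count == 0:
            sop = old_sop                      # POS: use the image as it is
        else:
            (_, _, row3), (_, _, tz) = pose    # SOP image from the previous pose
            eps = [sum(v[i] * row3[i] for i in range(3)) / tz for v in obj_vecs]
            sop = [(x * (1 + e), y * (1 + e)) for (x, y), e in zip(image_points, eps)]
            diff = sum(abs(round(u) - round(w))
                       for p, q in zip(sop, old_sop) for u, w in zip(p, q))
            old_sop = sop
        xs = [p[0] - sop[0][0] for p in sop]
        ys = [p[1] - sop[0][1] for p in sop]
        I = [sum(B[i][m] * xs[m] for m in range(n)) for i in range(3)]
        J = [sum(B[i][m] * ys[m] for m in range(n)) for i in range(3)]
        s1 = math.sqrt(sum(v * v for v in I))
        s2 = math.sqrt(sum(v * v for v in J))
        r1 = [v / s1 for v in I]
        r2 = [v / s2 for v in J]
        r3 = [r1[1] * r2[2] - r1[2] * r2[1],   # cross product r1 x r2
              r1[2] * r2[0] - r1[0] * r2[2],
              r1[0] * r2[1] - r1[1] * r2[0]]
        scale = (s1 + s2) / 2.0                # scaling factor of the SOP
        pose = ((r1, r2, r3), (x0 / scale, y0 / scale, focal_length / scale))
        if count == 0:
            first_pose = pose
        elif diff < 1:                         # SOP image stable to within a pixel
            break
    return first_pose, pose

# Synthetic check: a 10-unit cube seen by a camera of focal length 760 pixels,
# with an assumed true translation and no rotation.
cube = [(0, 0, 0), (10, 0, 0), (10, 10, 0), (0, 10, 0),
        (0, 0, 10), (10, 0, 10), (10, 10, 10), (0, 10, 10)]
f = 760.0
t_true = (5.0, -3.0, 100.0)
image = [(f * (X + t_true[0]) / (Z + t_true[2]),
          f * (Y + t_true[1]) / (Z + t_true[2])) for X, Y, Z in cube]
pos_pose, posit_pose = posit(image, cube, f)
```

On this noiseless example the first element of the result is the one-shot POS estimate and the second is the refined POSIT estimate; comparing their translations against `t_true` shows the effect of the iteration.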

Example of input

fLength = 760;

cube = {{0,0,0},{10,0,0},{10,10,0},{0,10,0},{0,0,10},
        {10,0,10},{10,10,10},{0,10,10}};

Abstract

Bayesian  methods  provide  a  theoretically  pleasing  ap¬ 
proach  for  optimal  decision  making.  However,  while 
it  is  often  easy  to  specify  a  model  for  a  given  deci¬ 
sion  problem,  it  is  much  more  difficult  to  gather  the 
necessary  statistics  to  quantify  the  model  and  imple¬ 
ment  a  decision  procedure  with  reasonable  response 
time.  In  robotics  applications,  decision  models  are 
complicated by having to deal with the temporal, spatial,
and perceptual aspects of the problem. We examine
a number of representational techniques, exploring
the issues that arise with regard to quantification
and efficient implementation. In particular, we
consider discrete and continuous temporal representations,
volumetric and location-based spatial representations,
and time-separable and finite-horizon performance
measures. The techniques considered provide
the  designer  of  robot  decision  procedures  with  alterna¬ 
tives  in  coping  with  the  complexity  that  arises  in  realis¬ 
tic  problems;  alternatives  that  can  be  justified  in  terms 
of  tradeoffs  between  modeling  accuracy  and  computa¬ 
tional  complexity. 

Introduction 

Bayesian  techniques  have  been  developed  for  sensor  fu¬ 
sion  [Moravec  and  Elfes,  1985],  value  of  information 
calculations [Dean and Wellman, 1991], state estimation
[Durrant-Whyte, 1988; Hager, 1990], and motion
planning  under  uncertainty.  However,  in  focusing  on 
component  problems,  computational  and  knowledge  ac¬ 
quisition  issues  are  often  ignored  or  at  least  discounted. 
In  this  paper,  we  explore  some  of  the  issues  that  arise  in 
quantifying  (gathering  statistics  for)  Bayesian  decision 
models  and  designing  efficient  inference  algorithms. 

*This work was supported in part by a National Science
Foundation Presidential Young Investigator Award
IRI-8957601, by the Advanced Research Projects Agency of
the Department of Defense monitored by the Air Force under
Contract No. F30602-91-C-0041, and by the National
Science Foundation in conjunction with the Advanced Research
Projects Agency of the Department of Defense under
Contract No. IRI-8905436.


Preliminaries 

A probabilistic network is represented as a directed
graph G = (V, E). A vertex v is said to be an immediate
predecessor of a vertex v' just in case there is
an  edge  from  v  to  v'  in  E.  The  vertices  in  V  correspond 
to  random  variables  called  chance  nodes.  The  edges  in 
E  define  the  causal  and  informational  dependencies  be¬ 
tween  the  random  variables.  Chance  nodes  are  discrete 
or real valued variables that encode states of knowledge
about the world. Let $\Omega_C$ be the set of values of
a chance node C. There is a probability distribution
$\Pr(C = \omega)$, $\omega \in \Omega_C$, for each node. If the chance node
has no predecessors then this is its marginal probability
distribution;  otherwise,  it  is  a  conditional  probability 
distribution  dependent  on  the  states  of  the  immediate 
predecessors  of  C  in  G.  Pearl  [1988]  provides  a  compre¬ 
hensive  treatment  of  probabilistic  networks. 

In  decision  making  tasks,  nodes  in  the  graph  are  of¬ 
ten  instantiated  by  fixing  the  value  of  the  corresponding 
random variable. Instantiation might correspond to obtaining
evidence or performing some sort of hypothetical
reasoning. The primary operation on probabilistic
networks  that  we  are  concerned  with  involves  comput¬ 
ing  the  posterior  distribution  of  each  node  given  the 
instantiated  nodes. 

Markov  processes  provide  a  simple,  well-understood 
method  of  representing  change  over  time.  To  represent 
Markov processes, we introduce a special case of probabilistic
networks [Dean and Kanazawa, 1989]. Given
a set of state variables, X, and a finite ordered set of
states or time points, T, we construct a set of chance
nodes, $C = X \times T$, where each element of C corresponds
to the value of some particular $x \in X$ at some $t \in T$.
Let $C_t$ correspond to the subset of C restricted to t. A
process model is said to be k-Markov just in case

$$\Pr(C_t \mid C_{t-1}, C_{t-2}, \ldots) = \Pr(C_t \mid C_{t-1}, \ldots, C_{t-k}).$$

For  decision  purposes,  we  are  generally  interested  in 
the  outcomes  resulting  from  particular  sequences  of  ac¬ 
tions.  To  represent  action  sequences,  we  add  state  vari¬ 
ables  corresponding  to  the  actions  taken  in  each  state. 
Outcomes correspond to possible instantiations of the
state variables. We employ a value function to determine
the value of outcomes. In the case of probabilistic
outcomes,  we  use  the  expectation  of  value  over  all  out¬ 
comes  as  a  measure  to  guide  decision  making.  To  eval¬ 
uate  a  particular  action  sequence,  we  instantiate  the 
action  state  variables  using  the  actions  in  the  given  se¬ 
quence,  compute  the  posterior  distribution  given  the  in¬ 
stantiated  variables,  and  determine  the  expected  value 
using  the  value  function  and  the  posterior  distribution. 

In  general,  the  value  function  ranges  over  the  prod¬ 
uct  of  the  set  of  values  for  all  of  the  random  variables, 
complicating  decision  making  by  requiring  that  we  enu¬ 
merate  this  very  large  space  of  outcomes.  In  practice, 
we often employ a simpler, time-separable value function.
By time separable, we mean that the total value
is a (perhaps weighted) sum of the value at the different
time points. If $V_t$ is the value function at time t, then
the total value, V, is defined as

$$V = \sum_{t \in T} \gamma(t)\, V_t,$$

where $\gamma : T \to \{x \mid 0 \le x \le 1\}$ is a function of time used
to weight future consequences.
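The evaluation procedure described above (instantiate the action variables, propagate the state distribution, take the expected value at each time point, and form the weighted sum) can be sketched for a small discrete 1-Markov model. Everything in this example is hypothetical: the two-state model, the transition tables, the state values, and the weighting function are chosen only for illustration.

```python
def propagate(belief, transition):
    """One Markov step: new_belief[j] = sum_i belief[i] * transition[i][j]."""
    n = len(transition[0])
    return [sum(belief[i] * transition[i][j] for i in range(len(belief)))
            for j in range(n)]

def evaluate_sequence(belief, actions, transitions, state_value, weight):
    """Time-separable value: sum over t of weight(t) * E[state_value at t]."""
    total = 0.0
    for t, action in enumerate(actions):
        belief = propagate(belief, transitions[action])   # instantiate the action
        expected = sum(p * v for p, v in zip(belief, state_value))
        total += weight(t) * expected
    return total

# Hypothetical model: state 0 = far from the target, state 1 = near it.
transitions = {
    "stay": [[1.0, 0.0], [0.0, 1.0]],
    "move": [[0.2, 0.8], [0.0, 1.0]],
}
state_value = [0.0, 1.0]            # value is earned only when near the target
weight = lambda t: 0.9 ** t         # plays the role of the weighting function

v_move = evaluate_sequence([1.0, 0.0], ["move", "move"], transitions, state_value, weight)
v_stay = evaluate_sequence([1.0, 0.0], ["stay", "move"], transitions, state_value, weight)
```

Because the value is a sum over time points, each action sequence is scored with one forward pass over the time slices rather than by enumerating joint outcomes.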

To  represent  variable  intervals  of  time  separating 
states,  we  encode  duration  information  as  real-valued 
state  variables.  Other  continuous  variables  can  be  in¬ 
troduced  to  handle  resources  like  fuel  or  rations,  or  fac¬ 
tors  like  precision  or  machine  downtime.  As  a  conse¬ 
quence  of  our  employing  continuous  chance  variables, 
we  are  forced  to  employ  sampling  methods  [Peot  and 
Shachter, 1991] as the principal method of estimating
the  posterior  distribution  for  our  networks. 
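As an illustration of sampling with a continuous chance variable, the sketch below uses likelihood weighting, one common sampling scheme; the text does not say which scheme the authors used. The model is hypothetical: an action duration T is uniform on an interval, the probability of remaining undetected for a duration T is assumed to be exp(-rate * T), and the query is the posterior probability that the duration exceeded a threshold given that the robot was not detected.

```python
import math
import random

def posterior_long_duration(rate, lo, hi, threshold, n_samples, rng):
    """Estimate Pr(T > threshold | not detected) by likelihood weighting."""
    weighted_hits = 0.0
    total_weight = 0.0
    for _ in range(n_samples):
        t = rng.uniform(lo, hi)        # sample the continuous duration from its prior
        w = math.exp(-rate * t)        # likelihood of the evidence "not detected"
        total_weight += w
        if t > threshold:
            weighted_hits += w
    return weighted_hits / total_weight

rng = random.Random(0)                 # fixed seed so the run is repeatable
estimate = posterior_long_duration(0.5, 1.0, 5.0, 3.0, 20000, rng)

# Closed form for this toy model, available only because it is so simple.
exact = (math.exp(-1.5) - math.exp(-2.5)) / (math.exp(-0.5) - math.exp(-2.5))
```

For networks with many continuous nodes no closed form is available, and the sampled estimate is all one has; its accuracy improves with the number of samples.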

Robot  Decision  Making 
In the mobile target localization (MTL) problem, a
robot  is  given  the  task  of  tracking  and  periodically  re¬ 
porting  on  the  position  of  a  mobile  target.  We  repre¬ 
sent  the  dynamics  in  terms  of  a  Markov  chain  with  state 
variables  corresponding  to  the  location  of  the  robot,  the 
location  of  the  target,  the  observations  of  the  robot,  the 
movement  action,  and  the  report  action.  Movement  ac¬ 
tions  correspond  to  invoking  particular  navigation  pro¬ 
cedures.  In  order  that  the  interval  of  time  separating 
successive  time  points  be  the  same,  we  assume  that  the 
navigation  procedures  run  for  a  set  amount  of  time. 
This  reliance  on  fixed  duration  actions  is  somewhat  un¬ 
natural  and  complicates  the  process  of  quantifying  the 
model.  Performance  is  measured  in  terms  of  a  time- 
separable  value  function  for  which  the  value  at  time  t 
is the accuracy of the report issued at t.

In  the  Fixed  Target  Reconnaissance  (FTR)  problem, 
the  robot  is  given  a  task  involving  covert  search  and  sur¬ 
vey;  the  robot  is  sent  out  to  find  and  report  on  the  loca¬ 
tion  of  some  set  of  fixed  targets,  while  at  the  same  time 
avoiding  detection  by  some  set  of  either  fixed  or  mo¬ 
bile  observers.  Avoiding  detection  involves  preventing 
line-of-sight  contact  with  observers.  Potential  threats  of 


detection  might  be  determined  beforehand  in  the  case 
of  known  intervals  coinciding  with  satellites  being  posi¬ 
tioned  overhead  or  ground-based  observers  on  periodic 
inspection  tours.  Alternatively,  the  robot  might  employ 
fixed-location  sensors  or  deploy  relocatable  sensors  to 
detect  and  supply  advance  warning  with  regard  to  less 
predictable  observers. 

The  FTR  problem  introduces  some  interesting  rep¬ 
resentation  issues.  For  instance,  it  is  more  difficult  to 
accept  that  all  actions  take  the  same  amount  of  time. 
Other issues involve tradeoffs regarding detection, target
localization, and the time at which the robot reports
on  the  location  of  the  target. 

Fixed  Target  Reconnaissance 

In  order  to  illustrate  some  of  the  issues  involved  in  rep¬ 
resenting  robot  decision  problems,  we  consider  the  FTR 
problem  in  more  detail.  First,  we  present  a  probabilis¬ 
tic  network  that  represents  the  FTR  problem  assum¬ 
ing  that  all  the  actions  take  the  same  amount  of  time; 
we  provide  a  time-separable  value  model  for  a  partic¬ 
ular  instance  of  the  FTR  problem.  Second,  we  look 
at  the  issues  that  arise  when  we  relax  the  restriction 
on  actions  taking  a  fixed  amount  of  time;  we  provide 
another  value  model  that  approximates  our  notion  of 
performance  in  this  case  and  examine  the  conditions 
under  which  this  approximation  is  reasonable.  We  then 
look  at  some  of  the  issues  involved  in  incorporating  evi¬ 
dence  gathered  from  the  environment,  paying  attention 
to  the  fact  that  the  computational  expense  of  evalu¬ 
ating  these  probabilistic  networks  grows  too  quickly  if 
they  are  used  to  represent  all  the  information  available 
to  us;  we  suggest  an  approach  that  allows  us  to  take 
much  of  the  information  into  account  without  severely 
degrading  performance. 

Modeling  the  FTR  Decision  Problem 

Figure  1  shows  a  probabilistic  network  representing  the 
FTR  problem.  Shaded  nodes  represent  direct  mea¬ 
surements.  Square  nodes  represent  actions,  diamond¬ 
shaped  nodes  represent  value  nodes  used  to  encode  a 
performance  measure  for  the  underlying  decision  prob¬ 
lem,  and  unshaded  circular  nodes  represent  states  of 
the world that cannot be observed directly.¹

In each time slice, there are 8 state variables. $O_r$
ranges over observations made about the robot’s location,
$O_t$ over the observations made about the target’s
location. $L_r$ ranges over the possible locations of the
robot, and $L_t$ over the locations of the target. D is a
boolean variable indicating whether or not the robot is
detected. A ranges over possible actions the robot will
take next, and V is a real-valued variable. $O_o$ ranges

¹We employ graphical conventions typically used in displaying
influence diagrams [Howard and Matheson, 1984], a
generalization of probabilistic networks.

Figure 1: FTR model with constant time intervals


over  the  observations  made  by  the  robot  regarding  ob¬ 
servers  wishing  to  detect  the  robot’s  presence. 

In the network shown in Figure 1, looking at the time
slices after the first, the arc into node $O_r$ indicates that
the observations made about the robot’s whereabouts
depend on where it is. The arcs into node $L_r$ indicate
that the location of the robot depends on what action
it took previously and where it was one time step ago.
Those entering node $O_t$ indicate that the observations
made about the location of the target depend on where
the target is and where the robot is. Those entering
node $L_t$ indicate that the location of the target depends
on where it was before (it does not move). Finally, the
arcs into node D specify that whether the robot has
been detected depends on whether it was detected before,
what action it took (one of the actions may be to
hide), and whether there appear to be any observers
around. The value node indicates the value of a situation
described in terms of the target’s location and
whether or not the robot is detected.
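The dependencies listed in this paragraph can be written down as a parent table for one time slice and unrolled over several slices. This is a sketch based only on the prose, since the figure itself is not reproduced here: arcs not mentioned in the text are omitted, the label `O_o` for the node holding observations about observers is our own, and the action influencing D is assumed to be the one taken in the previous slice.

```python
# Parents of each node in time slice t; the "prev:" prefix marks a node in slice t-1.
FTR_PARENTS = {
    "O_r": ["L_r"],                     # robot observations depend on robot location
    "L_r": ["prev:L_r", "prev:A"],      # robot location depends on last location and action
    "O_t": ["L_t", "L_r"],              # target observations depend on both locations
    "L_t": ["prev:L_t"],                # the target does not move
    "D":   ["prev:D", "prev:A", "O_o"], # detection history, action, observer sightings
    "V":   ["L_t", "D"],                # value depends on target location and detection
}

def unroll(parents, n_slices):
    """Return the directed edges of the network unrolled over n_slices time slices."""
    edges = []
    for t in range(n_slices):
        for child, node_parents in parents.items():
            for p in node_parents:
                if p.startswith("prev:"):
                    if t > 0:           # the first slice has no predecessor slice
                        edges.append(((p[5:], t - 1), (child, t)))
                else:
                    edges.append(((p, t), (child, t)))
    return edges

edges = unroll(FTR_PARENTS, 3)
```

Changing the spaces over which the variables range leaves this table untouched; only the conditional probability tables attached to each node change.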

The  topology  of  this  network  represents  the  problem 
at  a  relatively  high  level  —  the  elements  of  interest  to 
us,  such  as  the  location  of  the  target  and  the  robot, 
whether  the  robot  has  been  detected  etc.  By  chang¬ 
ing  the  spaces  over  which  the  random  variables  range, 
the  same  network  can  be  used  for  robots  moving  inside 
buildings,  in  an  urban  environment,  or  in  the  country¬ 
side.  The  quantification  of  the  network  —  the  spec¬ 
ification  of  the  conditional  probability  tables  for  each 
node  —  will  of  course  vary  depending  on  the  spaces  in¬ 
volved.  In  the  following,  suppose  that  the  network  is 
being  used  to  direct  the  search  for  an  object  within  a 
room,  or  small  number  of  rooms;  the  actions  allow  the 
robot  to  move  from  one  part  of  a  room  to  another,  to 
extract  and  interpret  camera  images,  and  to  gather  and 
analyze  sonar  data. 


Maximizing  Expected  Value 
The  general  description  of  the  problem  provided  so  far 
is  not  sufficient  to  unambiguously  specify  a  performance 
measure. In the following, we consider a more detailed
specification  of  the  problem,  and  examine  how  our  spec¬ 
ification  affects  our  representation  of  performance. 

Suppose  that  the  robot  is  part  of  an  advance  unit 
that  has  infiltrated  an  enemy  installation  in  which  it 
is  known  that  there  is  a  missile  silo.  A  tank  unit  is 
approaching  the  area,  and  the  robot  has  some  distri¬ 
bution,  Pr(Q  =  t),  for  estimating  its  time  of  arrival. 
The  robot  must  locate  the  silo,  and  report  the  location 
to  the  tank  unit  when  it  arrives,  at  time  Q  (Query). 
The  only  penalty  associated  with  being  discovered  is 
that  the  robot  will  not  be  able  to  carry  on  its  search 
of the area. There is a positive reward $V_g$ (good) for
giving the correct answer when the tank unit arrives,
and a negative reward $V_b$ (bad) for giving an incorrect
answer. 

We  measure  performance  for  this  version  of  the  prob¬ 
lem  as  follows.  The  value  of  a  sequence  of  actions  is  the 
sum  of  the  values  for  each  time  slice,  weighted  by  the 
likelihood  that  the  tank  unit  will  arrive  between  that 
time  slice  and  the  next.  The  value  of  a  sequence  of  ac¬ 
tions  for  a  particular  time  is  the  expected  reward  for 
giving  the  best  response  we  can  at  that  time.  Since 
actions  enable  observations,  we  have  to  account  for  the 
value  of  the  information  gained  through  observation. 
The probability of giving a correct response for a particular
observation given an action sequence is

$$PC_t(o \mid s) = \max_{\omega \in \Omega_{L_t}} \Pr(\langle L_t = \omega, t \rangle \mid \langle O = o, t \rangle, s),$$

where $o \in \Omega_O$ and $s$ is an action sequence.² The expected
value of choosing the correct location if we perform
this action sequence and make this observation is

$$V_t(o \mid s) = V_g\, PC_t(o \mid s) + V_b\, (1 - PC_t(o \mid s)).$$

We  assume  that  the  actions  and  observations  are  inde¬ 
pendent  of  the  time  at  which  we  are  asked  to  report  on 
the  location  of  the  target. 

We define the time-separable value function capturing
the expected value of executing the sequence $s$ if
we are requested to report the location of the target at
time t as

$$V_t(s) = \sum_{o \in \Omega_O} \Pr(\langle O = o, t \rangle \mid s)\, V_t(o \mid s).$$

The total value V is the value of performing the sequence
at each of the next n steps in time, weighted by
the likelihood that it is at that time step that we are
asked to make the report,

$$V(s) = \sum_{t \in T} \Pr(t < Q < t + 1)\, V_t(s).$$

²$\langle X = x, t \rangle$ is read as “state variable X has value x at
time t”.

This sum does not account for there being a finite horizon,
but this omission can be easily remedied by adding
another term for the value of the last location estimate
weighted by the probability that the request for a report
comes after the last time point accounted for in
the model, $\Pr(Q > \max(t))\, V_{\max(t)}(s)$. Because the
above value function is time-separable, the computation
is substantially less expensive than it would be otherwise.
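The value computation of this section, including the extra finite-horizon term, can be sketched numerically as follows. The function names, the data layout, and all the numbers are assumptions made for this example: each time slice is given as a list of pairs holding the probability of an observation under the action sequence and the posterior over target locations given that observation.

```python
def prob_correct(posterior):
    """Probability that the most likely location is the right one (the max over locations)."""
    return max(posterior.values())

def value_given_observation(posterior, v_good, v_bad):
    """Reward for a correct report times its probability, plus the penalty otherwise."""
    pc = prob_correct(posterior)
    return v_good * pc + v_bad * (1.0 - pc)

def slice_value(obs_model, v_good, v_bad):
    """Expectation over observations of the value of the best report at one time slice."""
    return sum(p_obs * value_given_observation(post, v_good, v_bad)
               for p_obs, post in obs_model)

def total_value(slices, arrival_probs, tail_prob, v_good, v_bad):
    """Weight each slice by the chance the query arrives then; add the horizon term."""
    values = [slice_value(m, v_good, v_bad) for m in slices]
    return sum(p * v for p, v in zip(arrival_probs, values)) + tail_prob * values[-1]

# Hypothetical two-slice problem with two candidate locations "a" and "b".
slices = [
    [(1.0, {"a": 0.5, "b": 0.5})],                 # slice 0: nothing learned yet
    [(0.6, {"a": 0.9, "b": 0.1}),                  # slice 1: two possible observations
     (0.4, {"a": 0.2, "b": 0.8})],
]
V = total_value(slices, arrival_probs=[0.3, 0.5], tail_prob=0.2,
                v_good=10.0, v_bad=-5.0)
```

With these numbers the slice values are 2.5 and 7.9, so the total is 0.3 * 2.5 + 0.5 * 7.9 + 0.2 * 7.9 = 6.28; the last product is the finite-horizon term.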

Notice  that  we  have  made  several  simplifying  as¬ 
sumptions  in  order  to  obtain  a  time-separable  function. 
The  most  important  of  these  is  that,  given  the  current 
observations and those at some time t in the future, we
can  estimate  the  distributions  for  the  nodes  at  time  t 
without  enumerating  all  possible  sequences  of  observa¬ 
tions  between  now  and  t.  Another  major  assumption  we 
have  made  is  that  the  time  between  slices  is  constant 
and  independent  of  the  action  taken.  This  assumption 
is  not  usually  reasonable;  different  actions  take  differ¬ 
ent  amounts  of  time,  and,  more  importantly,  the  same 
action  may  take  a  different  amount  of  time  depending 
on  the  circumstances  in  which  it  is  invoked. 

In  order  to  take  variable  length  actions  into  account, 
we  introduce  an  extra,  continuous-valued,  node  at  each 
time  slice,  representing  the  total  elapsed  time  since  the 
beginning of execution. The network in Figure 2 includes
these nodes. Now, whether or not we are detected
depends on the duration of the previous action
(the  difference  between  the  time  of  the  start  and  the 
time  of  the  end  of  the  action),  and  the  time  an  action 
finishes  depends  on  the  time  it  starts  and  what  action 
it  was.  Here,  the  time  is  an  observation  we  can  make 
directly;  once  an  action  is  completed,  we  know  what  the 
current  time  is.  Notice  that  in  general,  the  time  will  in¬ 
fluence  states  of  the  world.  In  this  particular  case,  the 
location  of  the  target  is  fixed,  so  it  does  not  vary  with 
time,  and  we  assume  that  the  location  of  the  robot  is 
independent  of  time  given  the  action  performed. 

The  value  function  in  this  case  is  not  much  more 
complicated  than  it  was  before;  the  important  point 
to  note  is  that  the  time  is  one  of  the  observations,  and 
for  each  possible  individual  observation,  we  can  com¬ 
pute  the  value  at  that  time,  weighted  by  the  likelihood 
of  being  requested  to  report  the  location  of  the  target 
at  that  time.  So  the  expression  Pr(Q  =  t)  needs  to  be 
factored  into  the  computation  as 

$$V_t(o \mid s) = \Pr(T_t < Q < T_{t+1})\, \bigl(V_g\, PC_t(o \mid s) + V_b\, (1 - PC_t(o \mid s))\bigr).$$

Note that since $T_t$ is a part of the observations, it is
fixed at this point. Notice also the distinction between
the subscript t, which denotes the tth time slice, and
the value of $T_t$, which is the total time elapsed since the
beginning of the execution. For different values of o and
the same value of t, $T_t$ may be different. The probability
that the query will be made between time slice t and
time slice t + 1 clearly depends on the duration of the

Figure 2: FTR model with variable time intervals

action  made  at  time  t.  In  general,  this  term  must  be 
computed  as  an  average  of  the  possible  durations  of  the 
action;  this  is  correct  only  if  the  expected  value  of  the 
situation  resulting  from  the  action  is  independent  of  the 
duration  of  the  action.  Although  this  is  unlikely  to  be 
true,  it  is  a  reasonable  approximation  in  the  case  where 
actions  take  a  short  time  relative  to  the  total  expected 
duration  of  the  exercise;  in  this  case,  the  value  will  not 
usually  vary  enough  between  the  beginning  of  an  action 
and  the  end  to  make  this  overly  inaccurate.  Under  these 
assumptions  the  total  value  function  is  still  separable, 
but  the  assumptions  are  rather  stronger. 
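The averaging over action durations mentioned above can be sketched as follows. The exponential arrival-time distribution for the query and the two-point duration distribution are assumptions introduced for this example only; the text does not specify either.

```python
import math

def arrival_cdf(t, mean_arrival):
    """Assumed exponential arrival time for the query: Pr(Q <= t)."""
    return 1.0 - math.exp(-t / mean_arrival)

def query_window_prob(start, durations, mean_arrival):
    """Average of Pr(start < Q < start + d) over (probability, duration) pairs."""
    return sum(p * (arrival_cdf(start + d, mean_arrival)
                    - arrival_cdf(start, mean_arrival))
               for p, d in durations)

# An action started at elapsed time 0 that takes 1 or 3 time units with equal probability.
window_prob = query_window_prob(0.0, [(0.5, 1.0), (0.5, 3.0)], mean_arrival=10.0)
```

Using this averaged weight in place of the fixed-interval weight is the approximation discussed in the text; it is reasonable when individual actions are short compared with the expected time until the query.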

Consider an alternative specification of the problem:
again, the robot is required to locate the missile silo, but
now  it  must  choose  when  to  report  and  it  is  allowed  to 
report  exactly  once.  Moreover,  there  is  an  additional 
consequence  of  being  detected;  the  robot  cannot  report 
on  the  location  of  the  target  if  it  has  been  detected, 
adding  an  incentive  not  to  delay  in  making  a  report.  As 
far  as  we  can  tell,  any  reasonable  value  model  for  this 
problem  will  not  admit  a  time-separable  value  function. 

Methods  for  Reducing  Complexity 
Bayesian  networks  provide  a  concise  and  easily- 
formulated  representation  of  state  and  causality  in 
many  domains.  However,  the  complexity  of  evaluation 
of these networks makes it unrealistic to use them unless
the size of the network and number of states for
each  node  are  fairly  small.  Some  information  about  the 
world  cannot  be  easily  represented  in  this  form  without 
a  large  number  of  nodes  and/or  large  state  spaces,  yet 
an  approximation  of  this  information  can  often  be  used 
by  dynamically  modifying  the  conditional  probabilities 
represented  in  a  network. 

This  kind  of  situation  arises  in  the  FTR  problem  de¬ 
scribed  above.  Consider  the  case  where  the  robot  must 
explore a room, looking for a particular kind of object.
We assume that one of the sensors available to the robot
allows  it  to  gather  information  about  the  presence  of  the 
object;  for  example,  a  vision  system  might  be  used  to 
provide  a  degree  of  belief  about  the  target  being  present 
in  a  particular  image.  We  assume  also  that  the  robot 
has  collision  sensors,  for  example  sonar  transducers  that 
allow  it  to  form  a  coarse  map  of  the  room  in  order  to 
move  around  without  colliding  with  furniture. 

If  the  layout  of  the  room  is  known  in  advance,  along 
with the location of the furniture and anything else in
the  room,  then  we  can  construct  a  network  that  allows 
us  to  represent  the  location  of  the  robot,  that  of  the 
target,  and  the  observations  about  the  target.  Given 
enough  knowledge  about  the  contents  of  the  room,  we 
can  estimate  the  conditional  probabilities  for  the  nodes 
in  this  network. 

More  realistically,  we  will  have  little  or  no  advance 
knowledge  of  the  contents  of  the  room,  and  quantifying 
the  network  will  be  impossible  to  do  accurately.  In 
particular,  given  a  location  for  the  robot,  and  a  location 
for  the  target,  we  need  to  specify  the  probabilities  of 
the  different  possible  observations  the  robot  will  make 
of  the  target.  Clearly  this  depends  heavily  on  what  is 
between  the  target  and  the  robot. 

However,  once  the  robot  is  in  the  room  making  the 
observations  that  it  needs  to  make  in  order  to  move 
about,  extra  evidence  is  accumulating  that  could  be 
used  to  refine  these  conditional  probabilities.  There 
are  two  ways  to  exploit  this  accumulating  information. 
One  is  to  explicitly  represent  our  knowledge  of  the  con¬ 
tents  of  the  room  in  the  network.  As  we  shall  see,  this 
is  likely  to  be  prohibitively  expensive  from  a  compu¬ 
tational  standpoint.  The  other  is  to  specify  the  con¬ 
ditional  probabilities  in  the  network  as  a  function  of 
our  knowledge  about  the  room,  and  to  update  these 
conditional  probabilities  as  we  gather  more  information 
about  the  room.  This  does  not  allow  us  to  exploit  the 
information  as  fully  as  if  it  were  explicitly  quantified, 
but  it  provides  us  with  a  good  approximation  at  very 
little  computational  cost. 
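One way to make a conditional probability a function of what is currently known about the room is sketched below. This is an assumed model, not the authors' formula: the cells between robot and target are treated as independent, the target is detected at one rate when the line of sight is clear and at a false-alarm rate otherwise, and all names and numbers are hypothetical.

```python
def visibility(cells_between, occupancy):
    """Probability that no cell between robot and target holds an occluder."""
    p_clear = 1.0
    for cell in cells_between:
        p_clear *= 1.0 - occupancy[cell]
    return p_clear

def detection_probability(cells_between, occupancy, p_detect_clear, p_false_alarm):
    """Pr(target observed | robot location, target location) under the current grid."""
    v = visibility(cells_between, occupancy)
    return v * p_detect_clear + (1.0 - v) * p_false_alarm

# Current beliefs about the two cells lying between the robot and a candidate target location.
occupancy = {(0, 1): 0.5, (1, 1): 0.2}
line_of_sight = [(0, 1), (1, 1)]
before = detection_probability(line_of_sight, occupancy, 0.9, 0.05)

# New sonar evidence suggests cell (0, 1) is probably empty; recompute the table entry.
occupancy[(0, 1)] = 0.1
after = detection_probability(line_of_sight, occupancy, 0.9, 0.05)
```

The network keeps its size; only the entries of the observation node's conditional probability table are recomputed as the beliefs about the room change.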

A  common  technique  for  gathering  evidence  about 
obstacles  in  an  open  area  such  as  a  room  is  to  use  a 
volumetric  spatial  representation  such  as  an  occupancy 
grid  [Moravec  and  Elfes,  1985].  The  area  is  divided  into 
(usually equally sized) grid elements, and a probability
that each element contains an obstacle is maintained.
As  evidence  from  sensors  is  gathered,  it  is  used  to  up¬ 
date  these  probabilities. 
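A minimal sketch of such an update, assuming a binary sensor reading per cell and a simple symmetric sensor model (the likelihood values 0.8 and 0.2 are placeholders, not the sonar model of Moravec and Elfes):

```python
def update_cell(prior, reading_hit, p_hit_given_occ=0.8, p_hit_given_free=0.2):
    """Bayes update of Pr(cell occupied) from one binary sensor reading."""
    if reading_hit:
        like_occ, like_free = p_hit_given_occ, p_hit_given_free
    else:
        like_occ, like_free = 1.0 - p_hit_given_occ, 1.0 - p_hit_given_free
    numerator = like_occ * prior
    return numerator / (numerator + like_free * (1.0 - prior))

# A 2 by 2 grid with an uninformative prior in every cell.
grid = {(r, c): 0.5 for r in range(2) for c in range(2)}

# Two readings suggesting cell (0, 0) is occupied, one suggesting (1, 1) is free.
for cell, hit in [((0, 0), True), ((0, 0), True), ((1, 1), False)]:
    grid[cell] = update_cell(grid[cell], hit)
```

Each cell is updated independently of the others, which keeps the cost of absorbing a reading constant per cell.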

It  is  possible  to  represent  the  same  information  us¬ 
ing  a  Bayesian  network.  The  resulting  network,  for  a 
very coarse grid, might look like Figure 3. We have restricted
ourselves to a single time-slice, as otherwise the
network  is  too  highly  connected  to  be  displayed  in  any 
reasonable  way.  Another  external  observation  node  has 
been  added:  OccObs.  It  represents  the  observations 
made about occupancy, for example from sonar readings
made by the robot. For presentation purposes,




Figure  3:  Searching  a  room  using  an  occupancy  grid 


we  have  represented  the  case  where  the  grid  is  2  by  2, 
which  would  in  reality  be  much  too  coarse  to  obtain  any 
useful  information.  Four  state  nodes  have  been  added, 
one  for  each  grid  cell.  Each  of  these  nodes  represents 
whether  there  is  an  object  in  that  grid  element  that 
would  occlude  the  target  were  it  between  the  robot  and 
the  target. 

We  use  the  term  occlusion  in  the  broad  sense  of  in¬ 
hibiting  observation;  here,  we  assume  that  the  robot  is 
using  some  form  of  vision  to  detect  the  object,  so  we  are 
interested in whether grid elements contain an opaque
object that will make it less likely that the robot will see
the target if it is behind it. The same basic information
can be used by the robot to determine places to hide in
order  to  avoid  detection.  Occupancy  information  can 
be  with  respect  to  a  coordinate  system  centered  on  the 
robot  or  with  respect  to  some  other,  perhaps  global  co¬ 
ordinate  system. 

Now  at  each  time-step,  observations  are  made  about 
the  target,  as  before,  but  also  about  the  presence  of  ob¬ 
jects  in  the  room  that  may  occlude  the  target.  The  lat¬ 
ter  observations  will  affect  the  probability  distribution 
on  the  node  Ot  since  they  will  influence  the  distribu¬ 
tions  of  the  Occ  nodes,  which  influence  Ot.  Moreover, 
the  system  can  reason  about  change  in  the  distributions 
of  these  Occ  nodes. 

This  is  a  very  appealing  model  —  it  allows  us  to 
reason  about  change  in  the  robot’s  knowledge  about 
occupied  regions  of  space;  we  can  represent  the  like¬ 
lihood  that  obstacles  will  block  the  target  from  view, 
and  we  are  using  information  that  we  would  have  to 
gather  anyway  in  order  to  move  around  the  room.  Un¬ 
fortunately,  adding  one  node  per  grid  element  per  time 
slice  makes  the  network  enormously  expensive  to  eval¬ 
uate  and  update  in  response  to  new  evidence.  Given 
the  computational  complexity  inherent  in  representing 
this  information  explicitly,  we  consider  the  possibility 
of  using  the  information  in  a  less  completely  integrated 
manner,  at  a  much  smaller  computational  cost. 

For a given set of probabilities in the occupancy grid,
specifying the conditional probabilities in the network is
straightforward  —  this  is  effectively  the  situation  when 
the  room’s  contents  are  known  in  advance.  We  propose 
that  a  network  such  as  the  one  in  Figure  1  be  used  to 
model  the  room  the  robot  is  searching.  Simultaneously, 
an occupancy grid is constructed as the robot explores
the room, independently of the Bayesian network. At
each time step, before projecting the results of action
sequences to choose the next action to perform, we update
the conditional probabilities in the network in light
of the information in the occupancy grid. The portions
of the network that will change given new information
about occupancy are: the distribution of Ot given Lr
and Lt, and the priors for Lr and Lt at the first time-slice
in the network (i.e. now). More specifically, the
distribution of Ot given Lt and Lr will depend on the
values in the occupancy grid between Lt and Lr: the
more likely there is to be an obstacle, the less likely the
robot is to see the target. The priors for Lr and Lt may
be  influenced  by  the  contents  of  the  occupancy  grid  if 
for  example  we  know  in  advance  that  the  robot  is  un¬ 
likely  to  be  in  the  same  location  as  an  obstacle,  or  that 
the  target  is  likely  to  be  on  an  obstacle  such  as  a  desk. 
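
One possible way to re-quantify the distribution of Ot from the grid is sketched below. It assumes that cells are independent, that the line of sight is sampled along a straight segment, and that the detection rate `p_detect` is a fixed constant; the helper names and the product form are hypothetical choices, not the authors' stated parameterization:

```python
def cells_between(a, b):
    """Grid cells strictly between cells a and b, sampled along the straight segment."""
    (r0, c0), (r1, c1) = a, b
    steps = max(abs(r1 - r0), abs(c1 - c0))
    cells = []
    for k in range(1, steps):
        cell = (round(r0 + (r1 - r0) * k / steps),
                round(c0 + (c1 - c0) * k / steps))
        if cell != a and cell != b and cell not in cells:
            cells.append(cell)
    return cells

def p_observe(grid, robot, target, p_detect=0.95):
    """P(target seen | robot cell, target cell), given occupancy probabilities."""
    p_clear = 1.0
    for r, c in cells_between(robot, target):
        p_clear *= 1.0 - grid[r][c]   # chance this cell does not occlude
    return p_detect * p_clear

# Hypothetical 1 x 4 corridor with two uncertain cells between robot and target.
corridor = [[0.0, 0.5, 0.2, 0.0]]
p_far = p_observe(corridor, (0, 0), (0, 3))
p_near = p_observe(corridor, (0, 0), (0, 1))
```

The more probable an obstacle between the two locations, the smaller the resulting entry in the conditional probability table.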

This  makes  the  specification  of  the  probability  distri¬ 
butions  more  complex  for  the  designer  of  the  network, 
since  they  will  be  parameterized  by  the  values  in  the 
occupancy  grid,  but  the  amount  of  information  avail¬ 
able  with  which  to  make  inferences  is  much  larger  at 
run-time. 

One  disadvantage  of  this  dynamic  technique  over  ex¬ 
plicitly  representing  the  grid  in  the  network  is  that  the 
system  cannot  reason  about  the  effect  of  its  actions 
on  the  contents  of  the  grid,  as  described  in  the  sec¬ 
tion  above.  However,  given  that  representing  this  oc¬ 
cupancy  information  makes  the  decision  model  compu¬ 
tationally  intractable  in  all  but  trivial  cases,  the  choice 
is  between  ignoring  the  effect  of  occupancy  on  the  ob¬ 
servations  made  by  the  robot  and  integrating  them  by 
re-quantifying  the  network;  clearly  the  latter  provides 
us  with  a  better  model. 

Related  Work 

There  is  a  growing  literature  on  using  probabilistic  net¬ 
works  for  applications  in  planning,  interpretation,  and 
image  understanding.  Levitt  et  al.  [1988]  discuss  search 
issues  in  the  context  of  object  recognition.  Agosta 
[1991] discusses the issues involved in quantifying relations
among visual features. Goldman [1990] addresses
related  problems  in  the  context  of  natural  language  un¬ 
derstanding,  with  an  emphasis  on  dynamically  generat¬ 
ing  probabilistic  models. 

Conclusions 

This  paper  presents  techniques  for  applying  a  well 
known,  mathematically  pleasing  but  computationally 
impractical approach for decision making under uncertainty
to problems in robotics and perception. We provide
practical representational alternatives for reasoning
about the fundamental issues of time, space, and
decision  value.  Ultimately,  we  are  looking  for  a  com¬ 
prehensive  and  disciplined  approach  to  making  the  sort 
of  tradeoffs  involving  modeling  precision  and  computa¬ 
tional  complexity  that  are  outlined  in  this  paper.  While 
that  goal  is  still  some  distance  off,  we  are  beginning  to 
understand  some  of  the  ramifications  of  various  design 
decisions. 
Abstract

Implicit higher degree polynomials in x, y, z (or
in x, y for curves in images) have considerable
global  and  semiglobal  representation  power  for 
objects  in  3D  space.  (Spheres,  cylinders,  cones 
and  planes  are  special  cases  of  such  polynomi¬ 
als  restricted  to  second  degree.)  Hence,  they 
have  great  potential  for  object  recognition  and 
position  estimation.  In  this  paper  we  deal  with 
three  problems  pertinent  to  using  these  polyno¬ 
mials  in  real  world  robust  systems:  1)  Charac¬ 
terization  and  fitting  algorithms  for  the  subset 
of  these  algebraic  curves  and  surfaces  that  is 
bounded  and  exists  largely  in  the  vicinity  of  the 
data; 2) A Mahalanobis distance for comparing
the  coefficients  of  two  polynomials  to  determine 
whether  the  curves  or  surfaces  that  they  repre¬ 
sent  are  close  over  a  specified  region;  3)  Geo¬ 
metric  Invariants  for  determining  whether  one 
implicit  polynomial  curve  or  surface  is  a  rota¬ 
tion  and  translation  of  another,  or  whether  one 
implicit  polynomial  curve  is  an  affine  transfor¬ 
mation  of  another.  In  addition  to  handling  ob¬ 
jects  with  easily  detectable  features  such  as  ver¬ 
tices,  high  curvature  points,  and  straight  lines, 
the  polynomials  and  tools  discussed  in  this  pa¬ 
per  are  ideally  suited  to  smooth  curves  and 
smooth  curved  surfaces  which  do  not  have  de¬ 
tectable  features. 

1  Introduction  and  Previous  Work 

Much of the early work on implicit polynomial curves
and surfaces was limited to 2nd degree polynomials,
thus  dealing  with  representations  that  had  modest  ex¬ 
pressive  power,  but  the  fitting  algorithms  were  simple, 
the  computational  cost  small,  and  the  resulting  polyno¬ 
mial  coefficients  were  reasonably  stable  [4,  17,  16,  13,  5, 
3,  14].  Implicit  polynomial  curves  and  surfaces  of  degree 
higher  than  two  have  great  modeling  power  for  compli¬ 
cated  shapes  and  can  be  made  to  fit  data  very  well,  but 
their  coefficients  may  be  sensitive  to  small  changes  in 
the  data.  This  poses  a  problem  since  we  would  like  to 

‘This  work  was  partially  supported  by  NSF  Grant  #IRI- 
8715774  and  NSF-DARPA  Grant  #IRI-8905436 


compare  curves  and  surfaces  based  on  their  polynomial 
coefficients or functions of the coefficients that represent
only  shape,  i.e.  that  are  invariant  to  object  rotation, 
translation  and  stretching  in  two  directions  -  general 
linear  transformations.  In  this  paper  we  present  new 
approaches  and  tools  to  these  problems  which  should 
permit  robust  2D  curve  and  3D  surface  object  recog¬ 
nition  and  position  estimation  based  on  the  polynomial 
coefficients  only.  In  particular,  we  introduce  the  class 
of  implicit  polynomials  that  represent  closed  curves  or 
surfaces,  exhibit  a  low  computation  cost  algorithm  for 
fitting such that the curve or surface exists only in the
general region of the data, illustrate the wide range of
shapes that can be represented and illustrate the improved
stability of the coefficients. Then, for any polynomial,
whether it represents a closed or an open unbounded
curve or surface, we present and discuss a simple
expression for the a posteriori probability distribution
of its coefficients given the data set that is to be represented
by a polynomial. Polynomial coefficient sensitivity
to small changes in the data occurs when a data set
does not sufficiently constrain the coefficients. Our development
both determines the subset of coefficient space
that  is  constrained  by  the  data  and  provides  the  appro¬ 
priate  metric  for  polynomial  curve  or  surface  recognition 
based  on  the  polynomial  coefficient  vectors.  Lastly,  two 
approaches  are  discussed  to  the  design  of  functions  of  the 
polynomial  coefficients  that  are  invariant  to  Euclidean 
or  affine  transformations.  Both  approaches  appear  to 
be  promising  and  can  be  functions  of  all  the  coefficients 
rather  than  of  just  the  coefficients  of  higher  degree,  which 
is  the  subject  of  classical  geometric  invariance  theory. 

2  Description  of  Closed  Objects  Using 
Polynomials 

2.1  Finding  the  Fitting  Polynomial 

In general, for a polynomial $p(x,y)$ to describe a closed
object O with boundary B the following should hold:

1) The set $\{(x,y) : p(x,y) = 0\}$ is equal to B.

2) $(x,y) \in O$ iff $p(x,y) < 0$.

We shall refer to the set $\{(x,y) : p(x,y) = 0\}$ as the
zero set. Let us note at this point that polynomials with
an unbounded zero set can describe curve patches, but in
Section 2 we are interested in describing closed bounded
objects.
Since second degree polynomials can describe only circles
and ellipses, let us proceed to higher degrees. The
standard notation for a polynomial of degree n will be
adopted: $p(x,y) = \sum_{0 \le i+j \le n} a_{ij} x^i y^j$. The following simple
lemma shows that the next class in the polynomial
hierarchy is not suitable for describing closed objects.

Lemma  1  The  zero  set  of  a  third  degree  polynomial  is 
unbounded. 


Proof: [11].

Next on the list are fourth degree polynomials. Their
zero set can be bounded, e.g. $x^4 + y^4 - 1 = 0$, or
unbounded, e.g. $x^4 - y^4 = 0$. It is not surprising that
the high powers of the polynomial determine if its zero
set is bounded or not. Let us call those fourth degree
powers, e.g. $a_{40}x^4 + a_{31}x^3y + a_{22}x^2y^2 + a_{13}xy^3 + a_{04}y^4$,
the leading form of $p(x,y)$, or $p_4(x,y)$, and the sum of the
lower powers (cubics, quadrics, linear terms and the
constant) the lower terms, or $p_3(x,y)$. Let us also define
a polynomial to be stably bounded if a small perturbation
of its coefficients leaves its zero set bounded. For reasons
of numerical robustness we are interested only in stably
bounded polynomials.
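
As a rough numerical illustration of why the leading form decides boundedness: if the leading form is strictly positive in every direction, the polynomial grows like the fourth power of the distance from the origin, so its zero set cannot extend to infinity. The sketch below samples the leading form on the unit circle for the two examples above. The function names and the sample count are illustrative assumptions, and sampling is only a heuristic check, not a proof:

```python
import math

def leading_form(coeffs, x, y):
    """Evaluate a40 x^4 + a31 x^3 y + a22 x^2 y^2 + a13 x y^3 + a04 y^4."""
    a40, a31, a22, a13, a04 = coeffs
    return (a40 * x**4 + a31 * x**3 * y + a22 * x**2 * y**2
            + a13 * x * y**3 + a04 * y**4)

def min_on_unit_circle(coeffs, samples=3600):
    """Smallest sampled value of the leading form over directions (cos t, sin t)."""
    return min(
        leading_form(coeffs,
                     math.cos(2 * math.pi * k / samples),
                     math.sin(2 * math.pi * k / samples))
        for k in range(samples)
    )

min_bounded = min_on_unit_circle((1, 0, 0, 0, 1))     # leading form of x^4 + y^4 - 1
min_unbounded = min_on_unit_circle((1, 0, 0, 0, -1))  # leading form of x^4 - y^4
```

The first minimum is clearly positive, while the second is negative, matching the bounded and unbounded examples.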


Theorem 1 The zero set of $p(x,y)$ is stably bounded iff
there exists a symmetric positive definite matrix A such
that $p_4(x,y) = (x^2 \;\, xy \;\, y^2)\, A\, (x^2 \;\, xy \;\, y^2)^T$.

Proof: [11].

Summarizing, given an object O with
boundary B we look for a fourth degree polynomial
$p(x,y)$ such that:

1) $p_4(x,y)$ can be expressed as
$(x^2 \;\, xy \;\, y^2)\, A\, (x^2 \;\, xy \;\, y^2)^T$
with A symmetric positive definite.

2)  The  zero  set  of  p(x,  y)  approximates  B. 

We know what the first condition means. How is the
second satisfied? The first guess is: find $p(x,y)$ such that
$\sum_{(x_i,y_i) \in B} p^2(x_i,y_i)$ is minimal. This, however, results in a
far from optimal description of B because $p^2(x_i,y_i)$ is
often a poor measure for the distance of $(x_i,y_i)$ from the
zero set of $p(x,y)$. A much better measure, suggested
by [17] and extended in [21], is $p^2(x_i,y_i) / \|\nabla p(x_i,y_i)\|^2$, the approximate
squared distance of $(x_i,y_i)$ from the curve (where
$\|\nabla p(x_i,y_i)\|^2$ stands for the square of the norm of the gradient).
So the expression to be minimized is

$$\sum_{(x_i,y_i) \in B} \frac{p^2(x_i,y_i)}{\|\nabla p(x_i,y_i)\|^2} \qquad (1)$$
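
A small sketch comparing the raw algebraic error with the gradient-normalized measure, using the unit circle as an illustrative polynomial (the choice of polynomial and test point is an assumption made for the example):

```python
import math

def p(x, y):
    # Implicit polynomial of the unit circle (illustrative choice).
    return x * x + y * y - 1.0

def grad_p(x, y):
    return (2.0 * x, 2.0 * y)

def approx_sq_distance(x, y):
    """p^2 / |grad p|^2, the first-order approximation of squared distance."""
    gx, gy = grad_p(x, y)
    return p(x, y) ** 2 / (gx * gx + gy * gy)

point = (1.1, 0.0)                       # true distance to the circle is 0.1
algebraic_sq = p(*point) ** 2            # raw p^2, not in distance units
approx_distance = math.sqrt(approx_sq_distance(*point))
```

For this point the normalized measure is within a few percent of the true distance, whereas the square root of the raw algebraic error is about twice too large.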


Taubin [21], in an extensive work on implicit curve and
surface fitting, solves the problem by approximating
Equation 1 with the expression

$$\frac{\sum_{(x_i,y_i) \in B} p^2(x_i,y_i)}{\sum_{(x_i,y_i) \in B} \|\nabla p(x_i,y_i)\|^2} \qquad (2)$$

and then minimizing Equation 2 by generalized eigenvector
techniques, followed by an iterative scheme for
improving the polynomial fit. Taubin’s work results in
excellent  fits,  but  he  does  not  worry  about  the  zero  set 
being  bounded;  hence  the  outcome  of  his  fitting  algo¬ 
rithm  is  that  the  zero  set  contains  B  but  often  has  addi¬ 
tional  unbounded  parts  (see  Figures  1,2).  A  simple  ex¬ 
ample  is  that  of  a  square;  Taubin’s  algorithm  describes  it 
as  the  union  of  four  straight  lines,  with  the  corresponding 
p(x,  y)  equal  to  the  product  of  the  four  linear  polynomi¬ 
als  describing  these  lines.  So  the  square  is  represented 
as  the  union  of  the  infinite  extension  of  its  edges. 

The  question  is  how  to  incorporate  into  Taubin’s  al¬ 
gorithm  the  condition  that  the  zero  set  be  bounded. 
What  should  be  done  is  simple:  look  only  for  polyno¬ 
mials  p(x,y)  such  that  p4(x,y)  can  be  expressed  as  in 
Theorem  1.  The  question  is  how  to  parametrize  positive 
definite matrices. We use the following result [1]: if a
matrix A is symmetric positive definite, it has a symmetric
square root B.

Hence it is enough to look at all $p(x,y)$ where $p_4(x,y)$
can be written as $(x^2 \;\, xy \;\, y^2)\, B^2\, (x^2 \;\, xy \;\, y^2)^T$, where B is
symmetric. Thus the strategy chosen was to minimize
the error measure of Equation 1 while conforming to the
above condition. This is done by minimizing not over the
space of unconstrained polynomials, but only over the
space of $p(x,y)$ such that $p_3(x,y)$ is unconstrained and
$p_4(x,y)$ is as above. Technically, we look for the optimal
B (6 parameters) and $p_3(x,y)$ (10 parameters). Note
that in Equation 1, fourth powers of the elements of B
appear. This does make the minimization problem nonlinear,
but that seems a reasonable price to pay if one
wants to enforce boundedness of the zero set. Sometimes,
the zero set is bounded but contains spurious parts (Figure
5b). This problem was solved by adding $\epsilon I$ to $B^2$,
where I is the identity matrix and $\epsilon$ a small positive
constant.
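
The parametrization can be sketched as follows: given any symmetric 3 x 3 matrix B (6 free parameters), form $A = B^2 + \epsilon I$ and expand the quadratic form in the monomials $(x^2, xy, y^2)$ to obtain the five leading-form coefficients. The function names and the sample matrix are assumptions made for illustration:

```python
import math

def leading_coeffs_from_B(B, eps=1e-3):
    """Return (a40, a31, a22, a13, a04) of m A m^T, m = (x^2, xy, y^2), A = B^2 + eps*I."""
    A = [[sum(B[i][k] * B[k][j] for k in range(3)) for j in range(3)]
         for i in range(3)]
    for i in range(3):
        A[i][i] += eps
    # m A m^T = A00 x^4 + 2 A01 x^3 y + (A11 + 2 A02) x^2 y^2 + 2 A12 x y^3 + A22 y^4
    return (A[0][0], 2 * A[0][1], A[1][1] + 2 * A[0][2], 2 * A[1][2], A[2][2])

def eval_leading(coeffs, x, y):
    a40, a31, a22, a13, a04 = coeffs
    return (a40 * x**4 + a31 * x**3 * y + a22 * x**2 * y**2
            + a13 * x * y**3 + a04 * y**4)

# Any symmetric B works; this one is an arbitrary example.
B = [[1.0, 2.0, 0.0],
     [2.0, -1.0, 1.0],
     [0.0, 1.0, 0.0]]
coeffs = leading_coeffs_from_B(B)
smallest = min(eval_leading(coeffs, math.cos(0.01 * k), math.sin(0.01 * k))
               for k in range(629))
```

Because the optimizer varies B freely, every candidate it visits has a positive leading form, so no explicit constraint handling is needed.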

Another problem affecting the running time of the fitting
algorithm is that the expression in Equation 1 is
expensive  to  calculate.  Most  non-linear  minimization 
techniques  require  computation  of  the  function  and  its 
derivatives  many  times.  If  many  points  are  present,  this 
means  computing  the  sum  of  the  function  over  its  gradi¬ 
ent  squared  in  all  these  points,  requiring  enormous  time. 
However,  it  is  possible  to  overcome  this  problem  using 
the  following  iterative  algorithm: 

1) Minimize $\sum_{(x_i,y_i) \in B} p^2(x_i,y_i)$. This is quite fast, because
the sum of the squares of the polynomial at the
points can be written as $F M F^T$, where F is the vector
of the polynomial’s coefficients and M is a scatter matrix
of the points [21]. This is much faster than using Equation 1
directly. Call the optimal polynomial $P_1(x,y)$.

2) Assign to each point $p_i$ a weight $w_i = 1 / \|\nabla P_1(p_i)\|^2$.

3) Minimize $\sum_i w_i P^2(p_i)$. This is also quick: it is
exactly the same process as in 1), with M replaced by a
weighted scatter matrix.

4) Go back to 2) and update the weights using the
minimizer of 3) instead of $P_1(x,y)$.

5) Iterate until the error of fit, measured by Equation
1, doesn’t decrease substantially. (Note that we are using
Equation 1, but only a small number of times; usually
fewer than 5 iterations are needed.)
This algorithm is suboptimal in the sense that it
doesn’t  minimize  Equation  1,  but  it  is  much  faster  and 
results  in  the  same  quality  of  fits.  Running  times  will  be 
discussed  in  Section  3. 
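
The reweighting loop can be sketched on a deliberately simplified problem. The sketch below departs from the method described above in three labeled ways: it fits a second degree polynomial rather than a bounded fourth degree one, it fixes the constant term to -1 so that each weighted step is an ordinary linear least-squares solve instead of a scatter-matrix eigenproblem, and it runs a fixed number of iterations instead of monitoring Equation 1. All function names are hypothetical:

```python
import math

def solve(A, b):
    """Solve A x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    M = [row[:] + [b[i]] for i, row in enumerate(A)]
    for col in range(n):
        piv = max(range(col, n), key=lambda r: abs(M[r][col]))
        M[col], M[piv] = M[piv], M[col]
        for r in range(col + 1, n):
            f = M[r][col] / M[col][col]
            for c in range(col, n + 1):
                M[r][c] -= f * M[col][c]
    x = [0.0] * n
    for i in reversed(range(n)):
        x[i] = (M[i][n] - sum(M[i][j] * x[j] for j in range(i + 1, n))) / M[i][i]
    return x

def monomials(x, y):
    return [x * x, x * y, y * y, x, y]

def weighted_fit(points, weights):
    """Minimize sum_i w_i p(x_i, y_i)^2 for p = c . monomials - 1 (normal equations)."""
    n = 5
    A = [[0.0] * n for _ in range(n)]
    b = [0.0] * n
    for (x, y), w in zip(points, weights):
        m = monomials(x, y)
        for i in range(n):
            b[i] += w * m[i]
            for j in range(n):
                A[i][j] += w * m[i] * m[j]
    return solve(A, b)

def grad_sq(c, x, y):
    gx = 2 * c[0] * x + c[1] * y + c[3]
    gy = c[1] * x + 2 * c[2] * y + c[4]
    return gx * gx + gy * gy

def reweighted_fit(points, iterations=5):
    weights = [1.0] * len(points)              # step 1: unweighted fit
    for _ in range(iterations):
        c = weighted_fit(points, weights)      # steps 1 and 3
        weights = [1.0 / grad_sq(c, x, y) for x, y in points]   # step 2
    return c

# Noise-free points on the ellipse x^2/4 + y^2 = 1.
points = [(2 * math.cos(2 * math.pi * k / 40), math.sin(2 * math.pi * k / 40))
          for k in range(40)]
coeffs = reweighted_fit(points)
```

On noise-free data the weights do not change the answer; with noisy data they push each step toward minimizing the gradient-normalized error rather than the raw algebraic error.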

Some  examples  are  provided  of  bounded  vs.  un¬ 
bounded  descriptions.  In  Figures  1  and  2  an  object 
(super-quadric)  is  shown  with  bounded  and  unbounded 
polynomial  fits  (the  scale  has  been  changed  to  show  the 
global  behavior  of  the  zero  set  of  the  unbounded  poly¬ 
nomial).  It  is  interesting  to  observe  how  the  zero  set  in 
Figure  2  contains  the  super-quadric  but  has  additional 
unbounded  components.  In  Figure  3  an  assortment  of 
objects that can be exactly described by fourth degree
polynomials  is  presented,  and  in  Figure  4  the  power  of 
polynomials  is  demonstrated  by  showing  four  discon¬ 
nected  objects  that  are  represented  by  a  single  fourth 
degree  polynomial. 

Only  polynomials  of  degree  four  have  been  used  in 
this  work.  In  the  future,  higher  order  polynomials  will 
be  investigated. 

2.1.1  Description  of  3D  Objects 

Everything that was said about polynomials $p(x,y)$
extends to $p(x,y,z)$. Boundedness of the zero set can be
satisfied by using the identity for a fourth-degree polynomial:

$(x^2 \;\, xy \;\, xz \;\, y^2 \;\, yz \;\, z^2)\, A\, (x^2 \;\, xy \;\, xz \;\, y^2 \;\, yz \;\, z^2)^T$

$= a_{400}x^4 + a_{310}x^3y + a_{301}x^3z + \ldots$

where  A  is  a  6  x  6  symmetric  positive  definite  matrix,  and 
proceeding  with  the  same  line  of  thought  as  for  p(x,y). 
In  Section  3  results  for  3D  fits  will  be  surveyed. 

2.2  Relation  to  Other  Models:  Super-Quadrics 

An  analytic  model  that  has  proved  effective  in  describing 
non-polygonal  objects  in  Graphics,  Vision,  and  Robotics 
is  the  super-quadric  model  [2,  9].  As  the  name  implies 
these  are  extensions  of  ellipses  and  ellipsoids  (and  indeed 
are also called super-ellipsoids). In the plane, all super-quadrics
can be described by a scaling, rotation, and
translation of the “twisted circles” $C_\epsilon = \{(x,y) : x^\epsilon + y^\epsilon = 1\}$.
If $\epsilon = 2$, $C_\epsilon$ is a circle; for $\epsilon > 2$ it is a circle
that is “pushed out”, approaching the unit square as $\epsilon$
grows; for small $\epsilon$, $C_\epsilon$ is a non-convex star-shaped set
(see Figure 1).
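
For experimentation, points on such a curve can be generated with the usual trigonometric parametrization. The sketch assumes absolute values are intended in the defining equation (so that the curve is defined in all four quadrants for non-integer exponents); the function name is hypothetical:

```python
import math

def superquadric_points(eps, n=200):
    """Sample n points satisfying |x|^eps + |y|^eps = 1."""
    pts = []
    for k in range(n):
        t = 2 * math.pi * k / n
        c, s = math.cos(t), math.sin(t)
        # |x|^eps = cos^2 t and |y|^eps = sin^2 t, with signs carried over.
        x = math.copysign(abs(c) ** (2.0 / eps), c)
        y = math.copysign(abs(s) ** (2.0 / eps), s)
        pts.append((x, y))
    return pts

star = superquadric_points(0.6)      # non-convex, star-shaped
circle = superquadric_points(2.0)    # ordinary unit circle
boxy = superquadric_points(8.0)      # close to the unit square
```

Point sets like these can serve as input data when comparing a polynomial fit against a super-quadric.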

It  can  be  seen  that  an  unconstrained  fourth  degree 
fit  to  the  super-quadric  in  Figure  2  is  indistinguishable 
from  the  super-quadric,  but  its  zero  set  is  unbounded. 
Depending  on  the  application,  this  may  or  may  not  pose 
a  problem.  How  do  fourth  degree  polynomials  with  a 
closed  zero-set  relate  to  super-quadrics?  Since  polyno¬ 
mials  can  also  be  scaled,  rotated,  and  translated,  then 
to  relate  the  description  power  of  polynomials  to  that  of 
super-quadrics it is enough to check if the $C_\epsilon$ are well
approximated by polynomials. The approximation for
$\epsilon = 0.6$ is presented in Figure 1. For larger values of $\epsilon$ the
approximations are so good that one cannot distinguish
the super-quadric from the fitting polynomial by looking
at them! It can be seen that the polynomials give a reasonably
good description of the super-quadrics for small
$\epsilon$, although they miss the sharp protrusion coming out
of the sides of the super-quadrics. These results should
be expected since the polynomials have more degrees of
freedom than the super-quadrics.

3  Invariants:  Using  Polynomials  for 
Recognition 

3.1  What  are  Invariants? 

Invariants are quantities assigned to polynomials that do
not change when the coordinate system undergoes transformations,
and hence are descriptors of shape, a natural
candidate for recognition purposes. Probably the first
invariants known were those of the second degree polynomial
$a_{20}x^2 + a_{11}xy + a_{02}y^2 + a_{10}x + a_{01}y + a_{00}$: if
the coordinate system undergoes an affine transformation
$(u,v)^T = A(x,y)^T + T$, where A is a non-singular
matrix and T a vector, then the quantities $a_{20} + a_{02}$
and $4a_{20}a_{02} - a_{11}^2$ are multiplied by the square of the
determinant of A. Classical work [8] was concentrated
on  finding  affine  invariants,  but  only  of  the  leading 
form.  This  is  probably  because  the  leading  form  does 
not  change  under  translation,  hence  the  problem  of 
finding  invariants  is  more  approachable  and  the  theory 
more  elegant.  In  [8],  a  beautiful  solution  -  the  sym¬ 
bolic  method  -  is  presented  for  writing  down  all  the 
affine  invariants  of  the  leading  form.  However,  there 
are other classes of invariants, such as Euclidean invariants,
and invariants that depend on all the coefficients,
that remain to be explored. Also, as noted in [19, 20,
21], in real applications one should use invariants that
are  as  simple  as  possible.  Taubin  finds  many  invari¬ 
ants  that  are  easy  to  compute,  as  they  are  expressed 
as  eigenvectors  of  relatively  small  matrices.  However, 
for  application  of  Section  4  to  invariants,  they  pose  a 
problem,  since  they  are  not  given  as  explicit  functions  of 
the  coefficients,  and  thus  cannot  be  used  directly  in  the 
statistical  analysis  presented  there.  This  is  because  the 
goal  is  to  utilize  probability  distributions  on  the  coeffi¬ 
cients  to  find  the  probability  distribution  of  the  invari¬ 
ants;  for  that,  simple  explicit  invariants  are  necessary.  [6, 
7]  contain  an  extensive  treatment  of  explicit  invariants 
for Vision for second degree polynomial curves.

3.2  Using  Symbolic  Computation  to  Find 
Invariants 

In order to find simple and explicit invariants of polynomials,
the tool of symbolic computation was used. We
have used the Mathematica package [22], running on a
SPARC workstation. Recently, symbolic computation
has been finding applications in Vision [12, 10].

The  method  used  to  find  invariants  is  best  demon¬ 
strated  by  a  simple  example.  The  input  for  Mathematica 
to  find  Euclidean  invariants  of  the  second  degree  form  in 
two variables x and y is presented next:

1  a20 x^2 + a11 x y + a02 y^2

2  % /. x -> u (1 - q^2/2) - v (q - q^3/6)

3  % /. y -> v (1 - q^2/2) + u (q - q^3/6)

4  exp1 = Expand[%]

5  aa20 = D[exp1, {u, 2}]/2

6  aa02 = D[exp1, {v, 2}]/2

7  aa11 = D[exp1, {u, 1}, {v, 1}]

8  exp2 = Expand[A aa20^2 + B aa11^2 + C aa02^2 + D aa20 aa11 +
   E aa20 aa02 + F aa11 aa02]

9  exp3 = A a20^2 + B a11^2 + C a02^2 + D a20 a11 + E a20 a02 +
   F a11 a02

10  exp4 = Expand[exp2 - exp3]

11  q1 = Expand[D[exp4, {a20, 2}]]

12  q2 = Expand[D[exp4, {a11, 2}]]

13  q3 = Expand[D[exp4, {a02, 2}]]

14  q4 = Expand[D[exp4, {a20, 1}, {a11, 1}]]

15  q5 = Expand[D[exp4, {a20, 1}, {a02, 1}]]

16  q6 = Expand[D[exp4, {a11, 1}, {a02, 1}]]

17  eq1 = (Coefficient[q1, q, 0] == 0)

18  eq2 = (Coefficient[q1, q, 1] == 0)

19  eq3 = (Coefficient[q1, q, 2] == 0)

20  eq22 = (Coefficient[q6, q, 1] == 0)

21  eq23 = (Coefficient[q6, q, 2] == 0)

22  eq24 = (Coefficient[q6, q, 3] == 0)

23  Solve[{eq1, eq2, ..., eq23, eq24}, {A, B, C, D, E, F}]

In line 1, a leading form of degree 2 in two variables
is presented. In lines 2, 3, 4 the coordinates are rotated
using Taylor series approximations for the trigonometric
functions. In lines 5, 6, 7 the coefficients of the new
form are computed. In lines 8 and 9 we guess for invariants
that are second degree polynomials in the coefficients;
there is no formal justification for that, but there seem
to be many invariants that are of this type. Let us call
the power of the coefficients in the invariant the rank
of the invariant; thus the invariant $a_{20} + a_{02}$ is of rank
one, and $4a_{20}a_{02} - a_{11}^2$ of rank two. In lines 10-16, the
difference between the polynomial expression in the new
coefficients (exp2) and the old ones (exp3) is differentiated
so as to get the coefficients out of the way, and in
lines 17-22 the derivatives of these expressions by q are
set equal to zero (it is necessary to get rid of all variables
in order to remain with a linear system of equations). In
line 23 the system is solved. After 4.5 seconds, Mathematica
gave the following answer:

{A = C, B = C/2 - E/4, D = 0, F = 0}

Substituting first {C = 1, E = 0} and then {C =
0, E = 1} we get two invariants that are equivalent to
the invariants noted before, $(a_{20} + a_{02})^2$ and $4a_{20}a_{02} - a_{11}^2$
(remembering that combinations of invariants are
also invariants).
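
The two-parameter family returned by the solver can be checked numerically with an exact rotation in place of the Taylor approximation. The sketch below is a Python re-creation for checking purposes, not the authors' code; the function names and the test coefficients are assumptions:

```python
import math

def rotate_quadratic(a20, a11, a02, q):
    """Coefficients of the form after x = u cos q - v sin q, y = u sin q + v cos q."""
    c, s = math.cos(q), math.sin(q)
    aa20 = a20 * c * c + a11 * c * s + a02 * s * s
    aa11 = -2 * a20 * c * s + a11 * (c * c - s * s) + 2 * a02 * s * c
    aa02 = a20 * s * s - a11 * s * c + a02 * c * c
    return aa20, aa11, aa02

def family(a20, a11, a02, C, E):
    """Rank-two candidate with A = C, B = C/2 - E/4, D = 0, F = 0."""
    A, B = C, C / 2.0 - E / 4.0
    return A * a20**2 + B * a11**2 + C * a02**2 + E * a20 * a02

original = (1.3, -0.7, 2.1)            # arbitrary test coefficients
rotated = rotate_quadratic(*original, 0.83)

trace_before = original[0] + original[2]
trace_after = rotated[0] + rotated[2]
disc_before = 4 * original[0] * original[2] - original[1] ** 2
disc_after = 4 * rotated[0] * rotated[2] - rotated[1] ** 2
```

Both members of the family, as well as the rank-one and rank-two invariants quoted in the text, keep their values under the exact rotation.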

We  note  here  that  solutions  obtained  this  way  are  not 
necessarily  invariants,  because  the  trigonometric  func¬ 
tions  were  replaced  by  approximations.  However,  the 
method  does  assure  that  all  invariants  of  the  particular 
shape  sought  are  included  in  the  set  of  solutions,  and 
what  is  left  is  to  detect  if  any  spurious  solutions  crept 
in.  Fortunately,  it  is  much  easier  to  check  if  a  certain 
expression is an invariant than to find invariants; to do
that, change the coordinate system and check if the expression
remains the same. In all our experiments we
did not come upon any spurious solutions. An intuitive
explanation is that expressions that are invariant under
arbitrarily small rotations are invariant under every rotation,
as every rotation can be generated by composing
small rotations. For affine invariants we have to force
the expression to stay constant under translation also,
but the idea is the same.

The example above is trivial. However, this simple
technique gets rather messy when searching for invariants
of higher rank of polynomials of, say, fourth degree,
and invariants of polynomials in 3 variables. In general,
a d-th degree form in n variables has $\binom{n+d-1}{d}$
coefficients. Suppose we want to find out if the third
degree form in x, y, z (e.g. $a_{300}x^3 + a_{210}x^2y + \ldots + a_{003}z^3$)
(10 coefficients) has invariants of rank four.
Then the number of the coefficients which are the
analogues of A, B, C, D, E, F in the example above is
$\binom{10+4-1}{4} = 715$. It is easy to see that manually
writing down the system of equations becomes impossible,
and indeed it was done using code generating
programs.  The  resulting  system  of  equations  contained 
4,290  linear  equations,  and  the  time  of  handling  the  huge 
expressions  and  solving  the  system  was  about  5  hours. 
Many  more  invariants  were  found  this  way,  including  in¬ 
variants  that  are  not  of  the  leading  form  but  of  all  the 
coefficients  (there  is  no  known  method  for  finding  these) 
of  polynomials  in  two  and  three  variables.  If  affine  in¬ 
variants  are  sought,  the  calculation  can  be  made  much 
easier  by  noting  that  the  possible  invariants  are  only  a 
small subset of all polynomials in the coefficients; looking
for instance at the second degree form in x, y, we
can rule out immediately an invariant containing both
$a_{20}^2$ and $a_{11}a_{02}$. This is because stretching the coordinate
system by c in the x-direction would multiply $a_{20}^2$
by $c^4$, and $a_{11}a_{02}$ by $c$. So, the only combinations that
can constitute an invariant are $a_{20}a_{02}$ and $a_{11}^2$. Similar
considerations apply for other polynomials. For instance,
finding affine invariants of rank four of a fourth-degree
polynomial in x, y was only a few minutes' work
for Mathematica. A simpler example, finding Euclidean
invariants of rank two of a fourth-degree polynomial in
x, y, took 20 seconds and resulted in the invariants

$a_{22}^2 - 3a_{13}a_{31} + 12a_{04}a_{40}$,

$3a_{13}^2 - 8a_{04}a_{22} + 2a_{13}a_{31} + 3a_{31}^2 - 32a_{40}a_{04} - 8a_{22}a_{40}$,

$3a_{04}^2 + 2a_{04}a_{22} + a_{13}a_{31} + 2a_{04}a_{40} + 2a_{22}a_{40} + 3a_{40}^2$.

Of these, only the first one is an affine invariant and
hence is the only one that can be found using the symbolic
method. In addition, 8 invariants of rank 3 of a
fourth-degree polynomial in x, y were found.
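
As suggested earlier in this section, a candidate invariant is easy to check by changing the coordinate system and recomputing it. The sketch below rotates the leading form of a fourth-degree polynomial by an arbitrary angle and evaluates the three rank-two expressions before and after. It is a Python check written for illustration (the dictionary layout, function names and test coefficients are assumptions), not the Mathematica code used by the authors:

```python
import math
from math import comb

def rotate_form(a, q):
    """a maps (i, j) -> coefficient of x^i y^j with i + j = 4.
    Returns the coefficients in (u, v) after x = u cos q - v sin q, y = u sin q + v cos q."""
    c, s = math.cos(q), math.sin(q)
    out = {(4 - k, k): 0.0 for k in range(5)}
    for (i, j), coef in a.items():
        # Expand (c u - s v)^i (s u + c v)^j with the binomial theorem.
        for m in range(i + 1):        # m = power of v taken from the x factor
            for n in range(j + 1):    # n = power of v taken from the y factor
                term = (comb(i, m) * c ** (i - m) * (-s) ** m
                        * comb(j, n) * s ** (j - n) * c ** n)
                out[(4 - m - n, m + n)] += coef * term
    return out

def invariants(a):
    a40, a31, a22, a13, a04 = (a[(4, 0)], a[(3, 1)], a[(2, 2)], a[(1, 3)], a[(0, 4)])
    inv1 = a22**2 - 3 * a13 * a31 + 12 * a04 * a40
    inv2 = (3 * a13**2 - 8 * a04 * a22 + 2 * a13 * a31 + 3 * a31**2
            - 32 * a40 * a04 - 8 * a22 * a40)
    inv3 = (3 * a04**2 + 2 * a04 * a22 + a13 * a31 + 2 * a04 * a40
            + 2 * a22 * a40 + 3 * a40**2)
    return inv1, inv2, inv3

form = {(4, 0): 1.3, (3, 1): -0.7, (2, 2): 2.1, (1, 3): 0.4, (0, 4): 0.9}
before = invariants(form)
after = invariants(rotate_form(form, 0.83))
```

The values in `before` and `after` agree to rounding error, as expected of Euclidean invariants of the leading form.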

Summarizing, the brute force technique of using symbolic
computations by computer for transforming the invariant
problem to a (huge) linear system proved very
useful in finding simple, explicit invariants that cannot
be found by the symbolic method, such as Euclidean invariants
and invariants that depend on all coefficients
and  not  only  on  the  leading  form. 
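
The sizes quoted in this section follow from counting monomials: a form of degree d in n variables has as many coefficients as there are monomials of degree d in n variables. A short check, with a hypothetical helper name:

```python
from math import comb

def num_coefficients(n_vars, degree):
    """Number of monomials of the given degree in n_vars variables."""
    return comb(n_vars + degree - 1, degree)

quadratic_2d = num_coefficients(2, 2)                    # a20, a11, a02
unknowns_rank2 = num_coefficients(quadratic_2d, 2)       # A, B, C, D, E, F
cubic_3d = num_coefficients(3, 3)                        # third degree form in x, y, z
unknowns_rank4 = num_coefficients(cubic_3d, 4)           # rank-four candidates
```

The same function gives the number of unknowns for any other degree and rank before the linear system is generated.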

An example is given for recognizing an object with
polynomials. Note the object does not have any of the
common features often used, such as vertices or high curvature
points, and recognition methods based on feature
identification might run into trouble trying to recognize
objects of this type. Figure 5a shows the data set, the
bounded 4th degree polynomial fit and three Euclidean
invariants discussed in section 3. Figure 5b shows the
data set rotated and translated and a 4th degree polynomial
fit that is not restricted to lie in the vicinity of
the data. Although the polynomial in this figure fits the
data very well, the invariants change somewhat. In Figure
5c, the same data set as in 5b is fit with a 4th degree
polynomial while restricting the zero set of the polynomial
to lie within the vicinity of the data [11]. Note that
the invariants are closer to the invariants of figure 5a.
In  Figure  5d,  the  data  set  is  further  rotated  and  con¬ 
taminated  with  considerable  noise.  Again  the  zero  set 
is  restricted  to  lie  in  the  data  region,  and  the  resulting 
invariants  are  quite  similar  to  those  in  5a.  These  sets 
of  figures  illustrate  that  first,  the  invariants  discussed  in 
this  section  really  work,  and  second,  restricting  the  poly¬ 
nomials  to  the  vicinity  of  the  data  set  can  help  to  stabi¬ 
lize  the  estimated  coefficients,  and,  hence,  the  invariants. 
Figure  5e  shows  a  different  object  with  its  fourth  degree 
fit  and  the  invariants,  and  Figure  5f  a  rotation  and  trans¬ 
lation  of  that  data  with  its  invariants  (please  note  that 
the  data  sets  in  Figures  5a-d  and  5e-f  are  the  same,  and 
look  different  because  of  scaling  by  the  drawing  package 
used).  In  practice  we  envision  using  a  larger  number  of 
invariants. 

3.3  Experiments 

Experiments  were  run  on  two  and  three  dimensional  data 
to  test  the  fitting  algorithm.  The  minimization  scheme 
to  solve  the  non-linear  optimization  problem  was  Pow¬ 
ell’s  method  [15].  Implementation  was  quite  simple,  and 
the  programs  for  fitting  2D  and  3D  data  consisted  of 
about  500  and  800  lines  in  C  respectively,  which  were 
compiled  and  executed  on  a  SPARC  2.  A  nice  feature 
of  the  algorithm  is  that  no  initial  guess  is  needed  -  all 
iterations  started  with  the  zero  polynomial. 

The  first  example  consists  of  175  points  in  the  shape  of 
a  packman.  The  first  three  iterations  are  shown  in  Figure 
6  super-imposed  on  the  data.  After  the  third  iteration, 
there  is  no  significant  improvement  in  the  error.  The 
error  is  larger  than  in  the  other  examples,  but  that  is 
because  of  the  very  rough  nature  of  the  data  points. 

The  second  example  consists  of  442  points  that  lie  on 
a  sphere  of  radius  50  around  the  origin  and  442  points 
on  a  sphere  of  radius  100  around  the  origin.  Since  the 
union  of  the  spheres  can  be  exactly  described  by  an  im¬ 
plicit  polynomial  of  degree  four,  the  resulting  error  is 
very  small. 

The third example (super quadric) consists of 441
points on the surface of a super quadric parametrized
as follows:

X  =  50  cos° '*((^)  cos°  ®(0),  y  —  70  cos**  "*(<;i)  sin’’ ®(<?), 
z  =  100sin°'’(<^) 

Results  are  given  in  the  table  with  the  1,2,3  on  the  left 
corresponding  to  the  number  of  the  iteration.  Time  is 
in  seconds.  The  error  is  the  average  distance  of  the  data 
points  from  the  fitted  surface.  Compared  to  the  size  of 
the  objects,  the  errors  are  small. 
3.4 Eigenvalues as Invariants

In [19, 21] many new computationally efficient invariants
are presented. We briefly give some of the underlying
ideas behind the theory. The following is well known.

The quadratic form, i.e., 2nd degree terms, in two variables
can be expressed in matrix form. Denoting the 2x2 matrix
of coefficients as $\Phi_{[1,1]}$, we can write
$\phi(x) = \frac{1}{2} x^t \Phi_{[1,1]} x$. Now if $x' = Ax$ is a linear
transformation, the transformed polynomial, $\phi'(x')$, satisfies

$\phi'(x') = \phi(A^{-1}x') = \frac{1}{2} x'^t \Phi'_{[1,1]} x'$;

hence, $\Phi'_{[1,1]} = A^{-t} \Phi_{[1,1]} A^{-1}$, which reduces to
$\Phi'_{[1,1]} = A \Phi_{[1,1]} A^t$ if A is orthogonal, i.e., a rotation. Since the
determinant $|A|$ is 1 for A orthogonal, we see that $|\Phi_{[1,1]}|$
is invariant to object rotation. Furthermore, it is easily
seen that the coefficients of the characteristic polynomial
$\chi(\lambda) = \det(\lambda I - \Phi_{[1,1]})$ are orthogonal invariants
of the form $\phi$. Equivalently, the two eigenvalues of
the matrix, the roots of $\chi$, are orthogonal invariants.
For A nonsingular we have $|\Phi'_{[1,1]}| = |A|^{-2} |\Phi_{[1,1]}|$. Let
$\phi(x) = \frac{1}{2} x^t \Phi_{[1,1]} x$ and $\psi(x) = \frac{1}{2} x^t \Psi_{[1,1]} x$ be two
nonsingular quadratic forms in three variables, x, y, z. It is
well known that $\mathrm{trace}(\Phi_{[1,1]} \Psi_{[1,1]}^{-1})$ and $\det(\Phi_{[1,1]} \Psi_{[1,1]}^{-1})$
are joint rational (ratio of two polynomials) linear invariants
of the two forms. In fact, these two invariants
are just two of the coefficients of the characteristic polynomial
$\chi(\lambda) = \det(\lambda I - \Phi_{[1,1]} \Psi_{[1,1]}^{-1})$, and all the
coefficients of $\chi(\lambda)$ are joint linear invariants of the two
forms. Equivalently, the three eigenvalues of the matrix
$\Phi_{[1,1]} \Psi_{[1,1]}^{-1}$ are joint invariants of the two forms. That
is, the eigenvalues are not only invariant to rotations of
the 3D or 2D object, but also to arbitrary stretchings in
two arbitrary directions. Finally, this construction can
be extended to all the pairs of quadratic forms, not only
the nonsingular ones: instead of considering the eigenvalues
of $\Phi_{[1,1]} \Psi_{[1,1]}^{-1}$, consider the generalized eigenvalues
of the pair of matrices, the values of $(\lambda, \nu) \neq 0$ such
that $\det(\lambda \Psi_{[1,1]} - \nu \Phi_{[1,1]}) = 0$.
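
As a small numerical illustration of the joint invariants described above, the sketch below checks that the trace and determinant of the product of one coefficient matrix with the inverse of the other are unchanged when both forms undergo the same nonsingular linear transformation. It is only a sketch under stated assumptions: it uses 2x2 matrices (forms in two variables) for brevity, the transformation rule is taken to be M' = A^{-t} M A^{-1}, and all function names and numeric values are hypothetical.

```python
def matmul(a, b):
    """Product of two 2x2 matrices given as nested lists."""
    return [[sum(a[i][k] * b[k][j] for k in range(2)) for j in range(2)]
            for i in range(2)]

def transpose(a):
    return [[a[j][i] for j in range(2)] for i in range(2)]

def det2(a):
    return a[0][0] * a[1][1] - a[0][1] * a[1][0]

def inv2(a):
    d = det2(a)
    return [[a[1][1] / d, -a[0][1] / d], [-a[1][0] / d, a[0][0] / d]]

def transform_form(m, A):
    """Coefficient matrix after x' = A x: M' = A^{-t} M A^{-1}."""
    A_inv = inv2(A)
    return matmul(matmul(transpose(A_inv), m), A_inv)

def joint_invariants(phi, psi):
    """Trace and determinant of Phi * Psi^{-1}."""
    m = matmul(phi, inv2(psi))
    return (m[0][0] + m[1][1], det2(m))

# Illustrative symmetric coefficient matrices and a nonsingular A.
phi = [[2.0, 1.0], [1.0, 3.0]]
psi = [[4.0, -1.0], [-1.0, 2.0]]
A = [[1.5, 0.7], [-0.2, 2.0]]

before = joint_invariants(phi, psi)
after = joint_invariants(transform_form(phi, A), transform_form(psi, A))
```

The two tuples agree up to round-off because the transformed product is similar to the original one, so its characteristic polynomial is unchanged.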

In [19, 21] Taubin generalizes the preceding constructions
to higher degrees. That is, he shows how to compute
orthogonal and linear invariants of one form, or of
two or more forms of different degrees jointly, by reducing
the problem to the computation of eigenvalues or
generalized eigenvalues of certain matrices of coefficients.
Thus, the invariants can be designed to be functions of
the coefficients of all the terms in a polynomial or of just
those terms of one or a few degrees. Taubin's development
exploits the following known results.

The set of monomials $\{ x^\alpha / \sqrt{\alpha!} : |\alpha| = d \}$ of degree
d, lexicographically ordered, defines a vector of dimension
$h_d$, which we will denote $X_{[d]}(x)$. For example,

$X_{[3]}(x_1, x_2) = \left( \frac{x_1^3}{\sqrt{6}}, \frac{x_1^2 x_2}{\sqrt{2}}, \frac{x_1 x_2^2}{\sqrt{2}}, \frac{x_2^3}{\sqrt{6}} \right)^t .$

Hence, $X_{[3]}(x_1, x_2)$ is the vector of monomials of degree
3 for points $x_1, x_2$ in the plane. Next, any form of degree
d = j + k, i.e., polynomial where all the terms have degree
d resulting from the products of monomials of degrees j
and k, can be expressed as

$\phi(x) = X_{[j]}(x)^t \, \Phi_{[j,k]} \, X_{[k]}(x) .$

Finally, if $x' = Ax$ is a nonsingular linear transformation,
the resulting polynomial is $X_{[j]}(x')^t \, \Phi'_{[j,k]} \, X_{[k]}(x')$,
with

$\Phi'_{[j,k]} = A_{[j]}^{-t} \, \Phi_{[j,k]} \, A_{[k]}^{-1} ,$

where the elements of $A_{[j]}$ and $A_{[k]}$ depend only on A.
Using this result, Taubin develops a rich range of easily
computable invariants for polynomial curves and surfaces
of arbitrary degrees.

4  Asymptotic  Parameter  Distributions, 
Mahalanobis  Distances,  And 
Bayesian  Recognition 

This  section  addresses  the  problem  of  variability  in  the 
polynomial  coefficients  with  small  changes  in  the  data 
set  by  formulating  it  within  a  probabilistic  framework. 
Constraining  the  polynomials  to  have  bounded  zero  sets 
most  often  reduces  the  variability  considerably.  How¬ 
ever,  in  some  cases,  the  coefficients  may  still  be  sensi¬ 
tive  to  small  changes  in  the  data  set.  Also,  unbounded 
polynomials  are  often  more  appropriate  for  representing 
portions  of  or  all  of  an  object  as  in  the  case  with  the 
handprinted  characters  shown  in  Figures  7(a)-(e).  If  the 
polynomial  coefficients  vary  considerably,  so  will  the  in¬ 
variants  that  are  functions  of  these  coefficients,  thus  giv¬ 
ing  unreliable  recognition  results.  Thus,  the  first  prob¬ 
lem  is  to  get  an  estimate  of  the  variance  of  the  poly¬ 
nomial  coefficients.  The  second  problem  is  to  design  a 
metric  based  on  the  polynomial  coefficients  for  compar¬ 
ing  two  polynomial  zero  sets  over  the  region  where  the 
data  exists. 

The input data here is a sequence of range data points,
$Z^N = \{Z_1, Z_2, \ldots, Z_N\}$, with $Z_i = (x_i, y_i, z_i)^t$.
Let $\alpha$ denote the vector of coefficients of the polynomial
$f(x, y, z)$ that describes the given object. We assume
that the range data points $Z_1, Z_2, \ldots, Z_N$ are statistically
independent, with $Z_i$ having probability density
function (pdf)

$p(Z_i \mid \alpha) = \frac{1}{(2\pi\sigma^2)^{1/2}} \exp\left\{ -\frac{1}{2\sigma^2} \frac{f^2(Z_i)}{\|\nabla f(Z_i)\|^2} \right\}$ (3)

The assumption is that $Z_i$ is a noisy Gaussian measurement
of the object boundary in the direction perpendicular
to the boundary at its closest point. This model is
introduced and discussed in [3, 18].

Thus, the joint probability of the data points is

$p(Z^N \mid \alpha) = \frac{1}{(2\pi\sigma^2)^{N/2}} \exp\left\{ -\frac{1}{2\sigma^2} \sum_{i=1}^{N} \frac{f^2(Z_i)}{\|\nabla f(Z_i)\|^2} \right\}$ (4)

The maximum likelihood estimate $\hat{\alpha}_N$ of $\alpha$ given the
data points is the value of $\alpha$ that maximizes (4).
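
To make the likelihood concrete, the following sketch evaluates the negative logarithm of a joint density of this type, in which each point contributes the squared ratio of the polynomial value to its gradient magnitude (a first-order approximation of the squared distance to the zero set). It is a hedged illustration rather than the fitting code used in the experiments: the 2D circle model, the function names, the noise level and the sample points are all assumptions chosen for brevity.

```python
import math

def neg_log_likelihood(points, f, grad_f, sigma):
    """-log p(Z^N | alpha) for the approximate-distance Gaussian model.

    f and grad_f are callables for the implicit polynomial and its
    gradient; the gradient must not vanish at any data point.
    """
    n = len(points)
    total = 0.0
    for p in points:
        g = grad_f(p)
        total += f(p) ** 2 / sum(gi * gi for gi in g)
    return 0.5 * n * math.log(2.0 * math.pi * sigma ** 2) + total / (2.0 * sigma ** 2)

def make_circle(r):
    """Implicit circle x^2 + y^2 - r^2 = 0 and its gradient (toy model)."""
    f = lambda p: p[0] ** 2 + p[1] ** 2 - r ** 2
    grad_f = lambda p: (2.0 * p[0], 2.0 * p[1])
    return f, grad_f

# Synthetic points near a circle of radius 10, alternately 0.1 outside/inside.
points = []
for k in range(40):
    rho = 10.0 + 0.1 * (-1) ** k
    angle = 2.0 * math.pi * k / 40
    points.append((rho * math.cos(angle), rho * math.sin(angle)))

nll_good = neg_log_likelihood(points, *make_circle(10.0), sigma=0.1)
nll_bad = neg_log_likelihood(points, *make_circle(12.0), sigma=0.1)
```

The model with radius 10 yields the smaller value, so it would be preferred by a maximum likelihood criterion; a full fit would minimize this quantity over the coefficient vector.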

A very useful tool for solving the problems of object
recognition and parameter estimation is an asymptotic
approximation to the joint likelihood function, (4), which
can be shown to have a Gaussian shape in $\alpha$ [3], i.e.,

$p(Z^N \mid \alpha) \approx p(Z^N \mid \hat{\alpha}_N) \exp\left\{ -\frac{1}{2} (\alpha - \hat{\alpha}_N)^t \Psi_N (\alpha - \hat{\alpha}_N) \right\}$ (5)

where $\Psi_N$ is the second derivative matrix having $i,j$th
component $-\frac{\partial^2}{\partial \alpha_i \partial \alpha_j} \ln p(Z^N \mid \alpha)$, evaluated at $\alpha = \hat{\alpha}_N$. Hence, all
the useful information about $\alpha$ is summarized in the
quadratic form in the exponent of equation (5). If $\Psi_N$
is not singular, then it is the inverse covariance matrix
of $\hat{\alpha}_N$. $\Psi_N$ is called the uncertainty matrix of $\hat{\alpha}_N$.

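The a posteriori distribution of $\alpha$ given the data, i.e.,
$p(\alpha \mid Z^N)$, is proportional to $p(Z^N \mid \alpha) p(\alpha)$. This can
be written using the asymptotic approximation as

$\text{constant} \times p(Z^N \mid \hat{\alpha}_N) \exp\left[ -\frac{1}{2} (\alpha - \hat{\alpha}_N)^t \Psi_N (\alpha - \hat{\alpha}_N) \right] p(\alpha)$ (6)

where $p(\alpha)$ is a prior distribution for $\alpha$.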

This distribution addresses the first problem because
it tells us about the uncertainty in the polynomial coefficients
given the data points. The uncertainty matrix
$\Psi_N$ defines an ellipsoid around $\hat{\alpha}_N$ in the d-dimensional
coefficient space. The axes of this ellipsoid are the directions
of the eigenvectors of the uncertainty matrix,
and the lengths of the axes are inversely proportional to
the square roots of the eigenvalues. The volume of this ellipsoid gives a
measure of the uncertainty in the parameter estimates.
If the volume is large, then it implies that the coefficients
are not reliable. If the coefficients are not reliable,
neither will be the invariants that are functions of these
coefficients. Then, instead of using the existing measurements
to recognize the object, the system can collect
more data in order to improve the parameter estimates
(i.e., reduce the uncertainty volume). Details of how
to collect more data in order to reduce the uncertainty
volume as quickly as possible are given in [18].

4.1  Mahalanobis  Distance  as  a  Comparison 
Measure  for  Polynomial  Zero  Sets 

The scenario that we consider here is one where we have
a set of objects labeled $l = 1, 2, \ldots, L$ in the database.
Each may be a polynomial of different degree n in x, y
and z. Let $\alpha^l$ be the parameter vector for object $l$.
The optimum recognition rule is: 'choose $l$ for which
$p_l(Z^N \mid \alpha^l)$ is maximum'. This, however, requires considerable
computation because the data is used L times
to compute $p_l(Z^N \mid \alpha^l)$ for $l = 1, 2, \ldots, L$. However, if
all the polynomials are of the same degree, considerable
simplification results.

If all the objects are of the same degree, $p_l(Z^N \mid \alpha^l)$
will have the same form, $p(Z^N \mid \alpha^l)$, for all $l$, but will
differ in the values of the parameter vector, $\alpha^l$.

Using the asymptotic approximation, (5), we see that
since $p(Z^N \mid \hat{\alpha}_N)$ is independent of $l$, an approximately
equivalent recognition rule is: choose $l$ for which (7) is minimum

$(\alpha^l - \hat{\alpha}_N)^t \Psi_N (\alpha^l - \hat{\alpha}_N)$ (7)

The advantage in using (7) is that the data is involved
just once (not L times) to compute the uncertainty matrix.
Note that (7) is a Mahalanobis distance measure.
The coefficient vector of the best fitting polynomial to
the data is compared with the coefficient vector of each
of the stored polynomials. To reiterate, the justification
for using this distance measure is its equivalence to
checking how well the data set $Z_1, Z_2, \ldots, Z_N$ is fit by
the polynomial having coefficient vector $\alpha^l$.
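
The recognition rule based on (7) can be sketched in a few lines. This is an illustrative sketch only: the two-dimensional coefficient vectors, the diagonal uncertainty matrix, the object names and the function names are all hypothetical, chosen so that one coefficient is well constrained by the data (large eigenvalue) and the other is poorly constrained (small eigenvalue).

```python
def mahalanobis_sq(alpha_l, alpha_hat, psi):
    """Quadratic form (alpha_l - alpha_hat)^t Psi (alpha_l - alpha_hat)."""
    d = [a - b for a, b in zip(alpha_l, alpha_hat)]
    n = len(d)
    return sum(d[i] * psi[i][j] * d[j] for i in range(n) for j in range(n))

def recognize(database, alpha_hat, psi):
    """Return the label whose stored coefficients minimize the distance."""
    return min(database, key=lambda label: mahalanobis_sq(database[label], alpha_hat, psi))

# Hypothetical estimate and uncertainty matrix: the first coefficient is
# tightly constrained by the data, the second is nearly unconstrained.
alpha_hat = [1.0, 5.0]
psi = [[100.0, 0.0], [0.0, 0.01]]

database = {
    "A": [1.05, 9.0],   # close in the reliable direction, far in the unreliable one
    "B": [1.60, 5.1],   # far in the reliable direction, close in the unreliable one
}

best = recognize(database, alpha_hat, psi)
dist_a = mahalanobis_sq(database["A"], alpha_hat, psi)
dist_b = mahalanobis_sq(database["B"], alpha_hat, psi)
```

Here object "A" is selected even though "B" is closer in ordinary Euclidean distance, because the disagreement of "B" lies along the direction in which the data constrain the coefficients tightly.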

An  explanation  for  why  the  Mahalanobis  distance  is 
the  appropriate  metric  for  comparing  polynomial  zero 
sets  is  as  follows.  Assume  that  a  polynomial  of  degree 
n  is  required  to  get  a  good  fit  to  a  bounded  subset  of 
data  in  the  x-y  plane.  Since  the  data  set  is  limited, 
it  may  not  completely  constrain  the  polynomial  coef¬ 
ficients,  thus  permitting  the  polynomial  coefficients  to 
change  greatly  with  small  changes  in  the  data  set.  Thus, 
even  though  two  data  sets  may  be  very  similar,  the  poly¬ 
nomials  that  fit  them  may  look  quite  different  far  away 
from  the  data  sets.  In  these  circumstances,  the  coeffi¬ 
cients  of  the  polynomials  that  fit  the  data  set  and  small 
variations  of  it,  form  a  cloud  of  points  in  the  coefficient 
space  that  is  wide  in  some  directions  and  narrow  in  oth¬ 
ers.  A  large  variation  therefore  in  the  coefficients  along 
the  direction  where  the  cloud  is  wide  does  not  change  the 
probability  of  the  data  very  much.  Thus,  the  direction 
of  large  cloud  width  is  not  very  useful  for  comparing  two 
polynomial  zero  sets.  We,  therefore,  need  a  metric  for 
comparing  two  polynomial  zero  sets  that  weights  direc¬ 
tions  of  large  cloud  width  much  less  than  directions  of 
small  cloud  width.  The  Mahalanobis  distance  is  such  a 
metric. 

A direction of wide cloud width corresponds to an
eigenvector of the uncertainty matrix for which the
eigenvalue is small, and a direction of small cloud width
corresponds to an eigenvector of the uncertainty matrix
for which the eigenvalue is large. Thus, the Mahalanobis
distance weights the directions of large cloud width much
less than the directions of small cloud width. In fact, the
Mahalanobis distance effectively uses a subspace of the
coefficient space that is spanned by those eigenvectors of
the uncertainty matrix corresponding to the eigenvalues
that are larger than ε, where ε is small. Thus, the Mahalanobis
distance is a metric that is useful for comparing
two polynomial zero sets in the vicinity of the data set,
based on the coefficient vectors.

The scenario discussed here is one where each object
in the database is characterized by one value of the parameter
vector, $\alpha^l$. A more general scenario is where
each object has a parameter vector $\alpha^l$ that is characterized
by a distribution, $p(\alpha^l)$; for example, a sphere with
fixed center and radius that can take a range of values.
[18] deals with computing a computationally attractive
recognition rule for this general case.

4.2  Experimental  Results 

The set of results shown illustrates the use of the Mahalanobis
distance for comparing polynomial zero sets.
The data sets here correspond to handwritten characters
and are all well fit by third degree polynomials. These
are examples where unbounded polynomials are appropriate,
because forcing the zero sets to be closed would
require polynomials of higher degree, and in addition
would make them look more like some of the other letters.
The data sets in figures 7(a) through 7(e) are the
objects in the database. These data sets are the letters
'e', 's', 't', 'r' and 'y'. Also shown in these figures are the
best third degree polynomial fits to the data. Figure 8
corresponds to another instance of the handwritten character
'r' that looks very different from the one in
the database. The Mahalanobis distances (7) from the coefficient
vector for the best polynomial fit to this data set to
the coefficient vectors for the best polynomial fits to the
letters 'e', 's', 't', 'r' and 'y' in the database are 1.0e+5,
1.6e+5, 2.8e+4, 1.9e+3 and 8.8e+5 respectively. For
computing the Mahalanobis distance, we scale all the
data sets so that they all lie within a rectangle of the
same dimension. Thus, all the data sets are compared
on the same scale. From the results one can see that the
distance is a minimum for the letter 'r'. Of course, an
'r' like the one in the database would produce a much
smaller distance.

The experiment, run on a SPARC 2, runs in close
to real time. The polynomial fitting runs in real time.
The computation of the second derivative matrix is fast
because, due to the structure of the likelihood function,
one can compute the second derivatives using a set of
first derivatives. Thus, the Mahalanobis distance can be
used to compare polynomial zero sets in real time.
Abstract

A  set  of  modules  to  extract  partial  descriptions 
of  SHGC  objects  in  an  edge  image  is  presented. 

It  consists  of  modules  to  find  end  edges,  to  find 
meridian  edges,  to  find  cross-section  edges,  and 
to  recover  3D  shapes.  The  first  goal  of  the 
system  is  to  extract  geometrical  edges  derived 
from  an  SHGC  object  and  the  second  goal  is 
to  recover  3D  information  of  the  object.  From 
an  input  edge  image,  pairs  of  end  edges  are  de¬ 
tected  first  by  verifying  strong  geometrical  con¬ 
straints  for  the  ends  of  an  SHGC,  then  meridian 
edges  are  detected  by  using  the  constraint  for 
tangent  intersections  and  the  ones  related  to 
the  end  edges.  Further,  the  axis  of  SHGC  and 
the  axes  of  skewed  symmetry  in  cross-section 
edges  are  detected.  And  finally  original  cross- 
section  and  the  sweeping  rule  are  recovered 
by  utilizing  these  three  orthogonal  axes.  Ex¬ 
tracted  geometrical  edges  and  3D  information 
from  real  images  are  shown. 

1  Introduction 

Though the recovery of 3D shapes from an image is in
general underconstrained, it is simplified when objects
belong to a generic object class. The class of generalized
cylinders [Binford 71] [Agin and Binford 73] is a popular
representation scheme for curved objects, and the class
of straight homogeneous generalized cylinders (SHGCs)
is a subset of generalized cylinders where the objects are
obtained by sweeping an arbitrary cross-section with arbitrary
scaling along a straight axis [Shafer and Kanade
83] [Ponce et al. 89] (figure 1).

In this paper, we address the problem of extracting partial
descriptions of SHGC objects from a real image by
constraint-based edge grouping, i.e., by using geometrical
constraints derived from restrictions in shape models.
We have developed a set of modules called BUILDER-I
(Bottom-Up Image Level Description ExtractoR - version
I) which consists of a module to find end edges
of an SHGC, one to find meridian edges, one to find
cross-section edges, and one to recover 3D shapes from
the meridian and cross-section edges. The importance
of finding meridians and ends is two-fold: in the context
of model-based vision systems, it makes hypothesis
generation for objects in the image feasible by detecting
a limited number of edge sets for an object part; it also
makes methodologies developed in this field to recover
3D shapes of generalized cylinders applicable to
real images.

Though researchers showed that the 3D shape of an
SHGC is recoverable from its cross-section and limbs or
meridians with a few degrees of freedom, and that various
useful invariants exist for the contours of an SHGC
[Rao and Medioni 88], little work has been done on
finding geometrical edges in a real image, apart from finding
some types of ribbons or symmetries, e.g., [Brady and
Asada 84], [Ponce 90], or [Sumanaweera et al. 88].

Mohan  and  Nevatia  proposed  a  method  of  perceptual 
organization  for  segmentation  and  description  from  in¬ 
tensity  edge  contours.  It  performs  a  similar  type  of  task 
as  BUILDER-I  not  by  applying  geometrical  constraints 
but  by  using  a  neural  network  for  grouping  consistent 
symmetries  [Mohan  and  Nevatia  89]. 

Saint-Marc and Medioni proposed a method to find
skewed and parallel symmetry in a real image [Saint-Marc
and Medioni 90]. Parallel symmetry is closely related
to the ends of SHGC. However, it is expected that
their method is not suitable for extracting the end edges
because parallel symmetry is a more general relation of
curves than the ends of SHGC and it lacks some of
the geometric constraints required for the ends of an
SHGC.

Ponce et al. showed algorithms to detect axes of
SHGCs in an edge image [Ponce et al. 89]. The same
approach as ours was used in their work, i.e., clarifying
the geometrical constraints derived from the shape restrictions
on SHGCs mathematically, then utilizing them
to develop algorithms to extract partial descriptions of
SHGC objects. However, the algorithms have non-negligible
restrictions on object shapes: an implicit restriction
in the first algorithm, derived from the Hough transform
[Hough 62], is that the object needs to have a sufficient
number of edgels for grouping, i.e., long limbs or many
meridians; the restriction in the second algorithm is the
number of zero-curvature points in the limbs, as the authors
noted.

We developed algorithms to find the ends of SHGC
in an edge image since strong geometrical constraints
exist for them [Sato and Binford 92]. Though they assume
additional restrictions on object shapes or a priori
information to reduce computing time, the method itself
is applicable to the case with no a priori information and no
additional restrictions. The algorithms showed the
ability to find pairs of end edges with a limited number
of false positives. It is expected that detection
of meridian edges can be made more reliable and efficient by
combining the information from the end edges.

Figure 1: Straight homogeneous generalized cylinders.

The recovery of 3D shape of generalized cylinders has
been tackled. To fix the remaining freedom, certain additional
constraints, e.g., constraints from shading or
heuristic ones like orthogonality constraints, are introduced
[Gross and Boult 90] [Ulupinar and Nevatia 90].
In this paper, additional explicit restrictions on object
shapes, i.e., RSHGC with symmetric cross-section, are
introduced. The method is applicable only to a subset of SHGC
objects; however, the geometrical constraints derived from
these restrictions make the algorithm simple and efficient with only
the assumption of skewed symmetry. The rest of the
paper is organized as follows:

In section 2, an overview of the BUILDER-I system is presented.
In section 3, the algorithms to find the ends
of SHGCs are summarized briefly. In section 4, the algorithms
to find meridian edges and cross-section edges
are presented. Geometrical constraints used in the algorithms
and techniques to utilize them are described. In
section 5, a method to recover 3D shapes of an SHGC
is presented. The geometrical relation between the projections
of orthogonal axes and viewing transform is clarified
and a simple method to find axes of skewed symmetry
is described. Experimental results from real images
are shown throughout the paper.

2  Overview  of  BUILDER-I  system 

The  task  of  BUILDER-I  is  the  extraction  of  partial  de¬ 
scriptions  for  objects  in  an  image.  BUILDER-I  is  de¬ 
signed  to  be  a  subsystem  in  a  flexible  model-based  vision 
system,  such  as  ACRONYM  [Brooks  81]  and  SUCCES¬ 
SOR  [Binford  et  al.  87],  which  combines  bottom-up  and 
top-down  processing.  The  descriptions  from  BUILDER- 
I  are  supposed  to  be  used  by  top-down  modules  to  index 
appropriate  models  and  to  predict  the  appearance  of  ob¬ 
jects  in  images. 

The input to BUILDER-I is an edge image in which
the position and orientation of edgels are measured with
reasonable accuracy and the edgels are linked / divided according to
their continuity. BUILDER-I consists of four modules:
one for finding the ends of SHGCs, named end-finder;
one for finding meridian edges, meridian-finder; one for
finding cross-section edges, cross-section-finder; and one
for recovering sweeping rules and original cross-sections,
3D-recoverer. The configuration of the BUILDER-I system is
shown in figure 2. Though the initial step to extract
visual clues from an image can be either finding ends
or finding limbs / meridians [Rao and Nevatia 87], the
former is selected because strong geometrical constraints
exist for the ends of an SHGC. Currently, the system is
not fully automatic: an edge for focus of attention, actually
one of the end edges, has to be indicated manually
in order to reduce computing time (see section 3).

Figure 2: BUILDER-I system configuration.

The  behavior  of  the  system  is  as  follows:  first,  the  ends 
of  an  SHGC  are  detected  by  end-finder  from  the  input 
edge  image  with  a  given  reference  edge;  then,  merid¬ 
ian  edges  are  detected  by  meridian-finder  using  the  ends 
detected  in  the  first  step;  also,  cross-section  edges  cor¬ 
responding  to  the  ends  are  detected  by  cross-section- 
finder;  finally,  if  the  meridian  and  cross-section  edges 
are  found  and  if  the  object  shape  can  be  assumed  to 
be  a  RSHGC  with  its  axis  passing  through  the  center 
of  its  cross-section,  3D-recoverer  is  activated,  and  if  the 
cross-section  has  the  axes  of  skewed  symmetry,  the  orig¬ 
inal  cross-section  and  the  sweeping  rule  are  recovered. 
The  results  of  the  modules,  i.e.,  the  ends,  the  meridian 
edges  and  their  axis,  the  cross-section  edges,  the  sweep¬ 
ing  rule,  the  original  cross-section,  and  the  viewing  an¬ 
gles  are  combined  to  output. 

3  Algorithms  to  find  the  ends  of  an 
SHGC 

In this section, we briefly summarize the algorithms to
find the ends of an SHGC because it is an important
module in BUILDER-I. A more precise description of the
algorithms can be found in [Sato and Binford 92].

In this paper, we use terms such as SHGC,
RSHGC, LSHGC, meridian, parallel, and limb according
to [Shafer and Kanade 83] and [Ponce et al. 89].
Since the primary task of BUILDER-I is 2D image analysis,
those terms are often used for the projections of
objects to 2D images.

Figure 3: Origin of scaling and comeridian edgel pair.
Corresponding comeridian edgel pairs are shown with
the lines connecting each edgel pair.

Figure 4: Tangent intersection and extension intersection.

The term cross-section edges is used for edges at the
intersections of a side face and end faces, i.e., end parallels.
On the other hand, original cross-section is used
for a cross-section function which is used to generate
the object with a sweeping rule. Ends of an SHGC,
end edges, or ends, are used for a part of cross-section
edges which bounds a visible part of the side face. Co-meridian
(respectively co-parallel) edgels or co-meridian
edgel pair is used for a pair of edgels in the ends (meridians)
which lie in a meridian (parallel) (figure 3). Corresponding
co-meridian (respectively corresponding co-parallel)
edgel pairs is used for co-meridian edgel pairs
whose edgels lie in the same pair of parallels (meridians).
Tangent intersection is used for a point at which
tangent lines to limbs or meridians at each edgel in a
co-parallel edgel pair intersect. Extension intersection is
used for an intersection point made by a pair of extension
lines which pass through each edgel pair in corresponding
co-meridian edgel pairs (figure 4). Origin of scaling
is used for a point which is the extension intersection for
the ends of an SHGC. We use the term origin of scaling
rather than the term apex of shape because apex of
shape seems to be used for a virtual apex of an LSHGC
in [Shafer and Kanade 83]. Scaling ratio is used for the
ratio of scaling factors for a pair of parallels. The imaging
process is assumed to be orthographic projection in
this paper.

Figure 5: Projection and position of origin of scaling.

The  constraints  used  to  detect  the  ends  of  an  SHGC 
are  as  follows: 

End-constraint (i) The edgels which form a co-meridian
edgel pair have the same orientation.

End-constraint (ii) The extension intersections for
corresponding co-meridian edgel pairs coincide at
a point in the axis.

End-constraint (iii) The scaling ratio for each edgel
pair in corresponding co-meridian edgel pairs is the
same.

End-constraint (iv) The vector between each edgel
pair in corresponding co-meridian edgel pairs is the
same when its scaling ratio is equal to 1.
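
The first three constraints can be illustrated on synthetic data. The sketch below is not the edgel-based algorithm of [Sato and Binford 92]; it is a simplified check under stated assumptions: each co-meridian edgel pair is given as (position on one end, position on the other end, orientation of each edgel), orientations are compared modulo π, the scaling ratio is measured as a ratio of distances from the estimated origin of scaling, and the parallel case of End-constraint (iv) is not handled. All names and tolerances are hypothetical.

```python
import math

def line_intersection(p1, q1, p2, q2):
    """Intersection of line p1-q1 with line p2-q2, or None if parallel."""
    d1 = (q1[0] - p1[0], q1[1] - p1[1])
    d2 = (q2[0] - p2[0], q2[1] - p2[1])
    denom = d1[0] * d2[1] - d1[1] * d2[0]
    if abs(denom) < 1e-12:
        return None
    t = ((p2[0] - p1[0]) * d2[1] - (p2[1] - p1[1]) * d2[0]) / denom
    return (p1[0] + t * d1[0], p1[1] + t * d1[1])

def satisfies_end_constraints(pairs, tol=1e-6):
    """Check constraints (i)-(iii) for a list of (p, q, theta_p, theta_q)."""
    # (i) both edgels of every pair share the same orientation (mod pi).
    for _, _, tp, tq in pairs:
        diff = abs(tp - tq) % math.pi
        if min(diff, math.pi - diff) > tol:
            return False
    # (ii) all extension lines meet in one point, the origin of scaling.
    p0, q0 = pairs[0][0], pairs[0][1]
    origin = None
    for p, q, _, _ in pairs[1:]:
        x = line_intersection(p0, q0, p, q)
        if x is None:
            return False  # parallel extension lines: case (iv), not covered here
        if origin is None:
            origin = x
        elif math.dist(origin, x) > tol:
            return False
    # (iii) the scaling ratio is the same for every pair.
    ratios = [math.dist(origin, q) / math.dist(origin, p) for p, q, _, _ in pairs]
    return max(ratios) - min(ratios) <= tol

# Synthetic ends: the second end is the first scaled by 0.5 about (0, 10).
origin_true, scale = (0.0, 10.0), 0.5
pairs = []
for t in (0.3, 1.0, 2.0):
    p = (3.0 * math.cos(t), math.sin(t))
    q = (origin_true[0] + scale * (p[0] - origin_true[0]),
         origin_true[1] + scale * (p[1] - origin_true[1]))
    theta = math.atan2(math.cos(t), -3.0 * math.sin(t))  # tangent direction
    pairs.append((p, q, theta, theta))

consistent = satisfies_end_constraints(pairs)
# Displacing one edgel breaks the common intersection point.
broken = pairs[:2] + [(pairs[2][0], (pairs[2][1][0] + 0.5, pairs[2][1][1]),
                       pairs[2][2], pairs[2][3])]
inconsistent = satisfies_end_constraints(broken)
```

With real edgels the tolerances would have to reflect measurement noise, which is why the paper's method discretizes projection directions and candidate positions instead of testing exact coincidence.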

Because reliable and efficient pre-processing for
curve fitting or selection of feature points is another difficult
problem, the algorithm shown below is a simple
edgel-based method, i.e., it selects and groups pairs of
edgels which satisfy the above constraints. Without accurate
measurement of curvature at edgels, it is necessary
to hypothesize the position of the origin of scaling. Thus,
the computational complexity of the algorithm becomes
O(nkdc) by using the projection technique in [Nevatia and
Binford 77], where n is the number of edgels, k is the
number of discretized projection directions, d is the number
of hypotheses for an edgel pair, and c is the number of
partially parallel edges in an image. Unfortunately, this
would require a very long computing time, so we introduced
additional constraints to avoid it. The first algorithm
is for cylindrical ends, which means the scaling factors
at the ends are the same. The second algorithm is for
any SHGC. However, a modified version is implemented:
given a reference end edge, it finds edges possibly paired
with it. The second algorithm is summarized as follows
(see also figure 5):
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1.  discretize  orientation  for  projection  direction,  dis¬ 
cretize  two-dimensional  space  for  origin  of  scaling, 
and  prepare  buckets  for  projection  and  cells  for  ori¬ 
gin  of  scaling, 

2.  group  the  edgels  according  to  their  orientation,  (the 
results  are  called  orientation-groups,) 

3.  for  each  projection  direction,  and  for  each 
orientation-group, 

(a)  project  each  edgel  in  the  group  into  a  bucket, 
(the  results  are  called  projection-orientation- 
groups,) 

(b)  calculate  extension  lines  in  the  projection  direc¬ 
tion  and  determine  the  cells  (r,^)  which  each 
extension  passes  through, 

(c)  put  the  projection-orientation-groups  to  the 
cells, 

4.  select  candidates  for  the  cell  position  for  origin  of 
scaling 

(more  precisely  in  [Sato  and  Binford  92]). 

5.  in  each  candidate  cell  (r,^), 

(a)  for  each  projection-orientation-groups  in  the 
cell,  generate  pairs  of  edgels  in  the  group  which 
consist  of  an  edgel  in  the  reference  edge  and  an¬ 
other  in  other  edges,  then  calculate  the  scaling 
ratio  for  each  pair, 

(b)  group  the  edgel-pairs  according  to  their  scaling- 
ratios  and  position  of  the  cell,  (the  results  are 
called  scaling-origin-groups,) 

6.  for  each  scaling-origin-group,  restore  the  edge-pairs 
corresponding  to  the  edgel-pairs  in  the  group. 

Figure  6  and  figure  7  show  the  extracted  end  pairs 
by  the  implemented  algorithms.  The  algorithms  detect 
most  of  the  possible  end  pairs  and  few  false  positives  be¬ 
cause  the  constraints  collected  along  an  edge  are  strong 
enough  to  exclude  most  false  positives.  Also,  except  for 
the  computing  time,  the  method  used  here  is  applicable 
in  the  case  with  no  additional  restriction  and  no  a  priori 
information  simply  by  giving  each  edge  as  the  reference 
edge.

4 Finding the meridian edges of an SHGC

4.1 Overview of the algorithm to find meridians

It is shown by Shafer and Kanade that the tangent lines
to meridians at each edgel in a co-parallel edgel pair
intersect at a point on the axis. Other researchers also
showed that the tangent lines to limbs for a co-parallel
edgel pair intersect at a point on the axis in the imaging
plane [Ponce et al. 89] [Ulupinar and Nevatia 90]. We
call those properties the constraint for tangent
intersections. The constraint for tangent intersections can
be used to find the axes of SHGCs, and its simplest
utilization is Ponce's O(n^d) algorithm.

Unfortunately, the constraint for tangent intersections
alone is too weak to prune candidates for co-parallel edgel
pairs directly, so that the implicit restriction derived
from the Hough transform was introduced. In this paper,
we also use the following constraints, which are related
to the junction points between the ends and meridians.

Constraint (i) All vectors between each edgel pair in
corresponding co-parallel edgel pairs are parallel
(figure 8).

Constraint (ii) For a pair of parallels, the extension
intersections are the same point on the axis.

The proofs for these properties are given in the appendix.

With one of the ends, co-parallel edgel pairs can be
determined directly by constraint (i). Still, the axes are
not constrained sufficiently, so that the algorithm might
find wrong correspondences (figure 9). By using end
pairs, those wrong correspondences are rejected, so that
we consider only the meridians that connect two ends in
the following algorithm.

Figure 9: False correspondence for meridians.

Another  problem  in  finding  meridians  is  that  the  con¬ 
straints  do  not  determine  if  a  single  edge  path  is  a 
meridian  or  not,  but  find  that  a  pair  of  edge  paths  or 
two  pairs  of  edge  paths  are  not  consistent  as  meridians. 
Inconsistent-sets  generated  by  applying  the  constraints 
are  used  to  find  groups  of  edge  paths  as  candidates  for 
consistent  meridians. 

The  algorithm  is  summarized  as  follows: 

1. find junctions between an end edge and other edges.

2. select pairs of junctions which could be the ends of
a meridian.

3. find edge paths which connect each pair of junctions.

4. repeat the following until meridians are found.

(a) select the edge path that connects the largest
number of parallels among the untested edge paths.

(b) select edge paths which connect the parallels
that terminate the edge path in step (a).

(c) for each pair of selected edge paths, calculate
their correspondence angle, corresponding
co-parallel edgels, scaling ratios, tangent
intersections, and extension intersections.

(d) verify whether the scaling ratios coincide for
two pairs of edge paths.

(e) verify whether the tangent intersections coincide
for two pairs of edge paths.

(f) generate consistent-sets from the inconsistent-sets
found in steps (c), (d), and (e).

(g) for each consistent-set, verify whether the tangent
and extension intersections lie on a line by
calculating a normalized least squares line.

Figure 6: (a): input edge image, (b)(c)(d): extracted end edges.

4.2  Techniques  to  find  meridian  edges 

4.2.1  Finding  junctions  between  the  ends  and 
other  edges 

To  find  the  junctions,  it  is  necessary  to  infer  missing 
parts  of  edges.  We  assume  that  extrapolations  of  edges 
can  be  used  to  approximate  the  missing  parts.  Values 
of  curvature  are  determined  by  fitting  a  linear  function 
to  measured  orientation  values  near  the  ends  of  edges. 
By assuming a constant curvature, the extrapolations of
edges are calculated. If two extrapolations come
sufficiently close, a junction is hypothesized.

Figure 10 shows hypothesized junctions. Filling the
missing parts by extrapolation usually gives a sufficient
approximation. However, if the change in curvature is not
small, e.g. around a limb, the error in the position and
orientation can be fairly large. This makes it impossible
to use very strict conditions for finding junctions, and the
algorithm may find many false junctions as well as the
right ones.
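
The extrapolation step can be sketched as follows, assuming that each edge end is summarized by a position, an orientation, and a curvature estimated as the slope of a linear fit to orientation against arc length. The function names, the sampling step, and the distance tolerance are hypothetical choices for illustration only.

```python
import math

def estimate_curvature(arclengths, orientations):
    """Slope of a least-squares line fitted to orientation versus arc length."""
    n = len(arclengths)
    ms = sum(arclengths) / n
    mp = sum(orientations) / n
    num = sum((s - ms) * (p - mp) for s, p in zip(arclengths, orientations))
    den = sum((s - ms) ** 2 for s in arclengths)
    return num / den

def extrapolate_arc(x, y, phi, kappa, s):
    """Follow a constant-curvature arc of length s from (x, y) heading phi."""
    if abs(kappa) < 1e-9:                      # straight-line limit
        return x + s * math.cos(phi), y + s * math.sin(phi), phi
    phi_end = phi + kappa * s
    x_end = x + (math.sin(phi_end) - math.sin(phi)) / kappa
    y_end = y - (math.cos(phi_end) - math.cos(phi)) / kappa
    return x_end, y_end, phi_end

def hypothesize_junction(end_a, end_b, max_len=5.0, step=0.1, tol=0.5):
    """Return a junction position if the two extrapolations come within tol.

    end_a and end_b are (x, y, phi, kappa) tuples describing edge ends.
    """
    n = int(round(max_len / step)) + 1
    pts_a = [extrapolate_arc(*end_a, i * step)[:2] for i in range(n)]
    pts_b = [extrapolate_arc(*end_b, i * step)[:2] for i in range(n)]
    dist, pa, pb = min(
        ((math.dist(a, b), a, b) for a in pts_a for b in pts_b),
        key=lambda c: c[0],
    )
    if dist > tol:
        return None
    return ((pa[0] + pb[0]) / 2.0, (pa[1] + pb[1]) / 2.0)
```

A loose tolerance accepts more junctions when the curvature changes quickly, at the cost of more false junctions, which is the trade-off discussed above.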

4.2.2 Finding corresponding junction pairs

In the next stage, pairs of junctions which could be
the ends of a meridian are selected. The constraints on a
junction pair are as follows:

• the orientations of the tangents to the end edges at the
ends of a meridian are the same.

• the extension of the line passing through the ends of a
meridian passes by the origin of scaling.

A polar coordinate system with non-linear scaling, as in
[Sato and Binford 92], is used to verify whether the
extension lines pass by the origin of scaling. Figure 11
shows the junction pairs extracted by the algorithm with the
lines which connect each junction pair.

Figure 7: Other examples of the results in end-finder. (a)(c): input edge images, (b)(d): extracted end edges.

4.2.3  Searching  edge  paths  between  a  junction 
pair 

Edge paths which connect a junction pair are detected
as candidates for meridians in the next stage. Edge links
are found by putting a rectangular box around the ends
of edges. A best-first search algorithm is used to find
edge paths: starting from an edge at a junction, the
intermediate edge path that has the minimum length is
repeatedly selected and extended with the edge links from
its end until it reaches the other junction or
exceeds a predetermined length.
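
A minimal sketch of such a best-first search is given below. It assumes that edge links are available as a dictionary from an edge identifier to its linked edges and that each edge has a known length; these data structures and the function name are illustrative assumptions, and the goal is modelled as an edge at the other junction.

```python
import heapq

def find_edge_path(links, lengths, start, goal, max_length):
    """Best-first search: always extend the shortest intermediate edge path.

    links:   dict mapping an edge id to the edge ids linked from its end
    lengths: dict mapping an edge id to its length
    Returns (total_length, path) or None if no path within max_length exists.
    """
    heap = [(lengths[start], [start])]
    while heap:
        total, path = heapq.heappop(heap)      # shortest intermediate path
        last = path[-1]
        if last == goal:
            return total, path
        for nxt in links.get(last, ()):
            if nxt in path:                    # do not revisit an edge
                continue
            new_total = total + lengths[nxt]
            if new_total <= max_length:        # predetermined length bound
                heapq.heappush(heap, (new_total, path + [nxt]))
    return None
```

Because the shortest intermediate path is always expanded first, the first path that reaches the goal is also the shortest one under the given links.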

4.2.4 Finding corresponding co-parallel edgel pairs

Corresponding co-parallel edgel pairs can be detected
by applying constraint (i). For a pair of edge paths,
the correspondence angle between a co-parallel edgel pair
is calculated from the junctions in an end. By using
projection in that angle, corresponding co-parallel
edgel pairs are found. Figure 12 shows examples of
corresponding co-parallel edgel pairs. During the projection
the following constraints are verified: projected positions
of edgels along a meridian must be monotonic because a
sweeping rule is a single-valued function along the Z axis
(figure 13); a pair of meridians must not share an edge
segment. If these constraints are not satisfied,
inconsistent-sets are generated since the pair of edge paths
is not consistent as meridians.

Figure 10: Hypothesized junctions from the ends.

Figure 12: Examples of corresponding co-parallel edgel pairs.

Figure 13: Invalid sweeping rule.

4.2.5 Inconsistent-sets by scaling ratios and tangent intersections

Each corresponding co-parallel edgel pair gives a scaling
ratio and a tangent intersection for a parallel, though
its position along the axis is not determined yet. Two
pairs of edge paths give two sweeping rules and two sets
of tangent intersections that should coincide. The averaged
difference of sweeping rules, as well as of tangent
intersections, is calculated for two pairs of edge paths
that share an edge path, namely (M1, M2) and (M1, M3). If the
difference is not sufficiently small, an inconsistent-set of
three edge paths, (M1, M2, M3), is generated. To calculate
the distance between tangent intersections, a polar
coordinate system with non-linear scaling is used for a rough
approximation of error normalization, as in section 4.2.2.

4.2.6  Consistent-set  generation  for  a  group  of 
meridians 

The constraints yield inconsistent-sets as described
above. A group of meridians that satisfies a consistent
interpretation must not be a superset of any
inconsistent-set. Manipulation of consistent-sets can be
implemented efficiently with simple boolean operations on bit
vectors as used in ATMS. For instance, the condition that A
is a superset of B is ~B ⊕ (B ⊗ A) = T, where ~, ⊕ and
⊗ are the bit-not, bit-or and bit-and operators, T is the
bit vector with all bits set, and each bit in the bit
vectors A and B indicates whether each hypothesis is true or
false, i.e., whether each edge path is a meridian
or not. By generating combinations of hypotheses and
pruning them with the inconsistent-sets, consistent-sets
are acquired.

Though this method might take a long time for a large
number of hypotheses, it is useful when the number of
hypotheses is limited, e.g. to 20 or 30, and many of the
hypotheses are pruned in the early stages.
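
The bit-vector manipulation can be sketched with Python integers as bit vectors. The superset test follows the boolean condition stated above; the depth-first enumeration with early pruning is an assumed, simplified strategy and the function names are hypothetical.

```python
def is_superset(a, b, width):
    """True if bit vector a contains every bit set in b: ~b | (b & a) is all ones."""
    full = (1 << width) - 1
    return ((~b & full) | (b & a)) == full

def consistent_sets(n, inconsistent):
    """Enumerate non-empty hypothesis sets that contain no inconsistent-set.

    n:            number of hypotheses (edge paths), one bit each
    inconsistent: list of bit vectors, each an inconsistent-set
    A partial set that already contains an inconsistent-set is pruned at once,
    because every superset of it is inconsistent as well.
    """
    results = []

    def extend(current, next_bit):
        for bit in range(next_bit, n):
            candidate = current | (1 << bit)
            if any(is_superset(candidate, bad, n) for bad in inconsistent):
                continue
            results.append(candidate)
            extend(candidate, bit + 1)

    extend(0, 0)
    return sorted(results)
```

With three edge paths and the single inconsistent-set 0b011 (the first two paths cannot both be meridians), the surviving sets are 0b001, 0b010, 0b100, 0b101 and 0b110.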

4.2.7  Normalized  least  squares  method  to  find 
the  axis  of  SHGC 

The  final  condition  to  verify  for  a  meridian  group  is 
that  their  tangent  intersections  and  extension  intersec¬ 
tions  lie  in  a  straight  line.  Least  squares  method  is  used 
to  calculate  the  best  fit  line  and  the  error  from  it. 

The method used here normalizes the errors according
to the following model (figure 14). For extension
intersections, an error in the position of an edgel $P$ is
magnified by $(l + l_0)/l_0$ in the error $e$ of the extension
intersection, where $l_0$ is the distance between the junction
and $P$, and $l$ is the distance between $P$ and the extension
intersection. Thus, the normalized error $\epsilon$ is:

$$\epsilon = e \cdot \frac{l_0}{l + l_0} = e \cdot \frac{s_0 - s_1}{s_0},$$

where $s_0$ and $s_1$ are the distances between co-parallel
edgel pairs. The normalized least squares method minimizes
$\sum \epsilon^2$, not $\sum e^2$. In the same manner, the
normalized error $\epsilon$ for a tangent intersection is
approximated by:

$$\epsilon = e \cdot \frac{l_0 \cdot |\cos(\phi - \theta_2) - \cos(\phi - \theta_1)|}{s_0},$$

where $\phi$ is the angle between a co-parallel edgel pair;
$\theta_1$ and $\theta_2$ are the orientations at the edgels;
$l_0$ is the length of the edge curve used to calculate the
orientation; $s_0$ is the distance between the edgel pair.

Figure 14: Error model for intersections.

Figure  15  shows  examples  of  detected  meridians  and 
axes  by  the  algorithm. 

In figure 15 (a), only the meridian edges which have
junctions at both ends are detected. In (b), the limbs are
detected though strictly they are not meridians, because
condition (i) is satisfied when the object is a solid of
revolution. In (c), the limbs are detected because limbs
generated by a linear sweeping rule are meridians.

Table 1 shows the averaged square errors by the
normalized least squares method and the usual least squares
method. The errors by the normalized least squares
method are stable and interpretable, i.e., the errors in
position at edgels are less than 1 (pixel) when the
intersections lie in a line. (The maximum error in figure 15
(a) is quite large because many of the intersections are
nearly plus and minus infinite.) On the other hand, the
errors by the usual least squares method are magnified by the
extension length, so that they are not suitable to verify
whether the intersections are in a line or not.

The  cases  are  further  divided  into  where  tangent  and 
extension  intersections  are  distributed  in  a  line,  where 
they  are  at  a  finite  focal  point,  or  where  they  are  at  an  in¬ 
finite  focal  point.  These  three  correspond  to  a  non-linear 
sweeping  rule,  a  linear  (not  constant)  sweeping  rule,  and 
a  constant  sweeping  rule  respectively.  The  maximum  er¬ 
rors  and  ratio  of  infinite  foci  are  used  to  determine  if  the 
meridians have an axis or a focal point. Actually, the
axes are not fully constrained in cases (a) and (c). The
axes in the figures are drawn so that they pass through the
center of the cross-section or the center of the given end
edge to show the acquired results.

In  general,  limbs  are  not  meridians  and  an  object 
doesn’t  necessarily  have  two  or  more  meridians.  For¬ 
tunately,  our  method  is  extensible  to  limbs  when  a  cor¬ 
responding  cross-section  is  detectable,  though  this  ex¬ 
tended  version  is  not  implemented  yet.  (See  next  sec¬ 
tion  about  detection  of  cross-section.)  It  is  well-known 
that  orientation  of  the  tangent  to  limbs  and  that  to  the 
corresponding  point  in  the  cross-section  are  the  same  in 
the imaging plane (figure 16), so that the corresponding
points P' in the cross-section can be determined from a
limb point P. If the angle between two limb points P1
and P2 is equal to the angle between the corresponding
points in the cross-section P1' and P2', they are supposed
to lie in a parallel. The same constraint is used to find
parallels from a given cross-section and limbs in the
previous work.

Figure 15: Detected meridian edges and axes.

                           normalized least squares   usual least squares     infinite foci
data                       min error    max error     min error   max error   ratio
ElbowJoint: figure 15(a)   0.08952      537.90        9081.7      6.2926E6    0.580
RiceBowl: figure 15(b)     0.13935      11.730        861.24      1.6723E5    0.045
HeartCup: figure 15(c)     0.02379      0.22385       371.24      2171.2      0.0

Table 1: Averaged squares error by normalized least squares fitting to a line and ratio of infinite distant intersections.

Figure 16: Relation between limb points and cross-section points.

The performance of the algorithm, both its detectability
and its computing time, depends on the ability to find
junctions correctly. The algorithm used here to find
junctions is not very reliable: it detects many false
junctions, which causes a long computing time. Once
junctions are found, the constraints are strict enough to
reject false meridians in our experiments.

4.3  Finding  cross-section  edges  in  an  edge 
image 

The module to find cross-section edges is a necessary part
of the program. We describe the algorithm to find
cross-section edges in this section, though it has an obvious
limitation and should be replaced by another algorithm
using more sophisticated constraints in the future.

If there are no restrictions on cross-section shape in
three-space, no strong geometrical constraint is available
to detect projected cross-section edges. Thus we impose
an additional restriction on cross-section shapes, i.e., the
cross-section should be a C^1 curve.

Currently, the algorithm finds the minimum length
C^1 closed edge path. The algorithm is a best-first search
starting at the end edges detected by the end-finder
module. The shortest edge path is repeatedly selected and
extended with C^1 edge links found from its end until a
closed path is found or the length of the path exceeds a
certain threshold. The method to find edge links is similar
to the one for finding junctions. The difference is that
the former verifies the difference of orientations between
two edges at the edge link.

Detected  cross-sections  are  shown  in  figure  17.  This 
algorithm  has  the  same  weak  point  as  the  junction  find¬ 
ing  algorithm  in  section  4.2.1.  Still,  it  can  detect  correct 
cross-sections  mostly  when  no  significant  obscurations  or 
missing  parts  exist  in  the  cross-section. 

5  Recovering  3D  shape  of  SHGC 

Though the geometrical constraints for contours alone are
not sufficient to recover the 3D shapes of SHGCs, recovery
is possible when some more constraints are available. Since
a planar region, the projected shape of the cross-section,
is detected by the system, skewed symmetry gives an
important constraint as in the analysis of polyhedral
objects. Two additional constraints are derived from
assumptions on object shapes:

• The object belongs to Right SHGCs.

• The axis passes through the center of the cross-section.

With these assumptions, the detection of three orthogonal
axes in an image becomes possible, and we have
sufficient constraints to recover 3D shape quantitatively.

Figure 17: Detected cross-section edges and axes of skewed symmetry.

5.1  Recovery  of  3D  information  from  three 
orthogonal  axes 

In  this  section,  we  show  the  relation  between  the  pro¬ 
jection  of  three  orthogonal  axes  and  viewing  angles,  and 
then  the  relation  of  object  contours  in  an  image  to  its 
sweeping  rule  and  original  cross-section. 

According to [Ponce et al. 89], the contours of an RSHGC
in the imaging plane are:

$$\vec{OQ} = \rho(\theta) r(z) \sin(\theta - \alpha)\, w + [z \sin\beta - \rho(\theta) r(z) \cos\beta \cos(\theta - \alpha)]\, u,$$

where $\vec{OQ}$ indicates the position in the image; $(w, u)$
are the unit bases of the imaging plane; $(\alpha, \beta)$ is
the viewing angle (figure 18).

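By substituting the values for $\theta$ and $z$, we get the
projection of the three axes as:

$$\vec{OQ_x} = -\rho(0) r(0) (\sin\alpha\, w + \cos\beta \cos\alpha\, u),$$

$$\vec{OQ_y} = \rho(\pi/2) r(0) (\cos\alpha\, w - \cos\beta \sin\alpha\, u),$$

$$\vec{OQ_z} = z \sin\beta\, u.$$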

An actual image is obtained by rotating and translating
the projected object contours in the imaging plane.
By letting the angles of the axes before rotation be
$\psi_x$, $\psi_y$ and $\psi_z$, and the angles after rotation
be $\psi'_x$, $\psi'_y$ and $\psi'_z$, we get the following
relations:

$$\psi'_x = \psi_x + \gamma, \quad \psi'_y = \psi_y + \gamma, \quad \psi'_z = \psi_z + \gamma,$$

$$\tan\psi_x = \frac{\cos\beta \cos\alpha}{\sin\alpha}, \quad \tan\psi_y = -\frac{\cos\beta \sin\alpha}{\cos\alpha},$$

where $\gamma$ is the rotation angle around the viewing axis.
Since $\psi_z$ is $\pi/2$, we get the rotation angle
$\gamma = \psi'_z - \pi/2$.

From the last two equations, we get the following
relations:

$$\cos\beta = \pm\sqrt{-\tan\psi_x \tan\psi_y}, \quad \alpha = \mp \arctan\sqrt{-\frac{\tan\psi_y}{\tan\psi_x}}.$$

The translation vector in the imaging plane is determined
from the center of the cross-section because the axis of
the SHGC is assumed to pass through it. After cancelling
the rotation and translation, the projected cross-section is
observed as:

$$w = \rho(\theta) r(0) \sin(\theta - \alpha), \quad u = -\rho(\theta) r(0) \cos\beta \cos(\theta - \alpha).$$

Thus, by using the $(w, u)$ coordinate system and
substituting $\alpha$ and $\cos\beta$, the original
cross-sections are calculated:

$$\rho(\theta) \cdot r(0) = \sqrt{w^2 + \frac{u^2}{\cos^2\beta}} = \sqrt{w^2 - \frac{u^2}{\tan\psi_x \tan\psi_y}},$$

$$\theta = \alpha - \arctan\frac{w \cos\beta}{u}.$$


Also, points along a projected meridian, i.e., points
obtained by changing the value of $z$ with a constant
$\theta$, are:

$$w = \rho(\theta) r(z) \sin(\theta - \alpha),$$
$$u = z \sin\beta - \rho(\theta) r(z) \cos\beta \cos(\theta - \alpha).$$

The junction $(w_0, u_0)$ between an end ($z = 0$) and a
meridian is:

$$w_0 = \rho(\theta) r(0) \sin(\theta - \alpha),$$
$$u_0 = -\rho(\theta) r(0) \cos\beta \cos(\theta - \alpha).$$

By using the $(w, u)$ coordinate system and $(w_0, u_0)$,
the sweeping rules, $h(z) = r(z)/r(0)$, are calculated as:

$$h(z) = \frac{w}{w_0}, \quad z = \pm\frac{1}{\sqrt{1 + \tan\psi_x \tan\psi_y}} \left(u - \frac{w}{w_0} u_0\right).$$

The signs give reversed or symmetric shapes since the
direction of the axes may be reversed.
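
The relations in this section can be exercised numerically with the short sketch below. It assumes that the rotation and translation have already been cancelled, takes the positive sign for cos β and sin β and a positive α (one of the sign choices mentioned above), and uses hypothetical function names.

```python
import math

def viewing_angles(psi_x, psi_y):
    """Recover (alpha, cos_beta) from the axis angles before rotation.

    Assumes the sign choice cos(beta) > 0 and alpha > 0.
    """
    tx, ty = math.tan(psi_x), math.tan(psi_y)
    cos_beta = math.sqrt(-tx * ty)
    alpha = math.atan(math.sqrt(-ty / tx))
    return alpha, cos_beta

def cross_section_point(w, u, alpha, cos_beta):
    """Recover (theta, rho(theta) * r(0)) from a projected cross-section point."""
    radius = math.sqrt(w * w + (u / cos_beta) ** 2)
    theta = alpha + math.atan2(w * cos_beta, -u)
    return theta, radius

def sweeping_rule_point(w, u, w0, u0, psi_x, psi_y):
    """Recover (h, z) with h = r(z)/r(0) from a projected meridian point.

    (w0, u0) is the junction of the meridian with the end at z = 0.
    Assumes the sign choice sin(beta) > 0.
    """
    h = w / w0
    sin_beta = math.sqrt(1.0 + math.tan(psi_x) * math.tan(psi_y))
    z = (u - h * u0) / sin_beta
    return h, z
```

A synthetic check is to project points with known α, β, ρ(θ) and r(z) using the contour equation, and to confirm that these functions return the same values.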
Figure 18: A coordinate system for RSHGC.


5.2  A  simple  method  for  finding  skewed 
symmetry  in  cross-section 

Researchers  reported  methods  to  find  the  axes  of  skewed 
symmetry,  e.g.  a  method  based  on  moments  [Friedberg 
86],  or  one  based  on  local  signature  [Ponce  90].  In 
this  section,  we  propose  an  alternate  algorithm  based  on 
least  squares  fitting  of  a  straight  line  to  the  axis  points. 
Though  it  is  applicable  only  to  a  single  closed  curve  like 
object  boundary,  the  criterion,  linearity  of  the  axis,  is 
straightforward and easy to understand. Also, it doesn’t
depend  on  the  curvature  measurement,  so  that  it  is  ap¬ 
plicable  even  if  the  measurement  at  edgels  is  poor.  The 
algorithm  shown  below  is  a  variation  of  Nevatia  and  Bin- 
ford’s  axis  finding  algorithm  and  is  summarized  as  fol¬ 
lows: 

1.  Discretize  orientation  for  projection  direction, 

2.  For  each  direction, 

(a)  project  edgels  into  buckets, 

(b)  order  edgels  in  each  bucket, 

(c)  if  the  number  of  edge  segments  in  each  bucket 
is  even,  calculate  mid-points  as  shown  in  fig¬ 
ure  19, 

(d)  fit  a  straight  line  to  the  mid-points  by  least 
squares  method  and  get  the  averaged  error  from 
the  line. 

3.  find  local  minimum  of  the  errors  in  all  the  projec¬ 
tion  directions;  if  the  error  value  at  the  local  mini¬ 
mum  is  sufficiently  small,  the  fit  line  gives  the  axis 
of  symmetry  and  the  projection  angle  determines 
the  direction  of  the  axis  of  skew. 
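
The steps above can be sketched as follows. This is a simplified illustration rather than the implemented algorithm: it assumes a single convex closed contour, so that each bucket is crossed twice and only the two extreme edgels per bucket are used for the mid-point (instead of ordering edge segments and checking that their number is even), it fits the line by total least squares, and it returns the global rather than a local minimum. All names and parameter values are hypothetical.

```python
import numpy as np

def skew_symmetry_error(points, phi, bucket_width):
    """Mean distance of chord mid-points from their best-fit line.

    points: (N, 2) array of edgel positions on a closed contour
    phi:    projection direction (angle of the chords) in radians
    """
    d = np.array([np.cos(phi), np.sin(phi)])       # direction along the chords
    n = np.array([-np.sin(phi), np.cos(phi)])      # axis along which buckets lie
    s, t = points @ d, points @ n
    idx = np.floor((t - t.min()) / bucket_width).astype(int)
    mids = []
    for b in np.unique(idx):
        sel = np.flatnonzero(idx == b)
        lo = sel[np.argmin(s[sel])]                # one end of the chord
        hi = sel[np.argmax(s[sel])]                # the other end
        mids.append((points[lo] + points[hi]) / 2.0)
    mids = np.asarray(mids)
    if len(mids) < 3:
        return float("inf")
    centered = mids - mids.mean(axis=0)
    _, _, vt = np.linalg.svd(centered, full_matrices=False)
    normal = vt[-1]                                # normal of the best-fit line
    return float(np.mean(np.abs(centered @ normal)))

def find_skew_symmetry(points, n_directions=36, bucket_width=0.05):
    """Return (phi, error) for the direction with the smallest fitting error."""
    phis = np.linspace(0.0, np.pi, n_directions, endpoint=False)
    errors = [skew_symmetry_error(points, p, bucket_width) for p in phis]
    best = int(np.argmin(errors))
    return float(phis[best]), errors[best]
```

The caller would still compare the returned error with a threshold before accepting the fitted line as the axis of symmetry.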

Examples  of  the  axes  of  symmetry  detected  by  the 
algorithm  are  shown  in  figure  17. 

5.3  Experimental  results  of  the  recovery  of  3D 
shape 

By  using  equations  in  section  5.1,  the  original  cross- 
section  shapes  and  sweeping  rules  are  calculated  from 
the  axes  and  edges  shown  in  figure  15  and  figure  17. 


Figure  19:  Mid-points  and  contours  for  skewed  symme¬ 
try  detection. 

The  results  are  shown  in  figure  20.  As  in  pivot  theorem 
[Shafer  and  Kanade  83],  it  is  impossible  to  determine 
an  axis  from  meridians  when  the  sweeping  rule  is  a  lin¬ 
ear  function.  In  figure  15  (b),  it  is  assumed  that  the 
axis  of  SHGC  passes  through  center  of  the  cross-section. 
The  recovered  shape  functions  are  good  approximations 
of  the  original  cross-sections  and  sweeping  rules,  how¬ 
ever,  the  effects  due  to  the  deformation  by  perspective 
projection  are  not  negligible.  For  quantitative  analysis 
of  original  shapes,  consideration  of  perspective  projec¬ 
tion  may  be  required.  However,  for  selecting  hypotheses 
in  model-based  vision  systems,  the  resulting  information 
may  be  sufficiently  accurate;  many  features  used  to  in¬ 
dex  object  shapes,  such  as  elongatedness,  compactness, 
area,  or  moments  can  be  estimated  from  it. 

6  Conclusion 

In this paper, we presented a set of modules to extract
partial descriptions of SHGC objects in an edge image
and described the algorithms used in the modules, i.e.,
an algorithm to find the ends of an SHGC, one to find the
meridian edges, one to find the cross-section edges, and
one to recover the sweeping rule and the original
cross-section shape.

The major contribution of this work is that it shows the
feasibility of extracting geometrical edges from real images
and provides the algorithms for SHGC objects. We believe
that this work will accelerate the development of
model-based vision systems and other methodologies in
this field, especially their application to real images.
BUILDER-l’s weak points are as follows:

Problem (i) it requires heavy computation to find
geometrical edges, especially end edges;

Problem (ii) it cannot handle considerable obscurations
in geometrical edges;

Problem (iii) some of the modules lack generality, in
other words, they require additional restrictions on
object shapes.


Figure 20: Recovered cross-sections and sweeping rule functions. (a): cross-section for RiceBowl, (b): sweeping rule
for RiceBowl, (c): cross-section for HeartCup, (d): sweeping rule for HeartCup.


Curve fitting to the edges or selection of edgels which
represent edges, e.g., the B-spline representation in
[Saint-Marc and Medioni 90] or zero curvature points in [Ponce
et al. 89], should be effective in alleviating problem (i).
Also,  detection  of  curvilinearity  or  super-edges  which  are 
the  groups  of  edges  that  preserve  C*  or  C*  continuity, 
will  solve  the  problem  (i)  and  (ii)  partly.  Improvement 
of  junction  finder  algorithm  will  not  only  reduce  the  com¬ 
putation  in  meridian  finder  module  but  also  give  addi¬ 
tional  clues  regarding  the  junctions.  They  are  especially 
useful  for  the  occlusion  problem  and  cross-section  finder 
if  it  can  determine  the  types  of  junctions  in  [Malik  87]. 

Though  the  assumptions  imposed  on  the  object  shapes 
in  3D-recoverer  module  are  not  realistic  in  many  situ¬ 
ations,  it  is  still  applicable  to  many  industrial  objects 
and  gives  an  example  of  further  bottom  up  processings. 
Other  researchers  showed  promising  results  for  the  rein¬ 
forcement  in  the  3D  recoverer  to  relax  the  restrictions 
on  object  shapes  [Ulupinar  and  Nevatia  90]  [Gross  and 
Boult  90]. 

A  complete  system  for  image  recognition  is  supposed 
to  have  a  flexible  control  mechanism  which  combines  bot¬ 
tom  up  and  top  down  processings  with  focus  of  attention. 
So, the bottom up processing module itself should be
extended to cope with various top down information, i.e.,
restrictions on the shape or the appearance of objects in
the image.

A  Geometrical  constraints  on  the 
meridian  edges 

According to [Ponce et al. 89], the projected contours of
an SHGC including the oblique case are:

$$\vec{OQ} = \rho(\theta) r(z)\, d + z\, b,$$

$$b = \sin\beta_0 \sin(\alpha_0 - \alpha)\, w + [\cos\beta_0 \sin\beta - \cos\beta \sin\beta_0 \cos(\alpha_0 - \alpha)]\, u,$$

$$d = \sin(\theta - \alpha)\, w - \cos\beta \cos(\theta - \alpha)\, u,$$

where $\vec{OQ}$ is the position in an image, $(w, u)$ are the
unit bases of the imaging plane, $\rho(\theta)$ is a
cross-section curve, $r(z)$ is a sweeping rule,
$(\alpha, \beta)$ are the viewing angles, and
$(\alpha_0, \beta_0)$ is the orientation of the axis of the SHGC.

The vector between a co-parallel edgel pair $\vec{Q_1 Q_2}$ is:

$$\vec{Q_1 Q_2} = [\rho(\theta_2) d(\theta_2) - \rho(\theta_1) d(\theta_1)]\, r(z).$$

So, these vectors are parallel for a pair of meridians
regardless of $z$. The vector between a co-meridian edgel
pair $\vec{Q_0 Q_1}$ and its extension
$\vec{OQ_0} + t\, \vec{Q_0 Q_1}$ are:

$$\vec{Q_0 Q_1} = \rho(\theta) d(\theta) (r(z_1) - r(z_0)) + (z_1 - z_0)\, b,$$

$$\vec{OQ_0} + t\, \vec{Q_0 Q_1} = \rho(\theta) d(\theta) [r(z_0) + t (r(z_1) - r(z_0))] + [z_0 + t (z_1 - z_0)]\, b.$$

The first term vanishes for $t = r(z_0)/(r(z_0) - r(z_1))$.
Thus, we get the position of the focus $\vec{OF}$:

$$\vec{OF} = \left[ z_0 + \frac{r(z_0)}{r(z_0) - r(z_1)} (z_1 - z_0) \right] b,$$

where $b$ is the projection of the 3D axis of the SHGC.
Thus, the foci, i.e., the extension intersections, are on the
axis.

Abstract

We present preliminary results of automated
image interpretation using a Bayesian network.
With only a rudimentary implementation
of our approach, we successfully find
structure, pose and shape of a 3-D pipe
elbow. Bayesian aggregation nodes are used
to infer interpretations of image data from the
bottom up, penultimately producing a cylinder
and helix, between which simple automated
symbolic algebra infers 3-D relations.

In a second part, we briefly introduce Classics,
a highly typed constraint system for
geometric modeling. And though not robustly
developed for this early report, we mention
model based prediction and how it cooperates
with aggregation.

1.  Introduction 

This paper presents a preliminary implementation
of an automated image interpretation system.
For this paper we define interpretation as
the process of hypothesizing structure, pose and
shape of three dimensional scenes, whose projection
appears in a monocular image. In our
framework, image interpretation has three primary
components: aggregation, indexing, and prediction,
and is guided by a model data base and a
control strategy (see Figure 1). We will refer not
only to interpretations of 3-D scenes, but also to
interpretations of 2-D data as well.

The  first  part  of  this  paper  focuses  on  aggrega¬ 
tion  as  dl^ramed  In  boldface  In  Figure  1.  Aggre¬ 
gation  is  the  grouping  of  structure  at  one  level  to 
form  higher  level  structure,  creating  more  com¬ 
plex  Interpretations  of  the  bottom  level  data  as  it 
goes  (for  example,  grouping  lines  Into  parallel 
lines  or  grouping  parts  of  an  object  to  form  the 
whole). 

Successful  aggregation  virtually  solves  the 
matching  problem  by  eliminating  it.  Beginning 
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Figure  1:  Diagram  of  Successor  vision  system. 

This  paper  focus  on  aggregation  and  the 
modeling  ^tem.  shown  in  boldface. 

with  the  image  data  itself,  for  in  a  broad  sense 
aggregation  includes  edge  detection  and  s^men- 
tation.  low  level  interpretations  rise  into  3-D  in¬ 
terpretations  with  model/data  correspondences 
intact.  However,  two  facts  daunt  bottom  up  ag¬ 
gregation:  there  exists  no  unique  interpretation 
of  image  data,  and  errors,  from  occlusion,  noise, 
shadows  and  specularlties.  give  rise  to  missing  or 
misleading  evidence. 

To overcome multiple image interpretations, we invoke strong generic
arguments to hypothesize likely interpretations. For example, while an
ellipse is not guaranteed to be the projection of a circle, it's a
good bet and a worthy guess. Naturally, when combined with error
inherent in the data, this gives rise to incorrect hypotheses. But the
belief network dynamically accumulates evidence, pro and con, and
keeps interpretations at all levels rank ordered by posterior
probability. Our aim is to demonstrate that the rich information in
images ultimately leads to a correct interpretation.

Interpretation is model guided at all levels, requiring models of
classes of objects, geometry, geometric relations, and the imaging
process, as well as specific object models. To achieve the requisite
level of symbolic representation for geometric reasoning and machine
interpretation, we developed Classics. Structures within Classics are
highly typed, retaining significant symbolic information. By
explicitly defining classes in terms of other classes using
mathematical constraints, Classics works like a constraint network.
The class typing allows us to reason in a limited (geometric) domain
and keep closed the Pandora's box of solving the general constraint
problem.

Previous  Work 

We are developing an experimental system, Successor, to support
research in model based vision, and this work is part of that system.

The concepts in this paper follow directly from work in Bayesian
reasoning for vision [Levitt et al. 89; Binford et al. 87; Agosta 88].
Aggregation nodes come from new work presented in [Agosta 91], and
network building control strategies are discussed in
[Levitt et al. 90]. In particular, Levitt, Agosta and Binford
developed a similar framework for model-based vision using Bayesian
inference in [Levitt et al. 89], which contains a more encompassing
description than is found in this paper.

The approach and algorithms of this paper depend on other
segmentation and aggregation results, some of which overlap this work
in parallel development. Sato and Binford present aggregation results
spanning from edge data to generalized cylinder hypotheses
[Sato & Binford 91a; Sato & Binford 91b].

Other researchers hypothesize generalized cylinders from edge
contours [Ulupinar & Nevatia 91; Gross & Boult 90], although they do
not segment image data automatically. Although not probabilistic,
feature grouping (aggregation) has been used in
[Mohan & Nevatia 89; Chung & Nevatia 91; Huttenlocher & Wayner 91].
Bayes networks have been used for vision modeling in
[Jensen et al. 90]. We used a modified Canny edge detector developed
by Binford and Wang [Binford & Wang].

Interesting and original work in algebraic vision used symbolic
methods for “explicitly relating the shape of image contours to models
of curved three-dimensional objects” [Ponce & Kriegman 88a;
Ponce & Kriegman 88b]. In [Lowe 91], Lowe uses numerical minimization
to fit arbitrary curved surfaces to edge data. Both approaches solve
not only for the projection transformation, but also for model
parameters. In the former work, segmentation and matching are done by
hand, whereas our work performs the segmentation and matching
automatically. In a step precisely analogous to our prediction, Lowe
relies on choosing edges near a projection of the model. However, his
model is a priori close to correctly positioned.

Object modeling systems using constraints appeared in
[Nguyen et al. 91], [Marefat & Kashyap 91] and [Walker et al. 87],
focusing on polyhedral objects, and in [Kriegman 88]. In work similar
to Classics, Yokoyama defines FREEDOM, an “object-oriented and
constraint-based knowledge (system) for design object modeling”
[Yokoyama].

2.  Bayesian  Networks 


Figure  2:  A  simple  Bayes  network.  (This  is  what 
the  network  looks  like  during  prediction,  dis¬ 
cussed  in  a  later  section). 

A Bayesian network is a graph in which nodes are random variables and
arcs are conditional probabilities between the random variables. The
network forms a directed acyclic graph (DAG), and we will use the term
DAG interchangeably with Bayesian network. The DAGs are drawn as
influence diagrams [Shacter 86], and though influence diagrams use
circles to represent probabilistic nodes, we choose a slightly
non-standard notation of rounded boxes for Data nodes to emphasize
that they will be instantiated with a pre-posterior.

Prior probabilities, the beliefs in the states prior to additional
evidence, must be explicitly provided for top level nodes (Parallel
Curves in Figure 2). Priors on interior and leaf nodes arise through
propagation of priors from above. Though having to provide prior
probabilities seems ad hoc at first sight, if no information is
otherwise available, making all prior probabilities in a node equal
neutralizes their influence.

Referring to Figure 2, for example, between nodes Curve and Data is
the arrow representing the conditional probability of the data given
the Curve hypothesis, P{Data = d_i / Curve = c_j}, or the probability
that the random variable Data is in state d_i given that random
variable Curve is in state c_j. For notational brevity, we will
usually refer to this as P{d_i / c_j}. In this paper, network leaf
nodes are always Data, whose states are known through observation, and
thus we instantiate the pre-posterior, P{Data = Observation} = 1.
Through application of Bayes' Rule, the posterior probabilities of the
remaining nodes follow, and therein lies the power of Bayes networks
to discriminate. We use the HUGIN Bayesian belief system to propagate
probabilities through the net [Andersen et al. 90]. For more on
influence diagrams, the reader is referred to [Shacter 86].

Figure 3: An overall view of the interpretation network (diagram
labels: Cylinder & Helix; orthogonal, intersecting axes). The left
side shows a simplified diagram of the Bayes net used for aggregation
of a 3D cylinder and helix. The right side shows prediction from the
elbow model. Tiny nets from aggregation become the ‘ornaments’ in
prediction, as shown in brick. Black nodes were too weak to find in
aggregation, but can be used for support and denial in prediction.
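
As a minimal sketch of how a posterior follows from Bayes' Rule across
a single hypothesis-to-Data link once the Data node is instantiated as
Observed, the Python below normalizes prior times likelihood over the
states of a Curve node. The function name and all numerical values are
illustrative assumptions; the system described here propagates
probabilities with HUGIN, not with code like this.

```python
def posterior(priors, likelihoods):
    """Bayes' Rule for one hypothesis node with an observed Data child.

    priors[s]      -- P{s}, prior belief in state s of the hypothesis node
    likelihoods[s] -- P{Observed / s}, conditional probability of the data
    Returns P{s / Observed} for every state s.
    """
    joint = {s: priors[s] * likelihoods[s] for s in priors}
    evidence = sum(joint.values())  # P{Observed}
    if evidence == 0.0:
        raise ValueError("the observation has zero probability under every state")
    return {s: joint[s] / evidence for s in joint}


# Hypothetical numbers: equal priors, so only the likelihoods discriminate.
curve_priors = {"Line": 1 / 3, "Ellipse": 1 / 3, "Circle": 1 / 3}
curve_likelihoods = {"Line": 0.1, "Ellipse": 0.6, "Circle": 0.3}
curve_posterior = posterior(curve_priors, curve_likelihoods)
```

With equal priors the posterior is simply the likelihoods rescaled to
sum to one, which is the sense in which equal priors neutralize their
influence.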

3.  Aggregation 


In recent work, Agosta & Levitt introduce aggregation nodes for
combining the influence of evidence in a Bayesian network. In this
paper, each level of aggregation builds a set of simple, two node DAGs
(see Figure 4): a hypothesis and an instantiated data node. Connected
directly to the data, these small DAGs become the ornaments that we
will later hang on the predictive tree [Agosta 91]. Shown in Figure 3
is an aggregation and prediction network for interpretation of the
pipe elbow pictured in Figure 5.

Figure 4: An Aggregation Node. The Parallel Curves DAG is built in
response to good curve hypotheses, but is not connected to Curve nodes
through conditional probabilities.

Figure 5: Image of the Elbow


3.1  Network  Nodes 

In our implementation, we define the following nodes (random
variables): Data, Curve, Parallel-curves, Similar-curves,
Parallel-&-similar-curves, Cylinder, and Cylinder-&-Helix. The
probability nodes and their states are listed below along with more
precise state definitions, for nodes must be well defined in order to
define meaningful conditional probabilities between the nodes.


•  Data nodes have states Observed and Not-observed. The Observed
hypothesis is that the given feature would be observed in edge data
with at least as much error as the given data.

•  Curve nodes have states Line, Ellipse and Circle. The Line
(Ellipse, Circle) hypothesis is that there is a linear (elliptical,
circular) feature in the image data with the parameters of our Line
(Ellipse, Circle) hypothesis.

•  Parallel Curves nodes have states Parallel-lines,
Parallel-ellipses, Parallel-circles and Not-parallel. The
Parallel-lines (ellipses, circles) hypothesis is that there are two
parallel linear (elliptical, circular) features in the image data with
the parameters of the hypothesis. Ellipses and circles are actually
elliptical and circular arcs which are locally parallel in the manner
of [Sato & Binford 91a].

•  Similar Curves nodes have states Similar-lines, Similar-ellipses,
Similar-circles and Not-similar. The Similar-lines (ellipses, circles)
hypothesis is that there are two linear (elliptical, circular)
features in the image data of the same length.

•  Parallel & Similar Curves nodes have states P&S-lines,
P&S-ellipses, P&S-circles, and Not-P&S. The hypotheses are the
conjunction of being both parallel and similar.

•  Cylinder node has states Cylinder and Not-a-cylinder. The Cylinder
hypothesis is that the image data is the projection of a circular
cylinder with hypothesized parameters and viewing direction.

•  Helix node has states Helix and Not-a-helix. The Helix hypothesis
is that the image data is the projection of a 3-D helix with
hypothesized parameters and viewing direction.

•  Cylinder & Helix has states C&H and Not-C&H. The Cylinder-&-helix
hypothesis is the conjunction of cylinder and helix, with a
hypothesized 3-D transformation relation between the cylinder and the
helix.

Note that the Parallel-curve node has a Not-parallel hypothesis,
whereas the Curve node admits no possibility that the data might not
be a curve. Open world hypotheses, like Not-parallel or
Not-line-ellipse-nor-circle, absorb evidence into themselves in the
simple aggregation scheme outlined in this paper. Since Parallel-lines
necessarily requires two lines, for example, Not-parallel-lines will
be true if either line is not a line, with higher levels having ever
smaller posteriors on the positive hypotheses. Omitting the
Not-hypothesis normalizes probabilities, and is equivalent to
conditioning the node on the existence of (one of) the feature(s)
being true. A more sophisticated aggregation network could make the
existence conditioning explicit, and in particular, the prediction
network should indeed express existence.
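
The remark that omitting the Not-hypothesis amounts to conditioning on
existence can be illustrated with a short Python sketch. The function
name and the belief values are hypothetical, chosen only to show the
renormalization; they are not taken from the system.

```python
def drop_not_state(beliefs, not_state):
    """Condition a node's beliefs on the feature existing.

    Removes the open world state (for example "Not-parallel") and
    renormalizes the remaining states so that they sum to one.
    """
    kept = {s: p for s, p in beliefs.items() if s != not_state}
    total = sum(kept.values())
    if total == 0.0:
        raise ValueError("no probability mass left on the positive hypotheses")
    return {s: p / total for s, p in kept.items()}


# Hypothetical beliefs for a Parallel Curves node.
beliefs = {"Parallel-lines": 0.3, "Parallel-ellipses": 0.1,
           "Parallel-circles": 0.1, "Not-parallel": 0.5}
conditioned = drop_not_state(beliefs, "Not-parallel")
```

The relative ranking of the positive hypotheses is unchanged; only the
mass absorbed by the open world state is redistributed.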

3.2  The  Conditional  Probabilities 

Each probability node is connected to a data node with an
instantiated pre-posterior, and the conditional probabilities between
them are given below. Although these conditional probabilities are
occasionally ad hoc, our aim is to draw upon maximum likelihood
estimation, quasi-invariant analysis and imaging physics to find
conditional probabilities in the lowest levels of the Bayesian
network.


Figure  6:  The  shaded  portion  represents  the 
probability  of  the  Data  node  =  Observed  given  a 
Curve  hypothesis.  It  is  the  probability  that  the 
average  error  in  the  fit  of  the  curve  to  edgel  data 
is  greater  than  the  observed  average  error. 

•  P{Data/Curve}: Curves are best least squares fits of the curve
model to the data. Assuming the distance between each edgel and the
curve hypothesis (the error of the fit) to be a Gaussian random
variable, the average error of fitting a conic to edgels will also be
Gaussian, as shown in Figure 6. The resulting
P{Observed/Curve hypothesis} is the probability that the average error
of fitting a curve to edgel data is greater than the observed average
error, shown shaded in the figure. The standard deviation, sigma, is
estimated.




•  P{Data/Parallel-curves} =
P{Curve1} * P{Curve2} * (1 - angular-difference / 10 degrees)

•  P{Data/Similar-curves} =
P{Curve1} * P{Curve2} * (1 - |length1 - length2| / (length1 + length2))

•  P{Data/Parallel-&-similar-curves} =
P{Parallel-curve} * P{Similar-curve}.

•  P{Data/Cylinder} =
P{Parallel-&-similar-ellipses} * (1 or 0), depending on whether or not
the ellipses could be the projection of the ends of a circular
cylinder.

•  P{Data/Helix} = 1. We fit a helix to the data, but it is not in the
Bayes network.

•  P{Data/Cylinder-&-Helix} = P{Data/Cylinder}.
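
The conditionals listed above can be sketched in Python as follows.
This is an illustration under stated assumptions, not the authors'
code: the function names are ours, the Gaussian for the average fit
error is taken to be zero-mean with the one-sided upper tail as the
shaded region, and the clamp to zero beyond 10 degrees is our addition
because the text gives no rule for larger angular differences.

```python
import math


def p_observed_given_curve(observed_avg_error, sigma):
    """Upper tail of a zero-mean Gaussian beyond the observed average error.

    Assumption: the average fit error has mean 0 and standard deviation
    sigma, and the shaded region of Figure 6 is the one-sided tail.
    """
    return 0.5 * math.erfc(observed_avg_error / (sigma * math.sqrt(2.0)))


def p_data_given_parallel(p_curve1, p_curve2, angular_difference_deg):
    """P{Curve1} * P{Curve2} * (1 - angular-difference / 10 degrees)."""
    # Clamping at zero is an assumption made for this sketch.
    factor = max(0.0, 1.0 - abs(angular_difference_deg) / 10.0)
    return p_curve1 * p_curve2 * factor


def p_data_given_similar(p_curve1, p_curve2, length1, length2):
    """P{Curve1} * P{Curve2} * (1 - |length1 - length2| / (length1 + length2))."""
    return p_curve1 * p_curve2 * (1.0 - abs(length1 - length2) / (length1 + length2))


def p_data_given_parallel_and_similar(p_parallel, p_similar):
    """P{Parallel-curve} * P{Similar-curve}."""
    return p_parallel * p_similar
```

For example, two curves with posteriors 0.9 and 0.8 that differ by 5
degrees give 0.9 * 0.8 * 0.5 for the parallel conditional.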


The other conditional, P{Not-observed/Feature}, is naively taken to
be 1 - P{Observed/Feature}. Although this is consistent with the
framing of the Observed hypothesis, the choice of this conditional
significantly affects the discriminatory power of the system. The
choice of P{Not-observed/Feature} warrants further investigation.
Bayes nets hinge on defining meaningful conditional probabilities for
the network, for even when a net has intuitively pleasing structure,
the conditional probabilities can be difficult to write down, if not
meaningless.


3.3  Results 


Figure 7: Diagram of the implementation. This corresponds to the
right hand side of Figure 1, but with implementation details fleshed
out.

The edge image from a modified version of the Canny edge detector
appears in Figure 8. Rough collections of parallel curves were found
by algorithms described in [Sato & Binford 91a] (Figure 7). For this
subset of edges, curve hypotheses were made, as shown in Figure 9 and
Figure 10.


Figure 11 shows curve hypotheses which arose from combining curves
into aggregated curves, including the ellipse hypothesis which is
parallel to the other end of the cylinder. As seen in Figure 10, the
components of this ellipse were originally a circle and two small line
segments, but simple prediction from the circle led to the
accumulation of the lines into an ellipse.

The two parallel ellipses are used to generate a 3-D cylindrical cone
hypothesis and the large set of ellipses leads to a helix hypothesis.
In this preliminary work, these penultimate hypotheses were a foregone
conclusion and not driven by models. Also in this work, the fit of the
cylinder to the data was represented by fitting two ellipses to the
data, constrained to be the same shape and tilted at the same angle.
The helix fit was estimated directly from the fit of the many
ellipses. We anticipate fitting 3-D models directly to the data in
future work.

The final model fit is shown in Figure 12. Notice that the limb edge
of the cylinder matches the lower limb of the model cylinder, and
could be found easily through prediction. Although it looks as though
the upper limb has been missed, this is not the case, for what appear
to be two candidates for the upper limb are in reality part of the
small ribs attached to the cylinder. The cylinder limb is not visible
and the model limb is correctly positioned between the two edges.

In order to flesh out the 3-D model, the system must establish the
relative Z coordinates of the two 3-D circles to form a cylinder, and
of the cylinder and the helix to form an elbow hypothesis. We used a
computer interface to the Maple algebraic system to find the solution
symbolically by hypothesizing a constraint that the two 3-D circles
are coaxial, and that the axes of the cylinder and helix intersect.
Although not model driven yet, this symbolic constraint based
reasoning is the goal of Classics.

Finally, in Figure 13 and Figure 14 we show two other views of the
model. Note there is a small error in the angle between the cylinder
and the helix. The actual angle is 90 degrees.

The number of hypotheses at each level for the aggregation of the
cylinder is given below:

Edges in image                 119
Curve hypotheses                63
Aggregated curves               19
Parallel curve pairs           169
Similar curve pairs            271
Parallel & similar ellipses     12
Cylinders                        3

Figure 14: Model rotated for front view.

3.4  Performance 

At a given level, nodes are rank ordered by posterior probability,
and the most likely are considered for pairwise aggregation. Most of
those considered fail a crude measure of compatibility, which is
conservative in not throwing away possibly good pairs. After
aggregation, even more end up with low posteriors and are never
considered again. This simple scheme is order N^2, but only in the
size of the subset considered for aggregation, which, if there is good
data, is small. If more information is needed from the lower levels,
it can be incorporated incrementally.
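
The selection scheme just described can be sketched in Python as
follows. The function name, the `top_k` cutoff and the `compatible`
callback are hypothetical stand-ins; the text does not specify how the
subset size or the crude compatibility measure are defined.

```python
from itertools import combinations


def candidate_pairs(hypotheses, top_k, compatible):
    """Pairs worth aggregating at one level.

    hypotheses -- list of (name, posterior) tuples
    top_k      -- size N of the subset considered; pairing costs O(N^2)
    compatible -- cheap, conservative test applied to each pair of names
    """
    ranked = sorted(hypotheses, key=lambda h: h[1], reverse=True)[:top_k]
    return [(a[0], b[0]) for a, b in combinations(ranked, 2)
            if compatible(a[0], b[0])]


# Hypothetical curve hypotheses and a trivial compatibility test.
curves = [("e1", 0.9), ("l1", 0.2), ("e2", 0.8), ("c1", 0.7)]
pairs = candidate_pairs(curves, top_k=3, compatible=lambda a, b: True)
```

Because only the `top_k` most probable hypotheses are paired, the
quadratic cost applies to that subset rather than to every hypothesis
at the level.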

4.  Indexing  and  Prediction 

Figure 15: The fundamental cycle of image understanding presented in
this work (diagram labels: Model Hypothesis, Indexing, Observed Object
Hypothesis, Weaker Evidence, Strong Evidence).

Implementations of indexing and prediction are limited in this work,
but are mentioned because, in complex images, aggregation will succeed
only in partnership with indexing and prediction. In a probabilistic
hypothesize and test paradigm, our method invokes models at every
level to predict interpretations at the next lower level. Predictions
help incorporate evidence which was not strong enough to base an
aggregated hypothesis on, but which nonetheless lends evidence toward
(or against) a hypothesis. This cycle, shown in Figure 15, occurs
between all levels pictured in Figure 1.

The top level node in the prediction network is a model, or small set
of models, from which we will derive the prediction DAG. There will be
leaf nodes in the prediction DAG which have no aggregated data nodes
connected to them. The viewing direction will be highly constrained
because a match of 3-D data to image data has already occurred. We
will project unmatched features into the image to find potential data
to match and, since we know where the projection will be, there will
be few possible matches and combinatorics will not be a problem.

5.  The  Classics  Constraint  System 

Classics is a constraint system designed for geometric modeling. On
the one hand it shares many of the features of an object oriented
system: class hierarchies, inheritance of properties, and maps, which
resemble methods. On the other, Classics is a constraint system:
classes are defined in terms of others using constraints. As much as
possible, classes in Classics retain their mathematical definition. In
this final regard, Classics is a symbolic mathematical language.

Classics has been designed for building a geometric modeling system.
Below is a discussion of the four principal features of Classics which
facilitate geometric modeling:

•  Constraints 

•  Highly  typed  representation 

•  Symbolic  mathematics 

•  Intuitive  expression 

5.1  Constraints 

One defines a class in terms of other classes using constraints: set
operations (union, intersection, difference), dimension altering
constraints (e.g., cartesian-product, projection), geometric
constraints (e.g., parallel, coaxial, adjacent), arithmetical
operations and special constraints (e.g., subclass-of,
representation-of).

The constraints are themselves classes which have been defined, and
have meaning to the computer. Thus it is possible for algorithms to
reason about classes and the relationships between them. Classes in
typical object oriented languages are data structures with simple
inheritance rules, but otherwise have meaning only to the programmer.

5.2  Highly  Typed  Representation 

Essentially, “highly typed” means that everything in Classics is a
class. We give below two simple examples where, in the first case,
data representation is unified and, in the second, declarative
statements are made more natural.

Consider the following class:

Tall-People
    (subclass-of People)
    (> height 6')

Not only do all tall people have a height greater than six feet, the
height of tall people equals a subclass of the class Interval, with a
minimum of six. One might think of an instance of Tall-People, say,
John with height 6' 3", as a highly constrained subclass with only one
member. The reason we prefer this interpretation is that it eliminates
ambiguity when properties of an ‘instance’ have a constrained value
rather than a uniquely specified value, a case in which an instance of
People with height constrained to be greater than 6' would appear as
something different from the class Tall-People.
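
The idea that an instance is just a highly constrained subclass can be
sketched in Python. This is not Classics syntax; the `Interval` class,
its fields and the `is_subclass_of` method are hypothetical names
introduced for illustration, and heights are assumed to be in feet.

```python
import math
from dataclasses import dataclass


@dataclass(frozen=True)
class Interval:
    """Set of heights in feet between low and high."""
    low: float
    high: float
    low_open: bool = False  # True when low itself is excluded

    def is_subclass_of(self, other):
        """True when every member of self is also a member of other."""
        if self.low < other.low or self.high > other.high:
            return False
        # A closed lower end cannot sit on an open lower end of the parent.
        if self.low == other.low and other.low_open and not self.low_open:
            return False
        return True


tall_height = Interval(6.0, math.inf, low_open=True)  # (> height 6')
john_height = Interval(6.25, 6.25)                    # 6' 3", one member
constrained = Interval(6.0, math.inf, low_open=True)  # People with height > 6'
```

Here John's height is a one-member subclass of the tall interval, and a
People "instance" constrained only by height greater than six feet is
indistinguishable from Tall-People, which is the ambiguity the class
interpretation avoids.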


Figure 16: Simple example of the use of representations of abstract
classes (diagram columns: Abstractions, Representations). The
blackened class represents the map between polar and cartesian space.
The system can automatically derive the class “Polar point on X axis”
if needed.

Abstract classes are distinct from their representations, with
isomorphic or homomorphic relations between the representations.
Constraints may be given in terms of the abstract class or any
representation. In the second example (Figure 16), conversions between
polar and cartesian points occur automatically and the constraint
(= Y 0) in the cartesian representation is equivalent to (= Theta 0)
in the polar form, and the more natural representation can be used.
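
A small Python sketch of the two representations and the matching
constraints is given below. The function names are hypothetical, and
the equivalence is checked only for points on the positive X axis;
with the `atan2` convention used here, a point on the negative X axis
has Theta equal to pi rather than 0.

```python
import math


def polar_to_cartesian(r, theta):
    """Map a polar point (r, theta) to its cartesian representation (x, y)."""
    return r * math.cos(theta), r * math.sin(theta)


def cartesian_to_polar(x, y):
    """Map a cartesian point (x, y) to polar (r, theta), theta in (-pi, pi]."""
    return math.hypot(x, y), math.atan2(y, x)


def on_x_axis_cartesian(x, y, tol=1e-12):
    """The constraint (= Y 0) in the cartesian representation."""
    return abs(y) <= tol


def on_x_axis_polar(r, theta, tol=1e-12):
    """The constraint (= Theta 0) in the polar representation."""
    return abs(theta) <= tol
```

A point stated in either form can be converted and tested with
whichever constraint is more natural for the problem at hand.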

Some constraints can be expressed in terms independent of
representation (e.g., parallel lines in terms of tangents),
abstracting away details when they would be a hindrance. The abstract
class collects all representations in one semantic group.

5.3  Symbolic Mathematics

Classics is built on top of several symbolic algebra packages (Maple,
MACSYMA, Reduce, and later, Mathematica). Although few features are
available as yet, the full functionality of these systems is
potentially available within Classics through direct LISP calls.
Translations between Classics expressions, which when possible match
those of LISP, and the dialect of a given algebra system occur
automatically.

It is not reasonable to throw general algebraic constraints at a
system like Maple and expect a solution. Symbolic algebra hasn't
reached that point. However, through the highly typed (object
oriented) structure, Classics emphasizes expressiveness over
automation. Classics works as a declarative programming language,
letting the user put sufficient information into the model so that
Classics can solve non-trivial geometric problems without trying to be
a general constraint solving system.

Classics can perform probabilistic inference with the HUGIN belief
network code [Andersen et al. 90], and can invoke numerical fitting
and minimization algorithms as well.

5.4  Intuitive  Expression 

Both for rigor and intuitive clarity, classes and maps have
mathematically meaningful names. For example, the Common Lisp Object
System (CLOS) manual gives an example of the class APPLE-PIE as having
superclasses APPLES and CINNAMON [Bobrow et al. 88]. While this
provides the intended effect of multiple slot inheritance, from a set
theoretic perspective it makes no sense. A set with multiple
superclasses is in the intersection of those superclasses. The
intersection of APPLES and CINNAMON is null, not APPLE-PIE. A more
precise statement might be that APPLE-PIE is the
(cartesian-product-of APPLES CINNAMON), although this is clearly an
inadequate definition.
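
The set theoretic point can be illustrated with plain Python sets. The
member names below are made up for the example and carry no meaning
beyond it.

```python
from itertools import product

# Hypothetical extensions of the two classes, as sets of individuals.
APPLES = {"granny-smith", "fuji"}
CINNAMON = {"ceylon", "cassia"}

# A class with both as superclasses would have to live in the intersection.
both = APPLES & CINNAMON

# The cartesian product pairs one member of each class.
apple_cinnamon_pairs = set(product(APPLES, CINNAMON))
```

The intersection is empty, while the product contains one pair for
every combination of an apple and a cinnamon.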

6.  Conclusion 

We presented results from the first (almost) fully working version of
the Successor image interpretation system. Beginning with grouped edge
data, our contribution automatically constructed a 3-D interpretation
of an image from the bottom up, using a Bayesian approach that we
believe readily extends to more complex scenes. We presented details
of a newly developed constraint based system for geometric modeling,
which we aim to make a meta-language for guiding our interpretation
algorithm without user intervention beyond model building.

Aerial Photo Matching and Interpretation

Abstract

In this paper we discuss some potential applications for the use of
multispectral imagery in conjunction with panchromatic mapping
imagery. The remote sensing community has traditionally used both
satellite and aerial imagery whose information content is not limited
to the visible portion of the electromagnetic spectrum. Computer
vision and image understanding for cartographic feature extraction has
primarily been limited to panchromatic imagery. While some navigation
and robotic sensing research has used sensors such as color
television, acoustic and laser rangefinders, and thermal imaging,
multispectral imagery has not been the subject of extensive research.

In the area of cartographic feature extraction it is clear that
having surface material information could aid in monocular and stereo
scene analysis. However, getting such information has been difficult,
particularly at spatial resolutions that are comparable with the
panchromatic imagery. We describe some current work in supervised
multispectral classification, the refinement of multispectral
classification using monocular panchromatic imagery, and the fusion of
stereo disparity maps with surface material information. Our
conclusions are that multispectral imagery with moderate spatial
resolution has great potential to provide scene domain cues necessary
to improve the performance of cartographic feature extraction based on
panchromatic imagery with high spatial resolution.¹


¹ This research was primarily sponsored by the U.S. Army Topographic
Engineering Center under Contracts DACA72-87-
Aeronautical Systems Division (AFSC), U.S. Air Force,
Wright-Patterson AFB, OH 45433-6543 under Contract F33615-90-C-1465,
ARPA Order No. 7597. The views and conclusions contained in this
document are those of the authors and should not be interpreted as
representing the official policies, either expressed or implied, of
the U.S. Army Topographic Engineering Center, or the Defense Advanced
Research Projects Agency, or of the United States Government.


1.  Introduction 

Over  the  last  twenty  years  the  remote  sensing 
community  has  focused  its  efforts  on  the  analysis  of 
remotely  sensed  multispectral  imagery  such  as  Landsat 
MSS,  Landsat  TM,  and  SPOT  using  computational 
models  derived  from  statistical  analysis.  As  the  pixel 
size  has  shrunk  from  80  meters,  to  30  meters,  to  10-20 
meters,  the  opportunity  has  increased  to  apply 
structural  and  spatial  analysis  techniques  developed  for 
high  resolution  aerial  imagery  to  satellite  data. 
Further,  a  new  generation  of  high  resolution  airborne 
multispectral  scanners  (Daedalus,  AVRIS,  MEIS,  etc.) 
can  be  flown  to  support  multispectral  pixel  resolutions 
well  below  five  meters.  An  opportunity  exists  to 
develop  image  understanding  techniques  to  support  the 
automated  analysis  of  high  resolution  multispectral 
imagery that goes beyond traditional statistical
analysis.  In  this  paper  we  discuss  some  preliminary 
work  to  evaluate  the  utility  of  multispectral  imagery  at 
high  spatial  resolution  as  a  source  of  information  for 
automated cartographic feature extraction. The variety
of applications, and the potential impact on automated
surface material classification, more accurate map
feature attribution, and improved thematic and land-use
maps, make this an important new area of research
within the context of computer vision.

This paper is organized into the following sections.² In
Section 2 we give a brief overview of some current
multispectral imaging systems, including a description
of their spectral and spatial resolutions. In Section 3
we describe the current state-of-the-art in manual
photointerpretation for various domain applications.
These applications have been primarily in areas such as
forestry, agricultural analysis, environmental studies,
and regional planning. We give a brief overview of
current practice in supervised classification of
multispectral imagery, including issues in training set
selection, band selection, and classifier types. We also
describe some tools for the generation of ground truth
classifications used to assess accuracy for classifier
performance evaluation. In Section 4 we describe
some initial experiments in the automatic generation of
surface material classifications using airborne
multispectral scanner data over Washington, D.C.
Finally, in Section 5 we show the application of coarse
surface material classification data to improve the
interpretation of stereo disparity estimates generated
using high resolution panchromatic aerial imagery.


² Figures 7 through 28 have been printed in color as an insert to
be included in the 1992 DARPA Image Understanding Workshop
Proceedings.

2.  Multispectral  Systems 

The  launch  of  the  Landsat  1  satellite  on  23  July  1972 
marked  the  beginning  of  a  new  era  of  digital  image 
data  acquisition  and  processing  for  the  remote  sensing 
community.  The  principal  imaging  instrument  onboard 
the  first  Landsat  vehicle  was  the  Multispectral  Scanner 
(MSS)  which  collected  image  data  at  a  spatial 
resolution  of  79  meters  simultaneously  in  four  spectral 
band  intervals.  Table  1  lists  Landsat  1  MSS  spectral 
bandwidths.  The  four  spectral  bands  measured 
reflected  solar  radiation  from  the  earth’s  surface 
covering  the  electromagnetic  spectrum  from  green  to 
near  infrared  and  recorded  the  measured  energy  as 
pixel  intensity  or  brightness  values  in  four  separate 
images  [Richards  86]. 

During  the  past  10  years,  a  trend  towards  higher  spatial 
resolution  for  orbital  multispectral  imaging  systems 
has occurred. In the early 1980’s, Landsat 4 was
launched carrying the Thematic Mapper (TM) as part
of its payload. The Thematic Mapper, a multispectral
imaging instrument, collects imagery at a higher spatial
resolution of 30 meters as compared to the Landsat 1
MSS  spatial  resolution  of  79  meters.  During  the  latter 
part  of  the  1980’s,  the  French  satellite  SPOT,  carrying 
the  High  Resolution  Visible  Imaging  Instrument 
(HRV),  became  operational  and  is  capable  of  acquiring 
multispectral  imagery  at  20  meter  spatial  resolution.  It 
also has a panchromatic operating mode at 10 meter
spatial  resolution,  and  the  sensor  can  be  steered  to 
provide  off  axis  imaging  to  allow  for  stereo  coverage. 
In recent years, United States government
organizations  and  professional  remote  sensing  societies 
have  proposed  multispectral  mapping  satellite  systems 
to  provide  high  spatial  resolution  (10  meter) 
stereoscopic  and  multispectral  coverage  of  the  Earth’s 
surface  [Colvocoresses  90,  Light  90].  The  ability  to 
collect  multispectral  imagery  at  these  spatial 
resolutions provides the scale needed to enable the
application  of  spatial  and  structural  analysis  to 
complex  scenes  in  urban/suburban  environments. 

2.1.  Daedalus  Airborne  Thematic  Mapper 

Until  a  multispectral  satellite  system,  such  as  the 
proposed  Landsat  7,  becomes  operational,  the  remote 
sensing  community  relies  upon  airborne  systems  to 
collect  high  resolution  multispectral  imagery.  The 
Daedalus  Airborne  Thematic  Mapper  (ATM)  is  an 
example of an aircraft-based multispectral imaging
system. The following sections cover the spectral and
spatial characteristics of this multispectral imaging
system along with two operational orbital multispectral
satellite systems.

2.1.1.  Spectral  Bandwidths 

The  Daedalus  ATM  is  an  airborne  system  configured 
to  acquire  multispectral  imagery  in  the  NASA-selected 
Landsat  TM  bands  and  additional  spectral  bands  not 
available  on  the  Thematic  Mapper  [Daedalus  83]. 
Figure  1  illustrates  the  spectral  bandpasses,  used  in 
collecting  reflected  electromagnetic  radiation  from  the 
earth’s  surface,  for  the  Daedalus  ATM  along  with  two 
popular  orbiting  satellite  multispectral  imaging 
systems,  Landsat  TM  (TM)  [Engel  and  Weinstein 
83,  NASA  84]  and  SPOT  HRV  [Chevrel,  et.  al. 
81,  Courtois  and  Traizet  86].  Tables  2,  3  and  4  give 
the  detailed  spectral  bandwidths  for  these  systems. 

Figure  1  is  a  schematic  version  of  the  average 
atmospheric  transmission  curve  from  the  earth’s 
surface  to  the  top  of  the  atmosphere.  The  bandpasses 
of  the  three  major  multispectral  imaging  systems  have 
been included to show their relative positions. The
atmosphere  interacts  with  the  propagating 
electromagnetic  radiation  in  two  ways  [Slater  80]: 

•  Scattering:  Electromagnetic  radiation  is  reflected  or 
refracted  by  particles  in  the  atmosphere  which  may 
range  in  size  from  gas  molecules  to  dust  particles  and 
large  water  droplets. 

•  Absorption:  Electromagnetic  radiation  is  selectively 
absorbed  by  gas  molecules  contained  in  the 
atmosphere. 

Atmospheric  scattering  and  absorption  processes  can 
severely  attenuate  the  electromagnetic  radiation  as  it 
passes  through  the  atmosphere.  For  example,  the 
atmosphere  heavily  attenuates  electromagnetic 
radiation  in  the  spectral  region  between  1.80  -  1.95 
microns  due  to  a  strong  water  vapor  absorption  band 
located  at  1.9  microns.  As  a  consequence,  these 
processes  dictate  the  electromagnetic  regions  where 
passive  optical  remote  sensing  systems  can  collect 
reflected  solar  radiation  from  the  earth’s  surface. 
These electromagnetic regions are commonly referred
to  as  atmospheric  windows  and  have  been  labeled  in 
Figure  1  as: 

•  Visible  (VIS)  (BLUE,  GREEN  and  RED). 

• Near Infrared (NIR1, NIR2 and NIR3).

• Shortwave Infrared (SWIR1 and SWIR2).

Using these electromagnetic regions, the spectral
bandpasses of the three multispectral imaging systems
can be characterized as follows:

• Daedalus ATM possessing 6 visible, 2 near infrared
and 2 shortwave infrared bands.

• Landsat TM collecting in 3 visible, 1 near infrared
and 2 shortwave infrared bands.

• SPOT HRV imaging with 2 visible and 1 near
infrared bands.


Landsat 1 Multispectral Scanner (MSS)

Band     Spectral Bandwidth   Electromagnetic
Number   (microns)            Region

4        0.500 - 0.600        Visible (GREEN)
5        0.600 - 0.700        Visible (RED)
6        0.700 - 0.800        Near Infrared (NIR1)
7        0.800 - 1.100        Near Infrared (NIR1-NIR2)

Spatial Resolution: 79 meters

Table 1: Reflective Spectral Bandwidths for Landsat 1 MSS


Figure 1: Reflective Spectral Bandpasses of Daedalus ATM, Landsat TM and SPOT
HRV Multispectral Imaging Systems

With the Daedalus ATM collecting reflected solar
energy in 10 spectral regions, spanning the visible to
shortwave infrared, its multispectral imagery provides
additional measurements in the visible and near
infrared electromagnetic regions.³ These
measurements are not available in Landsat TM or
SPOT HRV multispectral imagery and provide
additional spectral characterization of earth surface
materials.


³ It also has an eleventh channel in the thermal region between
8.5 and 13.0 microns whose response is sensitive to emitted energy
from natural and man-made surfaces.

2.1.2.  Spatial  Resolution 

Since the Daedalus ATM multispectral imaging system
is aircraft-based, the sensor has the capability to
acquire high resolution multispectral imagery by flying
the system at lower altitudes. When comparing the
Daedalus ATM collection set spatial resolution to
Landsat TM or SPOT HRV multispectral imagery, the
Daedalus ATM imagery (8 meter) possesses
considerably higher spatial resolution than either of the
two satellite systems (30 and 20 meter, respectively).⁴
As a result of the Daedalus ATM 8 meter spatial
resolution, detailed spatial scene features are present in
the Daedalus image structure which are not evident in
Landsat TM or SPOT HRV multispectral imagery. For
example, Figures 7, 8, and 9 demonstrate this
difference in spatial resolution between the Daedalus
ATM imagery and simulated SPOT HRV and Landsat
TM multispectral imagery for a near infrared image
(Daedalus ATM bands 7, 5, and 3). The simulated
spatial resolutions for the SPOT HRV and Landsat TM
images were created by spatially reducing the Daedalus
imagery.


Daedalus Airborne Thematic Mapper (ATM)

Band     Spectral Bandwidth   Electromagnetic
Number   (microns)            Region

1        0.420 - 0.450        Visible (BLUE)
2        0.450 - 0.520        Visible (BLUE)
3        0.520 - 0.600        Visible (GREEN)
4        0.605 - 0.625        Visible (RED)
5        0.630 - 0.690        Visible (RED)
6        0.695 - 0.750        Visible (RED)
7        0.760 - 0.900        Near Infrared (NIR1)
8        0.910 - 1.050        Near Infrared (NIR2)
9        1.550 - 1.750        Shortwave Infrared (SWIR1)
10       2.080 - 2.350        Shortwave Infrared (SWIR2)

Table 2: Reflective Spectral Bandwidths for Daedalus ATM

In  the  Daedalus  image  (Figure  7),  features  such  as 
buildings,  road  networks,  shadows,  and  ground 
structures  (e.g.  reflecting  pools,  treetops,  lawn  area, 
parking  lots,  etc.)  can  be  readily  identified.  However, 
in  the  simulated  SPOT  HRV  image  (Figure  8),  these 
features  are  no  longer  easily  discernible.  For  the 
simulated  Landsat  TM  image  (Figure  9),  only 
significant  areal  features  such  as  the  mall  area,  large 
water  bodies,  and  city  blocks  are  readily  identifiable. 
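
The text does not say how the Daedalus imagery was spatially reduced to simulate the coarser sensors. As an illustration only, the sketch below uses block averaging by an integer factor, which is one simple way to lower spatial resolution. The function name block_average and the small sample image are hypothetical, and since 8 meters to 20 or 30 meters is not an integer ratio, the actual simulation presumably used some form of resampling instead.

```python
def block_average(image, factor):
    """Reduce a single-band image by averaging factor x factor pixel blocks.

    `image` is a list of rows of brightness values; trailing rows or columns
    that do not fill a complete block are dropped.
    """
    rows = len(image) // factor
    cols = len(image[0]) // factor
    reduced = []
    for r in range(rows):
        out_row = []
        for c in range(cols):
            block = [image[r * factor + i][c * factor + j]
                     for i in range(factor) for j in range(factor)]
            out_row.append(sum(block) / len(block))
        reduced.append(out_row)
    return reduced


# Hypothetical 4 x 4 single-band image reduced by a factor of 2.
image = [
    [1, 3, 10, 10],
    [5, 7, 10, 10],
    [0, 0, 2, 4],
    [0, 0, 6, 8],
]
coarse = block_average(image, 2)
```

Each output pixel replaces a whole block of input pixels, which is why small features such as individual buildings disappear as the factor grows.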

3.  Multispectral  Photointerpretation 

One advantage of multispectral imagery over
panchromatic (i.e. black and white) imagery is the
ability to view the imagery in various combinations of
color presentations. For visual interpretation tasks, this
flexibility enables the image analyst to select the
appropriate band combinations which provide the best
visual discrimination between land-cover types of
interest [Holfer 78].


⁴ As stated previously, SPOT has a 10 meter panchromatic
capability. The focus of this work is multispectral imagery with
high spatial resolution.

The  utilization  of  multispectral  imagery  to  aid  in 
identifying  and  mapping  land-cover  types  crosses 
many  disciplines.  These  disciplines  range  from 
forestry  to  regional  planning  and  their  particular 
interpretation  tasks  are  just  as  varied.  As  a 
consequence,  each  discipline  must  determine  the 
appropriate land-cover types in order to address their
task  at  hand.  The  following  brief  discussion  gives  an 
overview  of  a  variety  of  remote  sensing  applications 
from  various  disciplines. 

Forestry  issues  include  the  identification  of  tree 
species,  measurement  of  the  damage  due  to  insects, 
pollution,  environmental  stress,  and  forestry 
management [Goodenough, et. al. 87, Hopkins, et. al.
88]. There has been recent interesting work [Ahern, et
al.  91]  to  compare  the  performance  of  conventional 
digital  classification  techniques  and  visual 
interpretation  of  digitally  enhanced  monotemporal 
SPOT  HRV  multispectral  images  for  the  identification 
of  tree  species. 

Crop analysis and agricultural studies were among the
first major justifications for the Landsat 1 and 2
programs. Applications range from
vegetable crop censuses [Williams, et. al. 87] to the use of
airborne Daedalus ATM data to map the extent of
irrigated crops [Williamson 89].


Landsat Thematic Mapper (TM)

Band     Spectral Bandwidth   Electromagnetic
Number   (microns)            Region

1        0.450 - 0.520        Visible (BLUE)
2        0.520 - 0.600        Visible (GREEN)
3        0.630 - 0.690        Visible (RED)
4        0.760 - 0.900        Near Infrared (NIR1)
5        1.550 - 1.750        Shortwave Infrared (SWIR1)
7        2.080 - 2.350        Shortwave Infrared (SWIR2)

Spatial Resolution: 30 meters

Table 3: Reflective Spectral Bandwidths for Landsat TM


SPOT High Resolution Visible (HRV) Imaging Instrument

Band     Spectral Bandwidth   Electromagnetic
Number   (microns)            Region

1        0.500 - 0.590        Visible (GREEN)
2        0.610 - 0.680        Visible (RED)
3        0.790 - 0.890        Near Infrared (NIR1)

Spatial Resolution: 20 meters

Table 4: Reflective Spectral Bandwidths for SPOT HRV


Land use analysis and regional planning often integrate
geographic information systems, image processing, and
multispectral classification [Ehlers, et. al. 90]. Land-
cover inventories [Baumann 90] have been used to
monitor transitional land-use from agricultural to
second-home property. Wildlife management
applications range from estimation of bird foraging
acreage [Hodgson, et. al. 87] to habitat suitability
models for deer [Ormsby and Lunetta 87].

In  the  remainder  of  this  section  we  briefly  overview 
statistical  classification  techniques  and  describe  our 
development  of  a  supervised  spectral  classifier. 

3.1.  Statistical  Classification  Techniques 

With  the  collection  of  reflected  energy  in  several 
spectral  regions  as  illustrated  in  Figure  1,  multispectral 
imaging  systems  provide  an  insight  into  a  material’s 
spectral  signature.  Using  the  spectral  energy 
measurements  from  a  set  of  multispectral  bands, 
individual  multispectral  image  pixels  can  be  assigned 
or  classified  into  spectral  classes  (e.g.  grass,  tree, 
water,  soil,  etc.)  of  similar  measurements,  based  on  the 
multispectral image pixel’s intensity or brightness
values [Sabins 87]. Traditional multispectral
classifiers  can  be  categorized  into  one  of  two  methods: 
unsupervised  and  supervised.  The  primary  distinction 
between  the  two  multispectral  classification  procedures 
centers  around  the  involvement  and  interaction  of  the 
image  analyst  with  the  classification  process. 
Typically, the image analyst must spend time
identifying candidate spectral classes, and representative
pixels for each class (called training sets),
prior to supervised classification. Input band selection,
however, is an issue common to both methods. In the following
sections, we briefly discuss both methods and describe
our  development  of  a  traditional  supervised  classifier 
along  with  post-classification  evaluation  tools. 

3.2.  Unsupervised  Classification  of 
Multispectral  Imagery 

Unsupervised  classifiers  assign  multispectral  image 
pixels  to  spectral  classes  without  the  image  analyst 
having  foreknowledge  of  the  existence  or  names  of 
those  spectral  classes.  This  classification  procedure 
most  often  uses  clustering  methods  (i.e.  grouping 
spectrally  similar  pixels  together)  to  determine  the 
number and location of the spectral classes into which
the image data falls and to determine the spectral class
of each pixel. The image analyst then identifies and
labels those spectral classes a posteriori [Richards
86, Swain and Davis 78].
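
The text above does not name a particular clustering method. As a hedged illustration, the sketch below uses k-means, one common choice, to group spectrally similar pixels. The function names, the starting centers and the two-band sample pixels are all hypothetical.

```python
def nearest(pixel, centers):
    """Index of the center closest to `pixel` (squared Euclidean distance)."""
    return min(range(len(centers)),
               key=lambda k: sum((x - c) ** 2 for x, c in zip(pixel, centers[k])))


def kmeans(pixels, initial_centers, iterations=10):
    """Return (centers, labels) after a fixed number of k-means iterations."""
    centers = [list(c) for c in initial_centers]
    labels = []
    for _ in range(iterations):
        # Assignment step: each pixel joins its nearest cluster center.
        labels = [nearest(p, centers) for p in pixels]
        # Update step: each center moves to the mean of its member pixels.
        for k in range(len(centers)):
            members = [p for p, lab in zip(pixels, labels) if lab == k]
            if members:  # an empty cluster keeps its previous center
                centers[k] = [sum(band) / len(members) for band in zip(*members)]
    return centers, labels


# Hypothetical two-band pixels forming a dark group and a bright group.
pixels = [[10, 12], [11, 10], [12, 11], [80, 82], [81, 80], [82, 81]]
centers, labels = kmeans(pixels, [[0, 0], [100, 100]])
```

The resulting cluster indices carry no meaning by themselves; as the text notes, the analyst attaches names such as water or grass afterwards.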

3.3.  Supervised  Classification  of  Multispectral 
Imagery 

Supervised  classification  is  the  procedure  most  often 
used  for  quantitative  analysis  of  remote  sensing  image 
data. For supervised classifiers, spectral classes are
identified  before  multispectral  classification  occurs. 
First  an  image  analyst  or  domain  scientist  selects  a 
representative  set  of  pixels,  commonly  called  a  training 
set,  to  statistically  describe  each  of  the  desired  spectral 
classes.  These  training  sets  establish  potential  spectral 
classes  in  multispectral  feature  space  to  which  the 
image  pixels  can  be  assigned.  During  the  actual 
classification  phase,  each  individual  image  pixel  is 
compared  to  each  spectral  class  and  assigned  to  the 
spectral  class  to  which  the  image  pixel  has  highest 
likelihood  of  being  a  member  [Richards  86]. 

3.3.1.  Training  Sets 

The  selection  of  appropriate  land-cover  types  or 
spectral  classes  by  an  image  analyst  is  based  upon 
his/her  domain  knowledge,  the  scene  content,  and  the 
objectives  of  the  classification.  Once  a  list  of  thematic 
classes  has  been  established,  training  regions  are 
selected in the multispectral image. Traditionally, for
supervised training, the training sets are contiguous
pixels or blocks of pixels from representative locations
across the image, an approach called block training [Richards 86].
An alternative to block training is single-pixel
training set selection [Gong and Howarth 90], where
individual pixels in the training set are several pixels
away from any other pixel member of the training set.

Statistical measures are calculated for each spectral
class  training  set.  These  statistical  measures  include 
minimum  vector,  maximum  vector,  mean  vector, 
covariance  matrix  and  correlation  matrix.  The 
dimensionality  of  the  vectors  and  matrices  is 
determined  by  the  number  of  bands  used  in  the 
statistical  analysis.  Of  particular  interest  are  the 
number  of  samples,  mean  vector,  and  covariance 
matrix  with  the  latter  two  metrics  being  directly  used  in 
the  supervised  spectral  classifier.  Sufficient  training 
samples  for  each  spectral  class  must  be  present  to 
allow  reasonable  estimates  of  the  mean  vector  and 
covariance  matrix  [Richards  86,  Swain  and  Davis  78]. 

The  mean  vector  characterizes  the  spectral  class’s 
average  intensity  or  brightness  level  for  each 
multispectral  band  while  the  covariance  matrix 
describes  the  shape  and  orientation  of  the  spectral 
class’s  population,  assuming  a  multivariate  normal 
distribution.  The  diagonal  of  the  covariance  matrix 
contains  the  variance  or  dispersion  of  the  spectral 
class’s  brightness  levels  for  each  multispectral  band. 
To assess the amount of correlation between any two
multispectral bands for a given spectral class, the
correlation matrix can be consulted. By normalizing
the covariance matrix, a correlation matrix is generated
with coefficients ranging from -1.0 to 1.0, inclusive. A
pair of highly correlated bands will have correlation
coefficients near -1.0 or 1.0 while an uncorrelated band
pair will have coefficients close to 0.0. Highly
correlated band pairs indicate that a pixel’s brightness
level in band X can be predicted, within some tolerance,
given the pixel’s brightness level in band Y. Therefore,
techniques which add bands to improve
class discrimination will fail if the bands are highly
correlated with one another.
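
The training set statistics described above can be sketched as follows. This is a minimal illustration, not the authors' implementation. The function name training_statistics and the three-band sample values are hypothetical, and the covariance uses the unbiased n - 1 divisor, which is an assumption.

```python
from math import sqrt


def training_statistics(samples):
    """Return (mean vector, covariance matrix, correlation matrix).

    `samples` is a list of training pixels, each a list of per-band
    brightness values.
    """
    n = len(samples)
    bands = len(samples[0])
    mean = [sum(p[b] for p in samples) / n for b in range(bands)]
    # Sample covariance between every pair of bands (divides by n - 1).
    cov = [[sum((p[i] - mean[i]) * (p[j] - mean[j]) for p in samples) / (n - 1)
            for j in range(bands)] for i in range(bands)]
    # Normalizing by the per-band standard deviations gives the correlations.
    std = [sqrt(cov[b][b]) for b in range(bands)]
    corr = [[cov[i][j] / (std[i] * std[j]) for j in range(bands)]
            for i in range(bands)]
    return mean, cov, corr


# Hypothetical training set: band 1 tracks band 0 exactly, band 2 does not.
grass = [[10, 21, 5], [12, 25, 9], [14, 29, 9], [16, 33, 5]]
mean, cov, corr = training_statistics(grass)
```

In this sample the coefficient for bands 0 and 1 comes out at 1.0, so the second band adds nothing the first does not already provide, while band 2 is uncorrelated with both.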

3.3.2.  Band  Selection 

When  performing  spectral  classification,  the  number  of 
input  multispectral  bands  along  with  spectral  region 
locations  of  the  bands  will  influence  the  classification 
process,  in  terms  of  run  time  and  the  ability  to 
discriminate  between  the  candidate  spectral  classes. 
Therefore,  some  care  must  be  taken  when  selecting  the 
input  multispectral  bands  with  respect  to  the  candidate 
spectral classes. Common practice includes the
selection of one band each from the visible, the near infrared
and the shortwave infrared. Such a set forms a good basis
for  discrimination  between  water,  soil,  vegetation  and 
man-made  spectral  features.  A  fourth  band  from  any 
of  these  three  spectral  regions  may  be  included  for 
additional  discrimination  between  candidate  spectral 
classes. 

It  is  best  to  avoid  including  band  pairs  which  are 
highly  correlated,  if  possible.  The  correlation  matrix 
generated  from  the  spectral  class’s  training  set  should 
be  reviewed  for  obvious  high  correlation  between  band 
pairs.  Typically,  adjacent  spectral  bands  in  the 
electromagnetic  spectrum  tend  to  demonstrate  high 
correlation. There are several established metrics of
separability (i.e. the degree of distinguishability
between spectral classes) available to predict the best
multispectral band combinations for spectral
classifications [Mausel, et. al. 90, Richards 86, Swain
and  Davis  78].  These  metrics  are  based  on 
measurements  of  the  statistical  separability  between 
spectral  classes  of  interest  [Kailath  67,  Swain  and  King 
73]. 
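
The text does not single out one separability metric. As an illustrative assumption, the sketch below uses the Bhattacharyya distance between two normal distributions and its bounded Jeffries-Matusita form, evaluated one band at a time. Treating each band separately ignores inter-band covariance, which is a simplification made here for brevity. The class statistics are hypothetical.

```python
from math import exp, log, sqrt


def bhattacharyya_1d(mean_a, var_a, mean_b, var_b):
    """Bhattacharyya distance between two univariate normal distributions."""
    mean_term = (mean_a - mean_b) ** 2 / (4.0 * (var_a + var_b))
    var_term = 0.5 * log((var_a + var_b) / (2.0 * sqrt(var_a * var_b)))
    return mean_term + var_term


def jeffries_matusita(b):
    """Map a Bhattacharyya distance onto the bounded range [0, 2)."""
    return 2.0 * (1.0 - exp(-b))


# Hypothetical per-band (mean, variance) statistics for two spectral classes.
water = [(20.0, 4.0), (10.0, 4.0)]
grass = [(40.0, 100.0), (90.0, 100.0)]

per_band = [bhattacharyya_1d(mw, vw, mg, vg)
            for (mw, vw), (mg, vg) in zip(water, grass)]
# The band with the largest distance separates the two classes best.
best_band = max(range(len(per_band)), key=lambda b: per_band[b])
```

A larger distance indicates that the two classes overlap less in that band, so ranking bands or band combinations by such a metric gives a rough guide to which inputs are worth including.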

3.3.3.  Classifiers  in  The  Large 

Spectral  classifiers  for  the  quantitative  analysis  of 
remotely  sensed  image  data  can  range  from  complex  to 
very simplistic. The most commonly used techniques
include maximum likelihood, minimum distance,
parallelepiped, and Mahalanobis distance.

Maximum  likelihood  assumes  that  the  spectral  class 
probabilities  are  multivariate  normal  distributions. 
This  is  an  assumption,  rather  than  a  demonstrable 
property  of  natural  spectral  classes  [Richards  86].  The 
probability distribution of each individual spectral
class is modeled by using its mean vector and
covariance matrix as calculated from its training set.
When classifying a multispectral image pixel, the
probability of the pixel belonging to each of the
candidate spectral classes is determined and the pixel is
assigned to the spectral class with the highest probability.
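
A minimal sketch of the maximum likelihood decision rule follows, assuming equal prior probabilities for all classes and symmetric positive definite covariance matrices. The function names and the two-band class statistics are hypothetical. The comparison uses the log of the multivariate normal density with the constant term dropped.

```python
from math import log


def mahalanobis_sq_and_logdet(cov, diff):
    """Return (diff^T cov^-1 diff, ln|cov|) by Gaussian elimination.

    Assumes `cov` is symmetric positive definite, so no pivoting is needed
    and every pivot is positive.
    """
    n = len(diff)
    a = [row[:] + [diff[i]] for i, row in enumerate(cov)]  # augmented [cov | diff]
    logdet = 0.0
    for k in range(n):
        logdet += log(a[k][k])  # the determinant is the product of the pivots
        for r in range(k + 1, n):
            f = a[r][k] / a[k][k]
            for c in range(k, n + 1):
                a[r][c] -= f * a[k][c]
    # Back substitution solves cov * y = diff.
    y = [0.0] * n
    for k in reversed(range(n)):
        y[k] = (a[k][n] - sum(a[k][c] * y[c] for c in range(k + 1, n))) / a[k][k]
    return sum(d * yi for d, yi in zip(diff, y)), logdet


def classify_ml(pixel, classes):
    """Assign `pixel` to the class with the largest normal log-likelihood."""
    best, best_score = None, float("-inf")
    for name, (mean, cov) in classes.items():
        diff = [x - m for x, m in zip(pixel, mean)]
        d2, logdet = mahalanobis_sq_and_logdet(cov, diff)
        score = -0.5 * logdet - 0.5 * d2  # equal priors, constant term dropped
        if score > best_score:
            best, best_score = name, score
    return best


# Hypothetical (mean vector, covariance matrix) pairs from training sets.
classes = {
    "water": ([20.0, 10.0], [[4.0, 0.0], [0.0, 4.0]]),
    "grass": ([40.0, 90.0], [[100.0, 0.0], [0.0, 100.0]]),
}
label = classify_ml([28.0, 30.0], classes)
```

In this example the pixel lies closer to the water mean in raw brightness units, yet it is assigned to grass because the grass class has a much larger spread.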

When the number of training samples per candidate
spectral class is limited, the estimation of the statistics
of the class may be inaccurate, especially the
covariance matrix. In this situation, it is probably
best to resort to a classifier utilizing only the mean
positions of the candidate spectral classes, such as a
minimum distance classifier. When classifying a
multispectral image pixel, the distance from the pixel
to the mean of each candidate spectral class is
calculated and the pixel is assigned to the spectral class
with the smallest distance [Duda and Hart 73, Richards
86, Swain and Davis 78].
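
The minimum distance rule can be sketched in a few lines. Euclidean distance is assumed here, and the class names and mean vectors are hypothetical.

```python
def classify_min_distance(pixel, class_means):
    """Assign `pixel` to the class whose mean vector is nearest."""
    def dist_sq(mean):
        # Squared Euclidean distance gives the same ranking as the distance.
        return sum((x - m) ** 2 for x, m in zip(pixel, mean))
    return min(class_means, key=lambda name: dist_sq(class_means[name]))


# Hypothetical two-band class means estimated from small training sets.
class_means = {
    "water": [20.0, 10.0],
    "grass": [40.0, 90.0],
    "soil": [80.0, 70.0],
}
label = classify_min_distance([70.0, 75.0], class_means)
```

Because only the means are used, the rule needs far fewer training samples than a classifier that must also estimate a covariance matrix.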

A  parallelepiped  classifier  partitions  multispectral 
feature  space  using  n-dimensional 
parallelepipeds  [Jensen  86].  Parallelepiped  generation 
can  be  implemented  using  the  mean  vector  and 
standard  deviation  or  the  minimum  and  maximum 
vector  from  each  candidate  spectral  class.  In  both 
implementations,  lower  and  upper  endpoints  are  used 
to  define  the  length  of  the  parallelepiped’s  sides. 
Using  the  mean  vector  and  standard  deviation,  the 
endpoints  would  be  calculated  by  extending  out  +/-  n 
standard  deviations  from  the  mean  point  for  each  of  the 
bands  in  feature  space.  In  the  case  of  using  the 
extrema  vectors,  the  minimum  and  maximum  for  each 
band  in  feature  space  defines  the  endpoints.  A 
multispectral  pixel  is  assigned  to  a  particular  candidate 
spectral  class  if  the  pixel  resides  inside  that  spectral 
class’s  parallelepiped  [Richards  86]. 
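
The mean and standard deviation variant of the parallelepiped classifier might look like the sketch below. The function names, the choice of two standard deviations and the class statistics are hypothetical. Returning the first matching class is a simplification, since the text does not say how overlapping parallelepipeds are resolved.

```python
def build_box(mean, std, n_sigma=2.0):
    """Lower and upper endpoints n_sigma standard deviations from the mean."""
    lower = [m - n_sigma * s for m, s in zip(mean, std)]
    upper = [m + n_sigma * s for m, s in zip(mean, std)]
    return lower, upper


def classify_parallelepiped(pixel, boxes):
    """Return the first class whose box contains `pixel`, or None."""
    for name, (lower, upper) in boxes.items():
        if all(lo <= x <= hi for x, lo, hi in zip(pixel, lower, upper)):
            return name
    return None  # the pixel falls outside every parallelepiped


# Hypothetical two-band class statistics (mean vector, standard deviations).
boxes = {
    "water": build_box([20.0, 10.0], [2.0, 2.0]),
    "grass": build_box([40.0, 90.0], [10.0, 10.0]),
}
label = classify_parallelepiped([22.0, 12.0], boxes)
```

Unlike the distance-based rules, this classifier can leave a pixel unassigned when it lies outside all of the boxes.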

If  the  covariance  matrices  of  all  the  candidate  spectral 
classes  are  considered  to  be  equal  (i.e.  a  global 
covariance  matrix),  the  maximum  likelihood  classifier 
reduces to the Mahalanobis distance classifier. The
Mahalanobis distance is a statistical distance from a
multispectral pixel to a candidate spectral class’s mean
point, normalized by the global covariance
matrix [Duda and Hart 73, Richards 86]. Like the
minimum  distance  classifier,  the  smallest  distance 
between  the  pixel  and  a  spectral  class  mean  point 
determines  the  pixel’s  assignment.  The  normalization 
by  the  global  covariance  matrix  retains  a  degree  of 
direction  sensitivity,  like  a  maximum  likelihood 
classifier,  which  is  absent  in  a  minimum  distance 
classifier. 
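
The direction sensitivity mentioned above can be seen in a small two-band sketch. The closed-form 2 x 2 inverse keeps the code short. The function names, class means and global covariance matrix are hypothetical.

```python
def mahalanobis_sq_2band(pixel, mean, cov):
    """Squared Mahalanobis distance for two bands via the 2 x 2 inverse."""
    (a, b), (c, d) = cov
    det = a * d - b * c
    dx, dy = pixel[0] - mean[0], pixel[1] - mean[1]
    # (dx, dy) * inverse(cov) * (dx, dy)^T, inverse = [[d, -b], [-c, a]] / det
    return (d * dx * dx - (b + c) * dx * dy + a * dy * dy) / det


def classify_mahalanobis(pixel, class_means, global_cov):
    """Assign `pixel` to the class with the smallest Mahalanobis distance."""
    return min(class_means,
               key=lambda name: mahalanobis_sq_2band(pixel, class_means[name],
                                                     global_cov))


# Hypothetical class means and a global covariance with correlated bands.
class_means = {"asphalt": [50.0, 50.0], "concrete": [54.0, 50.0]}
global_cov = [[4.0, 3.0], [3.0, 4.0]]
label = classify_mahalanobis([53.0, 53.0], class_means, global_cov)
```

In plain Euclidean terms the sample pixel is nearer the concrete mean, but it lies along the direction in which the two bands vary together, so the Mahalanobis rule assigns it to asphalt.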

3.4. Supervised Spectral Classifier

Our supervised spectral classifier, mbclass, utilizes
pre-classification spectral class statistics derived from
the training sets and user-selected input multispectral
image bands to assign each multispectral image pixel
to one of the available candidate spectral classes from
the training sets. A maximum likelihood decision rule,
which assumes multivariate normal distributions for
the candidate spectral classes, is used as the
discriminant function for assigning multispectral image
pixels to one of the available candidate spectral classes.
Upon completion of classifying the multispectral
image, a classmap is generated, recording each
multispectral image pixel’s class assignment. We
show some examples of the results of this classification
process in Section 4. We use these results as a basis for
our experiments in multispectral fusion and refinement
in Section 5.

An inherent problem with this assignment method,
commonly called forced allocation, is that outlying
multispectral pixels are assigned to a candidate
spectral class even though they are not representative of
that spectral class. Such outliers may be due to sensor
noise or saturation during image acquisition, heavily
mixed pixels composed of two or more spectral ground
features (e.g. vegetation and water) or absence of
proper candidate spectral classes. To alleviate this
inherent  problem  with  forced  allocation,  a  threshold 
limit  can  be  specified  in  mbclass  to  minimize  the 
assignment  of  outliers  to  the  candidate  spectral 
classes  [Richards  86,  Swain  and  Davis  78]. 
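
How the threshold is defined inside mbclass is not described here. The sketch below shows one plausible form under stated assumptions: a pixel is left unclassified when its variance-normalized squared distance to the nearest class mean exceeds a limit. The per-band variances (a diagonal covariance), the class statistics and the limit of 9.21, roughly the 99 percent point of a chi-square distribution with two degrees of freedom, are all illustrative choices.

```python
UNCLASSIFIED = "unclassified"


def classify_with_threshold(pixel, classes, max_dist_sq):
    """Assign `pixel` to its nearest class, or reject it as an outlier.

    `classes` maps a class name to (mean vector, per-band variances).
    """
    best, best_d2 = UNCLASSIFIED, float("inf")
    for name, (mean, variances) in classes.items():
        d2 = sum((x - m) ** 2 / v for x, m, v in zip(pixel, mean, variances))
        if d2 < best_d2:
            best, best_d2 = name, d2
    # Without this check every pixel would be forced into some class.
    return best if best_d2 <= max_dist_sq else UNCLASSIFIED


# Hypothetical two-band class statistics.
classes = {
    "water": ([20.0, 10.0], [4.0, 4.0]),
    "grass": ([40.0, 90.0], [100.0, 100.0]),
}
saturated = classify_with_threshold([200.0, 200.0], classes, 9.21)
typical = classify_with_threshold([21.0, 11.0], classes, 9.21)
```

A saturated or heavily mixed pixel then shows up as unclassified in the classmap instead of inflating one of the candidate classes.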

3.5. Assessing Classification Accuracy

With the generation of classmaps by various
classification schemes, it is necessary to have some
method for evaluating and comparing the accuracy of
the generated classmaps. Such an assessment
is required to guide the
improvement of classification schemes. Various
techniques have been developed and implemented in
the remote sensing community to determine and
evaluate the accuracy of land-use/land-cover maps
derived from remotely sensed data [Aronoff 82, Dicks
and Lo 90, Fitzpatrick-Lins 81, Story and Congalton
86]. The basic accuracy assessment procedure
involves the selection of samples from the land-
use/land-cover map, verification of the samples using
ground truth, and statistical evaluation of the verified
samples from the land-use/land-cover map [Congalton,
et. al. 83, Congalton, et. al. 84, Congalton and Mead
86].

Typically,  in  an  operational  cartographic  environment, 
attempting  to  verify  every  sample  in  a  land-use/land- 
cover  map  is  prohibitively  time-consuming  and 
expensive.  Therefore,  a  sample  selection  scheme  is 
adopted  to  compile  a  representative  sample  population 
from  which  an  accuracy  estimate  of  the  entire  land- 
use/land-cover  map  can  be  determined.  Common 
sample selection strategies applied to land-use/land-
cover maps are simple random sampling without
replacement,  cluster  sampling,  stratified  random 
sampling,  systematic  sampling  and  stratified 
systematic  unaligned  sampling  [Congalton  88]. 
Arguments  for  and  against  each  of  these  strategies  can 
be  made,  and  the  decision  as  to  which  method  to  use  is 
often application and data dependent. The sampling
procedure utilized should minimize the effects of
spatial autocorrelation as well as ensure that all classes
of interest are adequately sampled [Dicks and Lo 90].
The selected map samples are then usually verified
with aerial photographs, and with field site visits when photo
verification is inadequate [Dicks and Lo
90, Fitzpatrick-Lins 81]. The verification results are
then  tabulated  in  an  error  or  confusion  matrix  with  an 
overall  map  accuracy  percentage  calculated  [Story  and 
Congalton  86]. 
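
The tabulation step can be illustrated with a short sketch. Rows hold the verified (reference) class and columns hold the class recorded in the map. The function names and the eight sample labels are hypothetical.

```python
def confusion_matrix(reference, predicted, labels):
    """Count samples by (reference class, mapped class)."""
    index = {label: i for i, label in enumerate(labels)}
    matrix = [[0] * len(labels) for _ in labels]
    for ref, pred in zip(reference, predicted):
        matrix[index[ref]][index[pred]] += 1
    return matrix


def overall_accuracy(matrix):
    """Fraction of samples on the diagonal, i.e. correctly mapped."""
    correct = sum(matrix[i][i] for i in range(len(matrix)))
    total = sum(sum(row) for row in matrix)
    return correct / total


# Hypothetical verified samples and the classes the map assigned to them.
labels = ["water", "grass", "soil"]
reference = ["water", "water", "grass", "grass", "grass", "soil", "soil", "soil"]
predicted = ["water", "grass", "grass", "grass", "soil", "soil", "soil", "water"]

matrix = confusion_matrix(reference, predicted, labels)
accuracy = overall_accuracy(matrix)
```

The off-diagonal entries show which classes are being confused with one another, which the single overall percentage does not reveal.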

We  are  beginning  to  develop  tools  to  allow  for 
accuracy  assessment  of  the  ground-truth  classification. 
Currently  we  rely  on  ground  truth  generation  using 
high  resolution  black  and  white  aerial  photographs. 
These  hand  segmentation  tools  are  useful  for 
generating  ground  truth  where  geometric  relationships, 
such as building size, shape and boundaries, are the
primary  focus.  With  regard  to  spectral  classification, 
the  ground  truth  needs  to  represent  the  material  types 
located  in  the  scene.  As  a  consequence,  our  previously 
developed  manual  hand  segmentation  tools  were 
inadequate  and  inappropriate.  We  have  begun  to 
address  this  inadequacy  by  the  development  of  an 
interactive  supervised  classification  tool,  iclass,  to 
generate  surface  material  ground  truths  in  the  form  of 
reference  classmaps.  In  the  following  section  we 
discuss  experiments  in  surface  material  classification 
using  our  supervised  classification  system,  MBCLASS. 

4.  Experiments  in  Surface  Material 
Classification 

One  area  of  current  research  is  the  automated 
construction  of  surface  material  classmaps  in  complex 
urban  environments.  The  surface  material  classmaps 
are generated by our supervised spectral classifier,
MBCLASS, using Daedalus ATM multispectral imagery
of Washington, D.C. This imagery was collected for
the  United  States  SPOT  HRV  Simulation 
Campaign  [SPOT  83,  SPOT  84]  and  should  not  be 
confused  with  actual  SPOT  HRV  imagery  at  20  meter 
spatial  resolution  having  three  spectral  regions  as 
outlined  in  Table  4.  The  following  sections  discuss  the 
generation  of  surface  material  classmaps  for  two  sites 
in  the  Washington,  D.C.  metropolitan  area. 

4.1.  Urban  Area  Test  Sites 

Figures 10 (CIVIL1) and 11 (GAO1) show two of the test
areas with which we have been experimenting. Both
images are shown in a near infrared encoding using
Daedalus bands 7, 5, and 3. The scene content is
representative  of  a  complex  urban  area  with  buildings 
of  various  shapes,  sizes,  and  heights,  street  networks, 
and  landscaped  areas.  Specific  scene  characteristics 
include: 

•  CIVIL1: The Civil Service Building is located in the
center of the image. Other major buildings and
landmarks relative to the Civil Service Building
include: the Reflecting Pool to the south, the State
Department to the west, the Interior Department
immediately to the east and the General Services
Administration to the northeast. A mixture of natural
terrain and man-made features makes up the image.

•  GAO1: The General Accounting Office Building and
Judiciary Square occupy the right of center portion of
the image. Other major buildings and landmarks
located in the image include: the FBI and Justice
Department in the southwest portion, the Labor
Department in the southeast section and I-395 just east
of the General Accounting Office Building. This
image  is  dominated  by  man-made  features  and 
structures. 

4.2.  Training  Sets 

The  objective  of  our  classification  task  involves  the 
generation  of  surface  material  classmaps  at  a  coarse 
level  for  urban  multispectral  scenes.  That  is  to  say  we 
are  initially  only  interested  in  characterizing  the 
primary  level  of  land-cover  detail.  In  our  urban 
analysis  problem,  the  two  primary  land-cover  types  of 
most  interest  to  us  are  natural  terrain  and  man-made 
features.  In  Figure  2,  these  two  primary  land-cover 
types  are  further  divided  into  specific  spectral  classes 
based  upon  visual  interpretation  of  the  Daedalus  ATM 
multispectral  imagery.  The  inclusion  of  a  shadow 
feature  in  the  spectral  class  hierarchy  alleviates 
misclassifications  of  shadow  pixels  as  water  spectral 
features  due  to  spectral  similarities  between  the  two 
features. 

Block  training  sets,  consisting  of  homogeneous  areas 
of  pure  pixels,  were  collected  manually  for  the 
candidate  spectral  classes  from  various  regions 
distributed  throughout  the  entire  Daedalus  ATM 
multispectral  imagery.  The  candidate  spectral  classes 
are  listed  in  Table  5  with  the  number  of  training 
samples  or  pixels  per  spectral  class.  The  combined 
total  of  spectral  class  training  set  samples  represents 
1.5% of this Daedalus ATM multispectral image.*

4.3.  Band  Selection 

During the process of selecting training sets for
pre-classification spectral class statistics, it was observed
that Daedalus ATM bands 3, 5, 7, and 10 provided
excellent color separation among the earth surface
features (e.g. grass, tree, water, soil, etc.) and
man-made features (e.g. road, bridge, building, parking lot,
etc.). Therefore, from a subjective point of view, it
was felt these four multispectral bands would provide
fairly reasonable spectral class separation for the
supervised spectral classifier, MBCLASS. As a
validation check, the average Jeffries-Matusita
Distance [Mausel, et. al. 90, Richards 86, Swain and


*  The  Daedalus  ATM  multispectral  imagery  over  Washington, 
D.C.  measured  716  rows  by  3000  columns  for  each  of  the  10 
reflective  bands. 
Figure 2: Urban Spectral Classes


Spectral Classes of Interest

Spectral Class Name   Number of Training Samples
asphalt                        733
concrete                       656
coniferous tree                 52
deciduous tree                9720
deep water                   16433
grass                         2535
shadow                         140
shallow water                  823
soil                           354
tile                           260
turbid water                  2269

Total Training Samples: 33975

Table 5: Spectral Classes
Used in Classification


Davis 78] was calculated, using the statistics from the
spectral class training sets, in order to rank spectral
class separability for all four-band combinations.
Table 6 lists the top 10 four-band combinations from
this calculation. The spectral class separability score for
Daedalus ATM band combination 3, 5, 7, and 10 is


Daedalus Airborne Thematic Mapper (ATM)

Band Combination    Average JM-Distance
3,4,7,10            1.980
2,3,4,7             1.979
3,4,6,10            1.978
3,5,7,10            1.977
3,4,7,9             1.977
3,6,7,10            1.977
3,4,6,7             1.977
2,3,4,6             1.976
3,4,5,7             1.976
2,3,5,7             1.976

Maximum Possible Distance: 2.0

Table 6: Top 10 Four Band Combinations
Using the Spectral Class Training Sets


1.977  with  a  ranking  of  4  out  of  210  band 
combinations. 
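
The ranking described above can be illustrated with a short sketch. It uses the common form of the Jeffries-Matusita distance, JM = 2(1 - exp(-B)), where B is the Bhattacharyya distance between two Gaussian class models, which is consistent with the maximum of 2.0 given in Table 6. As a simplifying assumption made here (not stated in the text), each class is given a diagonal covariance, so the bands are treated as independent. The class statistics and function names are hypothetical.

```python
import math
from itertools import combinations


def jm_distance(mean1, var1, mean2, var2):
    """JM distance between two Gaussian classes with diagonal covariance."""
    b = 0.0  # Bhattacharyya distance, accumulated band by band
    for m1, v1, m2, v2 in zip(mean1, var1, mean2, var2):
        v = 0.5 * (v1 + v2)
        b += 0.125 * (m1 - m2) ** 2 / v
        b += 0.5 * math.log(v / math.sqrt(v1 * v2))
    return 2.0 * (1.0 - math.exp(-b))


def average_jm(stats, bands):
    """Mean pairwise JM distance over all class pairs, using only `bands`."""
    pairs = list(combinations(stats, 2))
    total = 0.0
    for first, second in pairs:
        m1, v1 = stats[first]
        m2, v2 = stats[second]
        total += jm_distance([m1[k] for k in bands], [v1[k] for k in bands],
                             [m2[k] for k in bands], [v2[k] for k in bands])
    return total / len(pairs)


def rank_band_subsets(stats, n_bands, size):
    """All band subsets of the given size, most separable first."""
    subsets = combinations(range(n_bands), size)
    return sorted(subsets, key=lambda s: average_jm(stats, s), reverse=True)


# Ten bands taken four at a time give the 210 candidates cited in the text.
n_four_band = len(list(combinations(range(10), 4)))

# Hypothetical per-class (means, variances) for a three-band toy image.
stats = {
    "grass": ([40.0, 90.0, 30.0], [25.0, 36.0, 16.0]),
    "water": ([20.0, 10.0, 28.0], [16.0, 25.0, 16.0]),
    "asphalt": ([60.0, 50.0, 31.0], [36.0, 36.0, 16.0]),
}
ranking = rank_band_subsets(stats, n_bands=3, size=2)
print(n_four_band, ranking)
```

Because the distance saturates at 2.0, well separated combinations differ only in the third decimal place, as in Table 6.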

4.4.  Generated  Surface  Material  Classmaps 

A surface material classmap was generated for each
site, CIVIL1 and GAO1, using our supervised spectral
classifier, MBCLASS, with forced allocation invoked.
The resulting classmaps are shown in Figures 12 and
13, respectively, with the class assignment of each
multispectral pixel encoded by its color. A
multispectral class legend is shown in Figure 18.

The forced allocation method requires that each pixel
in the multispectral image be assigned to one of the
candidate spectral classes. As discussed earlier, this
technique often results in the assignment of outlying
multispectral pixels to a candidate spectral class of
which they are not representative. Two examples of this
occur with water and shadow pixels. Our shadow spectral
class was defined by training pixels with no direct
illumination and minimal downwelling illumination
due to atmospheric scattering. Therefore, we would
expect little surface material information to reach the
sensor. Shallow water pixels along the edge of the
reflecting pool in Figure 12 are classified as shadow
due to the mixing of water pixels with the immediately
surrounding background. The water and shadow
spectral classes have similar spectral features and these
mixed pixels exhibit properties closest to the shadow
class. In a second case, several shadow areas
alongside buildings in both test areas, shown in
Figures 12 and 13, are classified as shallow water. This
is a result of indirect illumination (i.e. scattered solar
radiation) on the asphalt street surfaces. These mixed
pixels occur at the transition edge between shadows and
directly illuminated asphalt.

Traditionally, this problem is handled either via
post-processing such as contextual classification, or the
application of a threshold limit based on percentage
confidence followed by a second stage classification
[Wharton 82]. We address these problems in the
following  section  on  multispectral  fusion,  where  we 
use  additional  information  in  the  form  of  high 
resolution  panchromatic  aerial  imagery  to  refine  the 
classification  estimates. 

5.  Multispectral  Fusion 

In  this  section  we  describe  recent  research  to  evaluate 
the  utility  of  information  fusion  between  multispectral 
imagery with moderate spatial resolution and high
resolution  panchromatic  aerial  imagery.  We  have 
focused  on  two  interesting  problems.  The  first 
experiment  is  the  refinement  of  surface  material 
classmaps  using  monocular  segmentations  of  high 
resolution  panchromatic  imagery.  The  second 
experiment  involves  the  fusion  of  high  resolution 
stereo  disparity  maps  with  moderate  resolution  surface 
material classmaps. Our two experimental sites, CIVIL1
and GAO1, are used to illustrate the process and utility
of  surface  material  classmap  refinement  and  fusion. 

5.1.  Classmap  Refinement  using  High 
Resolution  Panchromatic  Imagery 

Surface material classmaps generated by traditional
spectral classifiers tend to have isolated pixels or small
groupings of pixels assigned to a class that is out of
context with their immediate neighbors. For example,
the CIVIL1 classmap in Figure 12 has shallow water
pixels in shadow regions located on city streets and
shadow pixels in the middle of waterbodies composed
of shallow water pixels.

One post-classification processing technique to rectify
classification inconsistencies employs a class majority
operator which retains or changes the central pixel’s
class assignment based on the immediate
neighborhood’s class composition and a set of
user-specified decision rules [Townsend 86]. A second
technique uses context and conditional probabilities to
modify classifications [Wharton 82]. These
post-classification processing techniques use either
information encoded in the classmap to make a
decision, or a priori knowledge concerning common
mixed pixel confusions. The technique which we are
pursuing utilizes ancillary spatial scene information in
the form of high resolution black and white aerial
imagery to perform the refinement of the surface
material classmap.
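
For comparison with the approach pursued here, a class majority operator of the kind cited above can be sketched as follows. The decision rule used (relabel the central pixel only when another class holds at least `min_votes` of its neighbors) is a hypothetical stand-in for the user-specified rules mentioned in the text.

```python
from collections import Counter


def majority_filter(classmap, min_votes=5):
    """Relabel a pixel when one other class dominates its 3x3 neighborhood."""
    rows, cols = len(classmap), len(classmap[0])
    out = [row[:] for row in classmap]
    for r in range(rows):
        for c in range(cols):
            neighbors = [classmap[rr][cc]
                         for rr in range(max(r - 1, 0), min(r + 2, rows))
                         for cc in range(max(c - 1, 0), min(c + 2, cols))
                         if (rr, cc) != (r, c)]
            label, votes = Counter(neighbors).most_common(1)[0]
            if label != classmap[r][c] and votes >= min_votes:
                out[r][c] = label
    return out


# A single shadow pixel in the middle of a shallow water body.
pool = [["water", "water", "water"],
        ["water", "shadow", "water"],
        ["water", "water", "water"]]
cleaned = majority_filter(pool)
print(cleaned[1][1])
```

Such an operator sees only the classmap itself, which is the limitation that motivates bringing in the panchromatic imagery.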

5.1.1.  Monocular  Fusion 

Figures  3  and  5  are  the  left  images  from  a  stereopair  of 
high  resolution  black  and  white  aerial  imagery  with 
spatial  resolution  of  approximately  1.2  meter.  These 
images  each  cover  a  subscene  area  centrally  located  in 
the  multispectral  images  in  Figures  10  and  11. 

We use surface illumination information represented
by the segmentation of the high resolution
panchromatic images into very fine surface patches of
nearly homogeneous intensity to provide a basis for
classmap refinement [McKeown and Perlant 92]. Each
nearly uniform intensity region identifies a surface
patch having, in the ideal case, a nearly homogeneous
surface material type within that region. These patches
are used to guide a statistical analysis of the classmap
based on the assumption that such patches correspond
closely with physical surfaces of the scene. We
attempt to verify or modify the class assignments
encoded in the surface material classmaps by
examining the classmap statistics within each of the
corresponding patch regions.

5.1.2.  Multisensor  Registration 

Before  the  spatial  segmentation  information  can  be 
applied  to  the  classmaps,  the  surface  material 
classmaps  must  be  registered  to  the  high  resolution 
black and white aerial imagery. Two issues arising
from the registration of the moderate resolution
classmaps  (8  meter)  to  the  finer  resolution  aerial 
imagery  (1.2  meter)  are  maintenance  of  geometric 
accuracy  and  preservation  of  the  class  assignment 
information. 

The first issue is concerned with providing accurate
geometric correspondence or image registration
between two different image sets. The multispectral
imagery was acquired by an electro-optical imaging

Figure 3: CIVIL1 Left Image (1.2 meter)


Figure 4: CIVIL1 Refined S1S2 Disparity Image


Figure 5: GAO1 Left Image (1.2 meter)

system  while  the  panchromatic  imagery  was  captured 
by  a  film-based  system  then  subsequently  digitized. 
Each  system  has  its  own  inherent  optical  and  imaging 
characteristics  which  affect  the  spatial  and  radiometric 
properties  of  the  resulting  digital  image.  These  image 
properties  include  distortion,  rotation,  translation,  and 
intensity  level  dynamic  range.  To  overcome  these 
image  differences  during  the  image-to-image 
registration  process,  we  first  correspond  the  images  to 
a  map-based  reference  system  using  a  database  of 
image chips with associated map coordinates (i.e.
latitude  and  longitude).  Once  this  image-to-map 
correspondence  is  completed,  we  have  the  ability  to 
determine  the  pixel  coordinates  from  one  image  set  to 
another  image  set  by  utilizing  the  image-to-map  and 
map-to-image  correspondence  model  in  CONCEPTMAP 
[McKeown  84,  McKeown  87,  Perlant  and  McKeown 
90].  This  model  provides  the  geometric  image 
registration  between  the  classmaps  and  panchromatic 
imagery. 

Figure 6: GAO1 Refined S1S2 Disparity Image

The second issue, preserving the class assignments in
the classmaps, is equally important. Resampling of
the classmap to the aerial imagery must preserve the
original intensity levels of the classmaps since these
intensity levels encode the class assignment at each
pixel location. A registration procedure utilizing
interpolation would destroy the encoding scheme and
result in erroneous pixel class assignments. Therefore,
a nearest-neighbor assignment procedure is used to
ensure the preservation of the classmap’s original class
assignment at each pixel location.
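
The reason interpolation is unsuitable can be seen in a small sketch. It assumes, purely for illustration, that the two grids are already aligned and differ only in pixel size, so the CONCEPTMAP geometric model is not represented; the class codes and the function name are hypothetical.

```python
def resample_nearest(classmap, src_size, dst_size):
    """Resample a class-coded grid to a finer pixel size (same origin)."""
    scale = dst_size / src_size  # source pixels spanned by one output pixel
    src_rows, src_cols = len(classmap), len(classmap[0])
    rows = int(src_rows * src_size / dst_size)
    cols = int(src_cols * src_size / dst_size)
    out = []
    for r in range(rows):
        # Source row containing the center of output row r.
        sr = min(int((r + 0.5) * scale), src_rows - 1)
        out.append([classmap[sr][min(int((c + 0.5) * scale), src_cols - 1)]
                    for c in range(cols)])
    return out


# Hypothetical class codes: 1 = asphalt, 2 = concrete, 3 = grass.
coarse = [[1, 3],
          [3, 1]]
fine = resample_nearest(coarse, src_size=8.0, dst_size=2.0)

# Averaging the codes of asphalt (1) and grass (3) would yield 2, the
# code of an unrelated class; nearest neighbor only copies codes.
interpolated_code = (1 + 3) / 2
codes_present = {code for row in fine for code in row}
print(len(fine), len(fine[0]), codes_present, interpolated_code)
```

Every output pixel copies the code of exactly one source pixel, which is also why the registered classmaps look blocky.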

Figures 14 and 15 show the area of image
correspondence between the high resolution aerial
imagery (1.2 meter) and the generated surface material
classmaps (8 meter). To register the classmaps to the
high resolution aerial imagery, the image-to-map and
map-to-image correspondence model provided by
CONCEPTMAP’s landmark database is employed,
resulting in the registered surface material classmaps
shown in Figures 16 and 17. The registered surface
material classmaps appear blocky due to the effects of
using image-to-image pixel correspondence between
the 8 meter classmaps and 1.2 meter aerial imagery
with nearest-neighbor resampling.

5.1.3.  Refinement 

Once  we  have  registered  the  imagery,  we  can  use  the 
region  boundaries  in  the  panchromatic  imagery  to 
select  classmap  values  from  the  registered  classmap 
imagery.  A  histogram  of  classmap  values  within  each 
region  is  computed  and  a  statistically  significant  value 
is  selected,  based  upon  the  sample  size.  In  cases  where 
no  class  can  be  selected,  we  set  the  points  in  the  region 
to  unclassified.  Additional  processing  can  include  the 
use  of  multiple  segmentations  and  merging  of  classmap 
values  across  multiple  regions  [McKeown  and  Perlant 
92]. 
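
The per-region selection can be sketched as follows. The text does not give the significance test, so the rule used here (a strict majority within a region of at least `min_pixels` samples) is an assumption made for illustration, as are the unclassified code and the function name.

```python
from collections import Counter

UNCLASSIFIED = 0  # hypothetical code for "no class selected"


def refine_classmap(classmap, regions, min_fraction=0.5, min_pixels=4):
    """Relabel each segmentation region with its dominant class, if any."""
    members = {}
    for r, row in enumerate(regions):
        for c, region_id in enumerate(row):
            members.setdefault(region_id, []).append((r, c))

    refined = [row[:] for row in classmap]
    for pixels in members.values():
        histogram = Counter(classmap[r][c] for r, c in pixels)
        label, count = histogram.most_common(1)[0]
        # Assumed acceptance rule: enough samples and a strict majority.
        accepted = (len(pixels) >= min_pixels
                    and count / len(pixels) > min_fraction)
        for r, c in pixels:
            refined[r][c] = label if accepted else UNCLASSIFIED
    return refined


# Registered classmap (class codes) and panchromatic region labels.
classmap = [[3, 3, 5, 7],
            [3, 4, 5, 7]]
regions = [[1, 1, 2, 2],
           [1, 1, 2, 2]]
refined = refine_classmap(classmap, regions)
print(refined)
```

Region 1 is dominated by class 3 and is relabeled as a whole; region 2 is evenly split, so its pixels are set to unclassified.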

Figures 19 and 21 demonstrate the effects of this
refinement process. The blocky appearance in the
registered classmaps has been improved using the fine
grain intensity patches obtained from the high
resolution panchromatic imagery. The unclassified
pixels in the refined surface material classmap are
indicated in black. One significant improvement
generated by the refinement process is illustrated by
the correct reassignment of soil pixels to concrete
pixels on the General Accounting Office Building’s top
level roof as shown in Figures 17 and 21. Ground
features, such as grass areas and asphalt streets, have
been diminished in area or extended into other surface
materials. This is illustrated by the tile building,
immediately below the General Accounting Office
Building, with its border areas of grass reassigned to
asphalt and portions of its rooftop modified to
asphalt. These reassignments can be attributed to the
difference in view angle between the Daedalus ATM
multispectral imagery (nadir viewing) and the aerial
panchromatic images (off-nadir viewing).

A  more  quantitative  analysis  of  improvement  requires 
a  highly  accurate  ground-truth  surface  material 
classification. This is a current goal of our work in
assessing  accuracy  for  supervised  classification 
(Section  3.5)  as  well  as  in  assessing  the  improvements 
due  to  refinement. 

5.2. Fusion of Classmap with Disparity Images

Various procedures for merging multisensor and
multiresolution satellite image data have been explored
by the remote sensing community to create composite
images that provide more visual information to the
image analyst in an interpretable manner [Chavez, et.
al. 91]. Previous composite image generation schemes
range from merging Landsat TM and SPOT HRV
image data sets [Welch and Ehlers 87] to merging
temperature maps with SPOT HRV panchromatic imagery
[Schott 89]. Along these same lines, we have implemented
a method of merging a refined surface material classmap
with a corresponding refined S1S2 disparity image to
generate a composite image, displaying surface
material and height information simultaneously.


As previously discussed, the refined classmap encodes
the surface material type for each image pixel. The
refined disparity maps [McKeown and Perlant 92]
contain height information as derived by an
area-based stereo system, S1, and a feature-based stereo
system, S2 [Hsieh, et. al. 92]. The disparity images are
encoded such that the brighter the pixel intensity level,
the closer the pixel surface is to the viewer. Dark
pixels are at or below the ground plane. Black pixels
indicate that a consistent or refined height estimate was
not possible as a result of merging the results of the
stereo systems. For example, in Figures 4 and 6,
building rooftops appear light gray to white and ground
level appears medium to dark gray.

Our  classmap-disparity  image  merging  procedure 
utilizes  a  hue-saturation-value  color  model  [Foley  and 
Van  Dam  83,  Smith  78]  to  transform  color  images, 
defined  in  terms  of  red,  green  and  blue  (RGB) 
components,  into  corresponding  hue,  saturation  and 
value  (HSV)  components.  To  fuse  the  surface  material 
classmap  and  disparity  image  together,  the  refined 
surface  material  classmap  is  transformed  from  RGB  to 
HSV  space.  Next,  the  refined  surface  material 
classmap’s V component is replaced by a
corresponding refined S1S2 disparity image which has
been scaled (i.e. linearly stretched to fill the dynamic
range of 0-255 intensity levels). Finally, the inverse
transformation from HSV to RGB is performed. The
resulting  color  image  contains  both  surface  material 
type  and  height  information  for  visual  inspection. 
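
The three steps of the merging procedure can be sketched with Python's standard `colorsys` module. The nested-list image representation and the function names are assumptions made for illustration; the original implementation is not described at this level of detail.

```python
import colorsys


def stretch(disparity):
    """Linearly stretch disparity values to fill the 0-255 range."""
    flat = [v for row in disparity for v in row]
    lo, hi = min(flat), max(flat)
    span = (hi - lo) or 1  # avoid dividing by zero on a constant image
    return [[round(255 * (v - lo) / span) for v in row] for row in disparity]


def fuse(class_rgb, disparity):
    """Replace the V channel of a color-coded classmap with disparity."""
    scaled = stretch(disparity)
    fused = []
    for rgb_row, d_row in zip(class_rgb, scaled):
        out_row = []
        for (r, g, b), d in zip(rgb_row, d_row):
            # RGB -> HSV, keep hue and saturation, discard the old value.
            h, s, _ = colorsys.rgb_to_hsv(r / 255, g / 255, b / 255)
            # HSV -> RGB with the scaled disparity as the new value.
            rr, gg, bb = colorsys.hsv_to_rgb(h, s, d / 255)
            out_row.append((round(rr * 255), round(gg * 255), round(bb * 255)))
        fused.append(out_row)
    return fused


# One green (class color) pixel at the lowest disparity and one gray
# pixel at the highest disparity.
class_rgb = [[(0, 255, 0), (128, 128, 128)]]
disparity = [[10, 30]]
composite = fuse(class_rgb, disparity)
print(composite)
```

Hue and saturation carry the class color through unchanged, so only brightness varies with height; a pixel with zero scaled disparity is rendered black whatever its class.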

The refined surface material classmaps, in Figures 19
and 21, were fused with their respective refined S1S2
disparity images shown in Figures 4 and 6. The
resulting surface material-height composite images are
illustrated in Figures 20 and 22. The color scheme
from the refined surface material classmap has been
preserved in the surface material-height images with
color brightness indicating closeness to the viewer,
analogous to the refined disparity image’s brightness
coding. For example, bright asphalt, tile and concrete
features support the presence of a building rooftop
while dark grass and asphalt features indicate lawn
areas and streets at ground level, respectively. The
Interior Department Building, in Figure 20, with its
erroneous soil features, signals the need for modification
of the surface material classification.

Utilizing the refined S1S2 disparity image for height
information, a three-dimensional perspective view of
the refined surface material scene can be generated.
Perspective views of the CIVIL1 and GAO1 refined surface
material classmaps are shown in Figures 24 and 26.
The color scheme from the refined surface material
classmap has been preserved in the three-dimensional
scene. Figures 25 and 27 are also three-dimensional
refined surface material scenes, identical to Figures 24
and 26, but with the image texture incorporated to
improve scene structure. Both the surface
material-height image and the perspective views of the refined
surface material classmap provide visualizations of
scene information. This is a useful starting point for
interactive or automated verification and modification
of the surface material classmap and the stereo
disparity estimates.

Beyond  visualization,  we  are  interested  in  using  the 
surface  material  class  to  aid  in  the  refinement  and 
interpretation of the disparity information. We believe
that the coupling of surface material classification with
height information will allow us to move toward an
automated interpretation of
the  scene  in  terms  of  objects  such  as  buildings,  trees, 
and  roads.  Ambiguities  such  as  an  asphalt  surface  road 
adjacent  to  an  asphalt  roofed  building  can  be 
disambiguated  using  height  information.  Similarly, 
errors  in  height  information  can  be  detected  by  looking 
for  implausible  material  information.  For  example, 
isolated  tree  stands  when  viewed  purely  from  a  height 
standpoint  might  be  subject  to  noise  processing  based 
upon  their  lack  of  regular  structure.  Given  that  we  can 
also  acquire  spectral  information  consistent  with  trees 
or  vegetation  we  may  be  able  to  recognize  such 
features  directly. 

6.  Conclusions 

Multispectral  imagery  has  provided  valuable 
information  to  the  remote  sensing  community  for  the 
past  20  years.  In  this  paper,  we  discuss  opportunities 
for  the  image  understanding  community  to  utilize 
multispectral  information  in  conjunction  with  high 
resolution  panchromatic  imagery.  We  provide  a  brief 
overview of spectral properties for the two major
satellite multispectral systems, Landsat TM and SPOT
HRV, and describe an airborne multispectral system,
the Daedalus ATM. A brief discussion of manual
multispectral photointerpretation and automated
classification techniques was presented, including
applications in many disciplines such as forestry,
agriculture, regional planning, and wildlife
management.

We  described  a  supervised  classification  procedure, 
MBCLASS,  for  multispectral  imagery  using  the 
Daedalus ATM scanner data over Washington, D.C.
The  goal  is  to  automatically  generate  surface  material 
classmaps  that  can  be  used  in  conjunction  with 
monocular  and  stereo  scene  models  generated  from 
higher  resolution  panchromatic  aerial  imagery. 

Finally, two new research areas were presented:
refinement of coarse multispectral classmaps and the
fusion of surface material classmaps with stereo
disparity images. These two areas show that
multispectral imagery can provide significant
additional scene information to support cartographic
applications using image understanding systems and
techniques.

In recent years, United States government
organizations and professional remote sensing societies
have proposed multispectral mapping satellite systems
to provide high spatial resolution (10 meter or finer)
stereoscopic  and  multispectral  coverage  of  the  Earth’s 
surface  [Colvocoresses  90,  Light  90].  We  believe  that 
the  ability  to  collect  multispectral  imagery  at  these 
spatial  resolutions  provides  the  scale  needed  to  enable 
the  application  of  spatial  and  structural  analysis  to 
complex  scenes  in  urban/suburban  environments. 
Abstract

We present a method of recovering shape from shading
that solves directly for the surface height. By using a
discrete formulation of the problem, we are able to achieve
good convergence behavior by employing numerical
solution techniques more powerful than gradient descent
methods derived from variational calculus. Because we
directly solve for height, we avoid the problem of finding
an integrable surface maximally consistent with surface
orientation. Furthermore, since we do not need
additional constraints to make the problem well posed, we
use a smoothness constraint only to drive the system
towards a good solution; the weight of the smoothness
term is eventually reduced to near zero. Also, by solving
directly for height, we can use stereo processing to
provide initial and boundary conditions. Our shape from
shading technique, as well as its relation to stereo, is
demonstrated on both synthetic and real imagery.

1  Introduction 

The problem of extracting shape from the shaded
image of a surface has received considerable attention; an
excellent survey is presented in Horn (1990). However,
the computation of shape from shading has been
typically characterized as that of finding surface orientation,
rather than surface height. Converting orientation
information into height, or integrating shading methods
with other techniques for determining shape, has been
less well considered.

In this paper we develop a direct method of
computing height from shading. Solving for height, as opposed
to orientation, has at least two advantages. First, we
do not need to include additional constraints to ensure
integrability; any solution necessarily corresponds to a
real surface. Second, a formulation expressed in terms
of height is more naturally integrated with other
methods of recovering shape, such as stereo processing.

*The work reported here and the use of the Connection
Machine(tm) were partially supported by the Defense Advanced
Research Projects Agency. A variant of this paper has been
published (Leclerc and Bobick 1991).


We first derive a discrete formulation of the shape
from shading problem, and present a solution method
that uses a continually decreasing smoothness term to
drive the system to a good solution. Specifically, we
find the smoothest surface giving rise to the input
image. A simple extension allows us to solve for albedo
and light source direction. Next, we consider the
integration of this height from shading technique with other
methods of determining shape. In particular, we
employ stereo processing to provide initial conditions for
the shading analysis. We also note that stereo and
shading are complementary techniques: regions in the image
where stereo fails because of the lack of interesting visual
events are good candidate regions for shading analysis.
We demonstrate our approach on both synthetic and real
imagery.

2  Orientation  versus  Height 

2.1  Recovering  Orientation 

The basic assumption underlying all approaches to shape
from shading is the image irradiance equation:

$$I(x,y) = R(\mathbf{n}(x,y)) \qquad (1)$$

which states that image intensity $I$ at a point $(x, y)$ is
a function $R$ of the surface normal $\mathbf{n}$ at the point on a
surface that projects to $(x, y)$ in the image. Note that
the function $R$ typically contains other variables such as
viewer direction, light source direction, and albedo, all of
which are typically assumed to be known. We refer to the
image irradiance equation as an assumption because it
assumes that only the surface orientation, not the surface
position, determines the intensity of the reflected light.

The image irradiance equation allows us to
characterize the shape from shading problem as finding a surface
$z(x, y)$ such that the surface normals satisfy the
equation. However, because Equation 1 is expressed in terms
of surface orientation, and not surface height, most
formulations of the shape from shading problem have
focused on recovering surface orientation at each point. If
we specify surface orientation using the parameters $(p, q)$
representing $(z_x, z_y)$, the first derivatives of $z$ with respect
to $x$ and $y$, then we can write the image irradiance
equation as:

$$I(x,y) = R(p(x,y), q(x,y))$$

and we can express the shape from shading problem as
solving for two functions, $p(x,y)$ and $q(x,y)$, such that
the irradiance equation holds. Doing so, however, gives
rise to two fundamental difficulties. First, the problem
is highly underconstrained. For each point $(i,j)$ in an
image there is one data point but two unknowns,
$p(i,j)$ and $q(i,j)$. The clearest example of this lack of
constraint is shown in Horn (1990). Additional
constraints, such as the smoothness of the orientations, are
required to select a particular solution. Second,
arbitrary functions $p(x,y)$ and $q(x,y)$ will not, in general,
correspond to orientations of some continuous and
differentiable surface $z(x, y)$. To do so it must be the case
that the cross derivatives are equal: $p_y = q_x$. Often,
additional processing is required to generate a surface
satisfying this constraint (e.g., (Frankot and Chellappa
1988)).
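
The integrability condition can be checked numerically. The following sketch is an illustration added here, not part of the original method: it uses forward differences on an integer-valued grid, so that the orientations derived from a real surface satisfy the discrete condition exactly. The grid layout (rows as $y$, columns as $x$) and function names are assumptions.

```python
def gradients(z):
    """Forward differences: p approximates z_x, q approximates z_y."""
    rows, cols = len(z), len(z[0])
    p = [[z[i][j + 1] - z[i][j] for j in range(cols - 1)] for i in range(rows)]
    q = [[z[i + 1][j] - z[i][j] for j in range(cols)] for i in range(rows - 1)]
    return p, q


def integrability_residual(p, q):
    """Largest |p_y - q_x| over the grid; zero for a consistent (p, q)."""
    worst = 0
    for i in range(len(q)):           # rows - 1 interior rows
        for j in range(len(p[0])):    # cols - 1 interior columns
            p_y = p[i + 1][j] - p[i][j]
            q_x = q[i][j + 1] - q[i][j]
            worst = max(worst, abs(p_y - q_x))
    return worst


# Orientations derived from a real surface z = x^2 + x*y are integrable.
z = [[x * x + x * y for x in range(4)] for y in range(3)]
p, q = gradients(z)
from_surface = integrability_residual(p, q)

# An arbitrary pair (p = 0, q = x) has q_x = 1 but p_y = 0 everywhere.
p_bad = [[0] * 3 for _ in range(3)]
q_bad = [[x for x in range(4)] for _ in range(2)]
arbitrary = integrability_residual(p_bad, q_bad)
print(from_surface, arbitrary)
```

The second pair of functions cannot be the gradient of any surface, which is the situation that additional processing must repair when orientation is solved for directly.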


2.2  Recovering  Height 

A more direct approach to recovering shape from
shading is to directly find a surface $z(x,y)$ that minimizes
the photometric error. Doing so removes the problem
of finding an integrable surface: the recovered function
$z(x, y)$ is a real surface whose surface normals accurately
predict the image intensities. The direct recovery of
surface height forms the basis of the approach here.

We note that the direct recovery of height has been
proposed before by Horn and Brooks (1989) but
dismissed as computationally divergent. We believe that
this result is due to the method of computation: by
invoking the calculus of variations they derive a
computational scheme equivalent to gradient descent in many
variables. This approach is known to have poor
numerical convergence properties. The method we present in
the next sections demonstrates a discrete method for
directly recovering height from shading.

Recently, Horn (1990) developed an approach that
considered solving for three functions simultaneously:
z(x,y) was added to the functions p(x,y) and q(x,y).
The objective function includes a term ((z_x - p)^2 + (z_y - q)^2)
which drives the three functions z, p, and q to approximately
represent the same real surface. However,
the recovered surface z never exactly corresponds to the
orientations (p, q) used to compute the photometric error.
In the following section, we present a method in
which p and q are derived from z, thereby eliminating
this source of error.


3  Height  from  Shading 

3.1  Discrete  Formulation 

Recently,  most  formulations  of  the  shape  from  shading 
problem  have  been  expressed  as  a  problem  in  the  calcu¬ 
lus  of  variations  (Horn  and  Brooks  1989).  In  this  view, 
the  task  of  shape  from  shading  is  one  of  recovering  a 
function  that  minimizes  a  functional.  Thus  for  the  case 
of  height  from  shading,  one  would  seek  to  recover  the 
function  z(x,y).  The  attraction  of  this  approach  is  that 
it  provides  an  elegant  framework  in  which  to  describe  the 
shading  problem  and  in  which  to  derive  necessary  condi¬ 
tions  that  constrain  the  solutions.  In  particular,  Euler’s 
equation  expresses  a  necessary  condition  in  terms  of  the 
derivatives  of  the  function  z(x,y).  This  condition  can 
then  be  manipulated  into  an  iterative  solution  method. 

Unfortunately, the iteration equations derived from
Euler's equation yield methods equivalent to gradient
descent. For example, the difference equations in Brooks
and Horn (1985) can be derived by taking the gradient
of their objective function with respect to each z_ij. Gradient
descent algorithms are known to have poor convergence
properties for systems of many variables. To avoid
this problem we formulate a shape from shading objective
function in a purely discrete manner, permitting us
to consider the problem as one of ordinary calculus in
many variables (see (Szeliski 1991)). This approach allows
the use of numerical solution methods with better
convergence properties.

Here  we  present  our  discrete  formulation  of  an  objec¬ 
tive  function  to  be  minimized  in  solving  the  shape  from 
shading  problem.  As  mentioned  in  the  previous  section, 
our  method  of  computing  height  from  shading  initially 
drives  the  computation  using  a  smoothness  term.  Thus 
our  objective  function  contains  a  smoothness  term  along 
with  a  photometric  error  term: 

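E = \sum_{i,j} \left[ (1-\lambda)\,(R(p_{ij}, q_{ij}) - I_{ij})^2 + \lambda\,(u_{ij}^2 + v_{ij}^2) \right]    (2)

where p, q, u, v are not independent variables but are defined
as the symmetric first and second finite differences
of the variables z = (z_ij):

p_{ij} = \frac{1}{2}(z_{i+1,j} - z_{i-1,j})
q_{ij} = \frac{1}{2}(z_{i,j+1} - z_{i,j-1})
u_{ij} = z_{i+1,j} - 2z_{ij} + z_{i-1,j}
v_{ij} = z_{i,j+1} - 2z_{ij} + z_{i,j-1}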

λ represents a continuation parameter, 0 ≤ λ ≤ 1, that is
gradually decreased to near zero. The particular smoothness
term above represents deviation from a plane, and
may in fact be too restrictive for some cases, even for
small λ. We are currently considering a smoothness term
that measures variation in curvature as opposed to orientation.

Note that so far we have not discussed the reflectance
function R; to make our derivation explicit it is essential
to choose some particular R. The results presented here
employ a Lambertian shading model where:

R_{ij} = R(p_{ij}, q_{ij}) = n_{ij} \cdot l = \frac{-a p_{ij} - b q_{ij} + c}{\sqrt{1 + p_{ij}^2 + q_{ij}^2}}

where n_ij is the unit vector surface normal, and l =
(a, b, c) is the unit light source vector scaled by the
albedo. For now we assume the scaled light source vector
is known; in the next section we consider solving for l.
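The objective function can be made concrete with a short Python
sketch. It is illustrative only: it assumes the surface normal is
proportional to (-p, -q, 1), sums over interior pixels (the treatment
of image borders is not specified above), and uses hypothetical
function names `shade` and `objective`.

```python
import math

def shade(z, light):
    """Lambertian image R_ij = n_ij . l at interior pixels (None on the border)."""
    a, b, c = light
    n, m = len(z), len(z[0])
    img = [[None] * m for _ in range(n)]
    for i in range(1, n - 1):
        for j in range(1, m - 1):
            p = 0.5 * (z[i + 1][j] - z[i - 1][j])   # symmetric first differences
            q = 0.5 * (z[i][j + 1] - z[i][j - 1])
            img[i][j] = (-a * p - b * q + c) / math.sqrt(1.0 + p * p + q * q)
    return img

def objective(z, image, light, lam):
    """E = sum of (1 - lam) * (R - I)^2 + lam * (u^2 + v^2) over interior pixels."""
    r = shade(z, light)
    n, m = len(z), len(z[0])
    total = 0.0
    for i in range(1, n - 1):
        for j in range(1, m - 1):
            u = z[i + 1][j] - 2.0 * z[i][j] + z[i - 1][j]   # second differences
            v = z[i][j + 1] - 2.0 * z[i][j] + z[i][j - 1]
            total += (1.0 - lam) * (r[i][j] - image[i][j]) ** 2
            total += lam * (u * u + v * v)
    return total

if __name__ == "__main__":
    plane = [[0.1 * i + 0.2 * j for j in range(6)] for i in range(6)]
    flat = [[0.0] * 6 for _ in range(6)]
    light = (0.3, 0.2, 0.9)
    image = shade(plane, light)
    print("true surface:", objective(plane, image, light, 0.5))
    print("flat surface, photometric term only:", objective(flat, image, light, 0.0))
```

A plane incurs no smoothness penalty, since both second differences
vanish, so for the image of a plane the true surface gives E = 0 for
every λ.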

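Given the objective function expressed in Equation 2,
we can derive the gradient of E: a vector whose elements
are the partial derivatives of E with respect to
the state variables z_ij. First, define N_ij and \sqrt{D_ij} to
be the numerator and denominator of R_ij above. Then,
the elements of the gradient are

\frac{\partial E}{\partial z_{ij}} = (1-\lambda) \Big[ (R_{i-1,j} - I_{i-1,j}) \frac{\partial R}{\partial p}\Big|_{i-1,j} - (R_{i+1,j} - I_{i+1,j}) \frac{\partial R}{\partial p}\Big|_{i+1,j}
    + (R_{i,j-1} - I_{i,j-1}) \frac{\partial R}{\partial q}\Big|_{i,j-1} - (R_{i,j+1} - I_{i,j+1}) \frac{\partial R}{\partial q}\Big|_{i,j+1} \Big]
    + 2\lambda \left( -2u_{ij} - 2v_{ij} + u_{i+1,j} + u_{i-1,j} + v_{i,j+1} + v_{i,j-1} \right)    (3)

where \partial R / \partial p = -(a D + N p) / D^{3/2} and
\partial R / \partial q = -(b D + N q) / D^{3/2}, each evaluated
at the indicated pixel.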

Explicit  derivation  of  the  gradient  is  essential  if  we  are  to 
make  use  of  more  powerful  numerical  methods  for  mini¬ 
mizing  the  objective  function,  as  discussed  in  Section  4. 
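A derived gradient of this kind is easy to get wrong, so it is worth
checking against finite differences. The Python sketch below is an
illustration under stated assumptions, not the authors' code: it
assumes a normal proportional to (-p, -q, 1), and it wraps indices
around the grid edges (a periodic grid, chosen here only so that the
same expression applies at every pixel). All function names are
hypothetical.

```python
import math

def slopes(z, light, i, j):
    """Return p, q, N, D at pixel (i, j); indices wrap (assumed periodic grid)."""
    n, m = len(z), len(z[0])
    i, j = i % n, j % m
    a, b, c = light
    p = 0.5 * (z[(i + 1) % n][j] - z[(i - 1) % n][j])
    q = 0.5 * (z[i][(j + 1) % m] - z[i][(j - 1) % m])
    return p, q, -a * p - b * q + c, 1.0 + p * p + q * q

def curvatures(z, i, j):
    """Second differences u, v at pixel (i, j), indices wrapping."""
    n, m = len(z), len(z[0])
    i, j = i % n, j % m
    u = z[(i + 1) % n][j] - 2.0 * z[i][j] + z[(i - 1) % n][j]
    v = z[i][(j + 1) % m] - 2.0 * z[i][j] + z[i][(j - 1) % m]
    return u, v

def energy(z, image, light, lam):
    """Objective E summed over every pixel of the periodic grid."""
    total = 0.0
    for i in range(len(z)):
        for j in range(len(z[0])):
            p, q, N, D = slopes(z, light, i, j)
            u, v = curvatures(z, i, j)
            total += (1.0 - lam) * (N / math.sqrt(D) - image[i][j]) ** 2
            total += lam * (u * u + v * v)
    return total

def grad_entry(z, image, light, lam, i, j):
    """Analytic dE/dz_ij: photometric part plus smoothness part."""
    n, m = len(z), len(z[0])
    a, b, _ = light

    def err_times_dR(k, l, wrt_p):
        # (R - I) * dR/dp, or (R - I) * dR/dq, at pixel (k, l)
        k, l = k % n, l % m
        p, q, N, D = slopes(z, light, k, l)
        coef, slope = (a, p) if wrt_p else (b, q)
        dR = -(coef * D + N * slope) / D ** 1.5
        return (N / math.sqrt(D) - image[k][l]) * dR

    photo = (err_times_dR(i - 1, j, True) - err_times_dR(i + 1, j, True)
             + err_times_dR(i, j - 1, False) - err_times_dR(i, j + 1, False))
    u0, v0 = curvatures(z, i, j)
    smooth = (-2.0 * u0 - 2.0 * v0
              + curvatures(z, i + 1, j)[0] + curvatures(z, i - 1, j)[0]
              + curvatures(z, i, j + 1)[1] + curvatures(z, i, j - 1)[1])
    return (1.0 - lam) * photo + 2.0 * lam * smooth

if __name__ == "__main__":
    z = [[0.3 * math.sin(0.7 * i) * math.cos(0.5 * j) for j in range(6)] for i in range(6)]
    img = [[0.5 + 0.1 * math.cos(i + j) for j in range(6)] for i in range(6)]
    light, lam, h = (0.2, 0.1, 0.9), 0.3, 1e-5
    zp = [row[:] for row in z]; zp[2][3] += h
    zm = [row[:] for row in z]; zm[2][3] -= h
    numeric = (energy(zp, img, light, lam) - energy(zm, img, light, lam)) / (2 * h)
    print(numeric, grad_entry(z, img, light, lam, 2, 3))
```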

3.2  Source  Direction  and  Albedo 

Though the above discussion assumes a known light
source direction and albedo l = (a, b, c), we can also consider
minimizing E with respect to these parameters, as
did Brooks and Horn (1985) within their variational formulation.
In fact, the processing of real imagery usually
requires such estimation since light source direction and
albedo are rarely known accurately. Furthermore, one
can show that the objective function is highly sensitive
to errors in albedo. To avoid this problem we explicitly
solve for the albedo as well as the light source direction.

Following our approach we need to construct the gradient
of E with respect to l:

\frac{\partial E}{\partial l} = 2(1-\lambda) \sum_{i,j} (R_{ij} - I_{ij})\, n_{ij}    (4)

Using these equations (one for each of a, b, and c) we can
simply consider the scaled light source vector as just
another set of variables; doing so has in practice yielded
good results (see Sections 5 and 6). Note that, because
R_ij = n_ij · l is linear in l, setting the above to zero gives
a set of three simultaneous linear equations in a, b, and c,
which could be solved directly when z is known.
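The direct solution for the scaled light source vector can be
sketched as follows. This Python example is an illustration under
assumptions: it takes the normal proportional to (-p, -q, 1), uses
interior pixels only, and solves the 3 x 3 normal equations
(sum of n n^T) l = sum of I n by Gaussian elimination. The function
names are hypothetical.

```python
import math

def normals(z):
    """Unit normals at interior pixels, keyed by (i, j)."""
    out = {}
    for i in range(1, len(z) - 1):
        for j in range(1, len(z[0]) - 1):
            p = 0.5 * (z[i + 1][j] - z[i - 1][j])
            q = 0.5 * (z[i][j + 1] - z[i][j - 1])
            s = math.sqrt(1.0 + p * p + q * q)
            out[(i, j)] = (-p / s, -q / s, 1.0 / s)
    return out

def solve3(A, y):
    """Solve a 3x3 linear system by Gaussian elimination with partial pivoting."""
    M = [row[:] + [y[k]] for k, row in enumerate(A)]
    for col in range(3):
        piv = max(range(col, 3), key=lambda r: abs(M[r][col]))
        M[col], M[piv] = M[piv], M[col]
        for r in range(col + 1, 3):
            f = M[r][col] / M[col][col]
            for k in range(col, 4):
                M[r][k] -= f * M[col][k]
    x = [0.0] * 3
    for r in (2, 1, 0):
        x[r] = (M[r][3] - sum(M[r][k] * x[k] for k in range(r + 1, 3))) / M[r][r]
    return x

def estimate_light(z, image):
    """Least-squares estimate of l = (a, b, c) for a known height map z."""
    A = [[0.0] * 3 for _ in range(3)]
    y = [0.0] * 3
    for (i, j), nvec in normals(z).items():
        for r in range(3):
            y[r] += image[i][j] * nvec[r]
            for c in range(3):
                A[r][c] += nvec[r] * nvec[c]
    return solve3(A, y)

if __name__ == "__main__":
    z = [[math.sin(0.5 * i) + math.cos(0.4 * j) for j in range(8)] for i in range(8)]
    true_l = (0.2, -0.1, 0.8)
    nrm = normals(z)
    image = [[sum(t * c for t, c in zip(true_l, nrm[(i, j)])) if (i, j) in nrm else 0.0
              for j in range(8)] for i in range(8)]
    print(estimate_light(z, image))
```

The system is only well conditioned when the surface contains enough
variation in orientation; for a plane all normals coincide and the
3 x 3 matrix is singular.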

3.3  Existence  and  Uniqueness 

Whether we formulate shape from shading as a problem
of differential equations or as one of minimization, we
should consider the issue of the existence of a solution
and whether a solution is unique. Results in the continuous
domain (Blake et al. 1985, Bruss 1982) have shown a
unique solution to the shape from shading problem under
the restrictive conditions that the light source direction
is equal to the viewing direction, boundary conditions
are completely specified, and that the data are noise free.
Recently, Oliensis (1991) showed that if there is at least
one visible point on the surface whose normal is parallel
to the light source, then there is a unique surface corresponding
to the image. With respect to existence, Horn,
Szeliski, and Yuille (1990) recently demonstrated some
impossible shaded images that cannot be generated by
shading a surface with continuous first derivatives.

However,  any  computational  solution  to  the  shape 
from  shading  problem  has  three  characteristics  that 
make  existence  and  uniqueness  difficult  to  consider. 
First,  using  a  discrete  formulation  with  quantized  vari¬ 
ables  makes  some  continuous  analysis  inapplicable.  For 
example,  boundary  conditions  no  longer  can  propagate 
arbitrarily  far.  Second,  real  images  must  be  considered 
as  discrete  samplings  of  underlying  continuous  images, 
and  the  method  of  computing  discrete  derivatives  can 
greatly  affect  predicted  image  intensities  for  a  given  z. 
Third,  and  perhaps  most  important,  real  images  have 
noise, whether it is real noise induced by the imaging
system,  quantization  of  image  intensities,  or  deviations 
from  the  assumed  reflectance  function.  In  such  circum¬ 
stances,  it  is  usually  possible  to  construct  multiple  so¬ 
lutions  whose  objective  function  measures  are  approxi¬ 
mately  equal. 

An example of the difficulties inherent in discrete formulations
is shown in Figure 1, which displays Lambertian
shaded images of two synthetic surfaces, illuminated
from two different light source directions. The top
row shows a pair of intersecting hemispheres illuminated
from the northeast and the northwest. The bottom row
shows a rough oscillating surface. When shaded in exactly
the same northeast direction as the hemispheres,
the wrinkled surface yields an image quite similar to the
image above it, with an average difference of less than
0.1%. Thus, if the top left image is taken as a data image,
then the surface of the bottom row will produce a
low photometric error, and hence a low value for E in
Equation 2 whenever the smoothness weight λ is near
zero. The space of these undesired solutions is determined
by the definitions of the discrete derivatives of z.
Note that these solutions are real surfaces and that the
ambiguity cannot be attributed to the nonintegrability
of surface orientations.

Figure 1: Demonstration of the nonuniqueness of shape
from shading when there are no boundary conditions, allowing
for a small, but nonzero, photometric error. These, and
all other synthetic images in this paper, are produced by
shading a height map, z = (z_ij). (a,b) Two images of a surface
consisting of two bumpy hemispheres, shaded from the
northeast and northwest. (c,d) Two images of a very different,
wrinkled surface, also shaded from the northeast and
northwest. When shaded from the northeast, this surface
looks the same as the hemispheres; (a) and (c) have an average
absolute difference in intensity of less than 0.1%. When
shaded from the northwest, however, we see the tremendous
differences between the two. In short, very different surfaces
can look the same under identical viewing conditions.

The implication of this demonstration is that when
computing a surface that satisfies a shaded image, additional
constraints must be imposed by the solution
method. In the sections that follow we drive the system
with an initial smoothness term in order to select
as smooth a surface as possible with low photometric
error. Whether this is the preferred solution is unclear:
are smooth surfaces preferable? We are currently investigating
using a general position view of preference to
select the solution whose shading changes the least with
movement of the light source direction, or with a change
in the definition of discrete derivatives.

One should be aware that the question of uniqueness
becomes even harder to consider if the light source
and albedo are allowed to vary. For example, there is
a concavity/convexity ambiguity depending on whether
the light source is seen as coming from above or below.
Within an optimization framework, a poor choice of solution
method and initial condition can yield multiple
solutions. For complicated synthetic or realistic images
where there are enough variations in orientation we have
not seen multiple solutions for the light source direction.

4  Solution  Method 

Our goal is to find a surface that is consistent with the
shaded image. When there is no noise, this means finding
a surface that incurs zero photometric error. However,
as seen in Figure 1, there are typically many surfaces
that have E ≈ 0. Which of these surfaces should we
choose? Our criterion is to choose the smoothest such
surface.

To find the smoothest surface that incurs zero photometric
error, we use a continuation method (Leclerc
1989), whereby we begin by finding a local minimum of
Equation 2 for λ = 1. Once a minimum is found, λ is
decreased, and a search for a minimum of this new objective
function is begun, using the previous minimum
as initial condition. This procedure is repeated until λ
is sufficiently close to zero. When there is no noise, the
theoretical limit for λ is exactly zero. In the presence of
noise, the appropriate final value for λ depends on the
extent of the noise; for nonzero λ, the global minimum
value of E is, in general, also nonzero.
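The continuation procedure amounts to a short outer loop around
whatever minimizer is used. The Python sketch below is illustrative
only: `continuation` and its `minimize(z, lam)` callback are
hypothetical names, and the default schedule (λ from 1 down to 1/16 in
steps of 1/√2) is just one possible choice. The toy objective
(1 - λ)(z - 3)^2 + λ z^2 stands in for Equation 2 and has the
closed-form minimizer z = 3(1 - λ).

```python
def continuation(minimize, z0, lam_start=1.0, lam_end=1.0 / 16, factor=2 ** -0.5):
    """Minimize for a decreasing sequence of lambda, warm-starting each step.

    `minimize(z, lam)` must return a (local) minimizer of the objective for
    the given lambda, started from z.
    """
    z, lam, history = z0, lam_start, []
    # The small tolerance keeps the final step despite floating-point rounding.
    while lam >= lam_end * (1.0 - 1e-9):
        z = minimize(z, lam)      # previous minimum is the new initial condition
        history.append(lam)
        lam *= factor
    return z, history

if __name__ == "__main__":
    # Toy stand-in for the real minimizer: closed-form minimum of
    # (1 - lam) * (z - 3)^2 + lam * z^2, which ignores the starting point.
    toy_minimize = lambda z, lam: 3.0 * (1.0 - lam)
    z_final, lams = continuation(toy_minimize, 0.0)
    print(len(lams), lams[-1], z_final)
```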

Our solution method implements the standard conjugate
gradient algorithm FRPRMN (from (Press et al.
1986)) in conjunction with the line search algorithm
DBRENT as an iterative minimization technique. This
algorithm simply requires the construction of two functions:
one that computes the value of E, Equation 2, and
one that computes the gradient of E with respect to the
state vector z, Equation 3. To impose boundary conditions,
i.e., fix some elements of z to their known correct
values, we define the gradient to be zero at those points.
We use a conjugate gradient technique rather than simpler
gradient descent algorithms because the former is a
much more efficient technique for optimizing functions
of many variables (Szeliski 1991).

Figure 2: Result using hierarchical continuation method on
a similar input image to that of Figure 1, without boundary
conditions. (a) Input image of true surface. (b) Image of
true surface illuminated with light source at 90° from (a).
(c) Image of recovered surface using original light source
direction. (d) Image of recovered surface using rotated light
source.
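The way boundary conditions enter the minimization can be sketched
in a few lines of Python. This is not the FRPRMN/DBRENT code: it is a
minimal Polak-Ribiere conjugate gradient with a simple backtracking
line search substituted for DBRENT, and the function name and the
`fixed` argument are hypothetical. Fixed variables are held at their
initial values simply by zeroing the corresponding gradient entries.

```python
def conjugate_gradient(f, grad, x0, fixed=(), iters=500, tol=1e-10):
    """Polak-Ribiere nonlinear CG; indices in `fixed` keep their initial value."""
    fixed = set(fixed)

    def masked_grad(x):
        # Boundary conditions: the gradient is defined to be zero at fixed points.
        return [0.0 if k in fixed else gk for k, gk in enumerate(grad(x))]

    x = list(x0)
    g = masked_grad(x)
    d = [-gk for gk in g]
    for _ in range(iters):
        gg = sum(gk * gk for gk in g)
        if gg < tol:
            break
        slope = sum(gk * dk for gk, dk in zip(g, d))
        if slope >= 0.0:                      # not a descent direction: restart
            d = [-gk for gk in g]
            slope = -gg
        # Backtracking (Armijo) line search, a simple stand-in for DBRENT.
        step, fx = 1.0, f(x)
        while (f([xk + step * dk for xk, dk in zip(x, d)]) > fx + 1e-4 * step * slope
               and step > 1e-16):
            step *= 0.5
        x = [xk + step * dk for xk, dk in zip(x, d)]
        g_new = masked_grad(x)
        beta = max(0.0, sum(gn * (gn - gk) for gn, gk in zip(g_new, g)) / gg)
        d = [-gn + beta * dk for gn, dk in zip(g_new, d)]
        g = g_new
    return x

if __name__ == "__main__":
    f = lambda x: (x[0] - 1) ** 2 + 2 * (x[1] + 2) ** 2 + (x[2] - 0.5) ** 2
    g = lambda x: [2 * (x[0] - 1), 4 * (x[1] + 2), 2 * (x[2] - 0.5)]
    # The third variable is held at its initial value, like a boundary height.
    print(conjugate_gradient(f, g, [0.0, 0.0, 0.0], fixed=[2]))
```

Because the search direction is built only from masked gradients, its
entries at the fixed indices stay zero throughout, so no separate
projection step is needed.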

Given  that  the  solution  found  at  the  first  step  is 
very  smooth,  we  would  expect  that  solutions  found  at 
each  subsequent  step  would  also  be  relatively  smooth, 
within  the  constraints  imposed  by  the  image  data.  Al¬ 
though  we  cannot  guarantee  that  the  solution  found  for 
a small λ is indeed the smoothest surface possible, experiments
demonstrate that the solutions recovered are
indeed  smooth  surfaces  incurring  small  photometric  er¬ 
ror. 

Figures 1 and 2 illustrate the above approach. For
Figure 1, the system began with λ = 1/16, while for
Figure 2, the system started with λ = 1. In both cases,
the average photometric error is very small; the average
absolute difference between the images of Figures 1(a)
and 1(c) and between those of 2(a) and 2(c) is less than
0.1%. However, the recovered surface for Figure 2 is
much smoother, as we can see by the image in the lower
right. Indeed, the recovered surface corresponds to that
of the true surface (the surface used to generate the original
synthetic image) to within less than 2% average absolute
difference. In this particular example, no boundary
conditions were imposed, yet the solution recovered
is correct. The conjugate gradient algorithm guarantees
that the system is stable whether or not boundary conditions
are imposed. We are investigating the conditions
under which boundary conditions are required to recover
the correct solution.

Another difference in the processing of Figures 1 and
2 is that the latter was produced using a hierarchical
technique (Terzopoulos 1983). Specifically, the original
image is initially blurred and subsampled from its initial
resolution, in this case 64 x 64, down to a resolution of
8 x 8. The system begins with λ = 1, which is progressively
reduced to 1/16, by a factor of 1/√2 at each step. The resultant
surface is then bilinearly interpolated to the next higher
resolution, 16 x 16, and the process begun anew, but
with starting and ending λ equal to half of those of the
previous resolution. At each stage, the input image is
an appropriately blurred and subsampled version of the
original image. This procedure is repeated to the full
resolution of the initial image. Even though blurring
and subsampling the original image does not generate
exactly the same image as shading the blurred and subsampled
surface (Ron and Peleg 1989), it appears to be
a sufficiently good starting point for the minimization.
Using the hierarchy can save at least a factor of two in
computation time.
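The coarse-to-fine schedule described above can be written down
explicitly. In the Python sketch below, the function name and
defaults are hypothetical, and the full resolution of 64 is an
assumption; each level doubles the resolution and halves both the
starting and ending λ.

```python
def hierarchy_schedule(full_res=64, coarse_res=8, lam_start=1.0, lam_end=1.0 / 16):
    """Return (resolution, starting lambda, ending lambda) for each level."""
    levels = []
    res = coarse_res
    while res <= full_res:
        levels.append((res, lam_start, lam_end))
        res *= 2            # next higher resolution
        lam_start /= 2.0    # half of the previous level's starting lambda
        lam_end /= 2.0      # half of the previous level's ending lambda
    return levels

if __name__ == "__main__":
    for res, start, end in hierarchy_schedule():
        print(f"{res} x {res}: lambda from {start} down to {end}")
```

Under these assumptions the finest level runs λ from 1/8 down to
1/128, so the smoothness term is already weak by the time the full
resolution image is used.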

Finally, to recover albedo and light source direction,
we have employed two strategies. The most direct is
to incorporate the light source direction parameters into
the state variable vector and to solve for z and l simultaneously
within the conjugate gradient framework. In
this case, it is necessary to scale the contribution of the
gradient elements of Equation 4 by the number of pixels
in the image to make their magnitude commensurate
with the other elements of the gradient vector. This procedure
has yielded good results for many synthetic images;
however, we have found cases where the solution
is trapped in poor local minima. The second technique
is to use an a priori surface estimate, generated by some
other method such as stereo, and to minimize the objective
function by varying only the light source parameters.
After finding the best solution for the light source, those
parameters are fixed, and z is then allowed to vary.


5  Stereo  and  Shading 

To this point we have developed a method for extracting
shape from shading that directly operates on heights z
of a surface. As mentioned, one of the advantages of this
approach is that it allows for the incorporation of other
methods of determining surface height into the shape recovery
process. In this section we develop the relationship
between stereo processing and shape from shading;
it is our belief that the two methods are inherently complementary
and will function much more effectively in
an integrated fashion.

The  most  obvious  connection  between  shading  and 
stereo  is  that  stereo  is  an  explicit  method  for  providing 
initial  and  boundary  conditions  for  the  shading  problem. 
We  believe  stereo  is  particularly  appropriate  for  this  task 
because  stereo  is  sensitive  to  variations  in  height,  not  ori¬ 
entation.  This  condition  results  in  the  linear  decrease  of 
the  discriminability  of  relative  heights  as  absolute  dis¬ 
tance  increases.  Shading,  however,  reflects  change  in 
orientation,  and  thus  does  not  lose  discrimination  power 
with  increasing  absolute  distance.  Given  stereo  informa¬ 
tion  to  coarsely  describe  surface  shape,  shading  analysis 
can  solve  for  finer  surface  variations. 

We  demonstrate  the  ability  of  stereo  to  provide  ini¬ 
tial  conditions  for  shading  processing  in  Figures  3  and 
4.  The  top  left  images  (3a  and  4a)  are  each  one  half  of 
a  stereo  pair,  where  regions  of  significant  albedo  varia¬ 
tion  (eyes  and  nostril  in  Figure  3  and  the  background 
in  both)  have  been  manually  removed.  Displayed  at  the 
top  right  (3b  and  4b)  are  the  stereo  reconstructed  sur¬ 
faces,  computed  using  Fua’s  stereo  algorithm  (Fua  1991) 
(people  acquainted  with  the  subject  in  Figure  3  could 
not  recognize  him  from  this  image).  Using  the  stereo 
reconstructed  surfaces  as  initial  conditions,  our  shading 
algorithm  recovered  the  surfaces  displayed  in  the  bottom 
two  images  of  each  figure.  The  images  on  the  bottom 
left  (3c  and  4c)  are  the  recovered  surfaces  shaded  from 
the  same  direction  as  the  original  images.  For  Figure  3, 
we  first  used  the  stereo  solution  to  solve  for  the  light 
source  direction  and  albedo,  and  then  allowed  the  sur¬ 
face  to  vary.  For  Figure  4,  we  measured  the  light  source 
direction  in  the  laboratory.  The  bottom-right  images 
(3d  and  4d)  display  the  recovered  surfaces  shaded  from 
a  different  direction.  Though  some  creases  invisible  to 
the  light  source  direction  have  been  introduced,  the  sur¬ 
faces  capture  most  of  the  important  shape  characteris¬ 
tics  including  wrinkles  in  the  forehead  of  Figure  3.  The 
introduction  of  creases  is  most  likely  caused  by  using 
a  smoothness  term  that  prefers  planar  patches;  creases 
give  large  patches  of  no  smoothness  penalty  plus  thin 
ridges  of  high  penalty.  Even  though  the  weight  of  the 
smoothness  term  is  eventually  reduced  to  near  zero,  the 
system  can  no  longer  remove  the  creases.  A  higher  order 
smoothness  term  may  solve  this  problem. 

Aside  from  providing  initial  and  boundary  conditions, 
there  is  a  much  deeper  relationship  between  stereo  and 
shading,  and  that  relationship  is  derived  from  the  con¬ 
ditions  under  which  each  of  the  two  methods  operate 
best.  Recovery  of  distance  information  from  stereo  pro¬ 
cessing  requires  the  ability  to  make  accurate  matches  be¬ 
tween  corresponding  pixels  in  a  stereo  pair.  Such  accu¬ 
rate  matches  can  occur  only  where  there  are  significant 
events  in  the  image  intensities  that  can  disambiguate 
pixel  matches.  Such  events  include  discontinuities  in  sur¬ 
face  orientation  and  albedo,  such  as  material  boundaries 
and  resolvable  textures.  Where  no  such  events  occur, 


all  stereo  algorithms  determine  surface  structure  by  in¬ 
terpolation,  whether  explicitly  (e.g.  (Grimson  1982))  or 
implicitly  by  choice  of  objective  function  (e.g.  (Barnard 
1989)).  Grimson  justified  this  interpolation  in  stating 
that  “No  news  is  good  news,”  implying  that  the  lack  of 
significant  events  could  be  viewed  as  permitting  the  use 
of  some  assumed  interpolation  function. 

Shading  analysis,  however,  operates  best  in  exactly 
those  regions  of  an  image  where  stereo  processing  is 
forced to interpolate. Thus, “No news is better news”
in  the  sense  that  lack  of  significant  visual  events  may 
be  used  as  an  indication  that  shading  analysis  is  ap¬ 
propriate.  One  way  of  viewing  this  relationship  is  that 
shading  analysis  should  provide  the  interpolation  func¬ 
tion,  as  opposed  to  making  some  specific  assumptions 
about the most appropriate type of spline, e.g. fractured
thin-plates  or  membranes.  One  implication  of  this  type 
of  approach  is  that  smooth  albedo  variations  would  be 
incorrectly  interpreted  as  shape  information;  the  effec¬ 
tiveness  of  makeup  to  give  the  illusion  of  greater  depth 
variation  is  an  example  of  such  an  incorrect  interpreta¬ 
tion. 

Many  stereo  algorithms  (Fua  1991,  Hannah  1982)  al¬ 
ready  provide  a  confidence  measure  reflecting  the  degree 
of  constraint  present  in  the  pixel  matches.  We  are  cur¬ 
rently  investigating  the  use  of  this  measure  as  a  method 
for  generating  an  integrated  stereo  and  shading  algo¬ 
rithm  that  applies  each  where  best  suited.  Essential  to 
this  approach  is  a  shape  from  shading  technique  based 
on  surface  heights,  not  orientations.  The  technique  de¬ 
veloped  here  should  provide  the  necessary  basis. 

6  Summary 

We  have  presented  a  method  of  recovering  shape  from 
shading  which  solves  directly  for  the  surface  height.  By 
using  a  discrete  formulation,  we  are  able  to  employ  nu¬ 
merical  solution  methods  more  powerful  than  gradient 
descent,  giving  good  convergence  behavior.  Because 
height  is  directly  recovered,  we  avoid  the  problem  of  find¬ 
ing  an  integrable  surface  maximally  consistent  with  sur¬ 
face  orientation.  Furthermore,  since  we  do  not  need  ad¬ 
ditional  constraints  to  make  the  problem  well  posed,  we 
use  a  smoothness  constraint  only  to  drive  the  system  to  a 
good  solution:  the  smoothest  surface  that  incurs  no  pho¬ 
tometric  error.  Eventually  we  remove  the  smoothness 
term,  preventing  the  system  from  walking  away  from  the 
true  solution.  Our  solution  technique  uses  a  continua¬ 
tion  method  in  the  smoothness  parameter,  embedded  in 
a  hierarchical  conjugate  gradient  minimization  scheme. 
A  simple  extension  allows  for  the  solution  of  light  source 
direction  and  albedo  as  well  as  surface  height.  We  have 
demonstrated  this  technique  on  both  synthetic  and  real 
imagery. 

In addition, we have begun to explore the relationship
between shading and stereo. In particular, because
we have a formulation in terms of surface height we can
use stereo processing to provide initial and boundary
conditions. We note that shading and stereo are complementary
techniques for two reasons: First, relative
depth discrimination from stereo decreases with absolute
depth, whereas orientation discrimination, as determined
by shading, does not. Second, regions in the image
where stereo fails because of lack of interesting visual
events are good candidate regions for shading analysis.
One goal for future work is to use the confidence measures
of stereo systems to invoke the application of our
shading analysis and to control the balance between the
shading and stereo solutions.

Figure 3: Four faces of Oscar. (a) Grey level image of a face with regions of significant albedo change manually removed.
This image is one of a stereo pair. (b) Surface recovered by stereo processing. Note the coarse resolution. (c) Shaded
image of recovered surface when shaded using same light source direction as solved for in (a). (d) Same surface shaded from
different direction. Though it contains some creases invisible to original light source direction, the surface captures most of
the important shape characteristics, including the wrinkles in the forehead.

Figure 4: Laboratory experiment on a styrofoam mannequin spray-painted using matte white paint. Panels (a)-(d) are as
in Figure 3. In this case, the illuminant direction was measured explicitly, and did not need to be estimated from the image.

Another extension of this work is to use a piecewise
constant albedo model that allows different regions in the
image to have different and unknown albedos. An algorithm
has been implemented in which only the positions
of the albedo discontinuities are known; these discontinuities
should be recoverable using edge detection when
they are the only ones present. This algorithm has performed
well on all of our original synthetic examples of
constant albedo, and on additional synthetic images of
mottled surfaces with two different albedos, much like
zebra skin.
Abstract

In  this  paper,  we  discuss  a  technique  for  model- 
based  object  recognition  in  Synthetic  Aperture 
Radar  (SAR)  images  which  combines  projec¬ 
tive  invariants  and  deformable  templates.  The 
models  are  comprised  of  deformable  templates 
whose  configuration  is  constrained  by  projec¬ 
tive  invariants.  An  operator  outlines  an  area 
of  interest  which  provides  initial  values  for  the 
parameters  of  the  deformable  template.  The 
constraints  on  the  template  are  represented  as 
a  system  of  polynomials  derived  from  the  pro¬ 
jective  invariants.  A  non-linear  system  solver 
then  optimizes  the  alignment  of  model  features 
and  image  features  subject  to  the  constraints. 

The  technique  has  been  successful  in  locating 
geographic  features  in  aerial  SAR  images  with 
models  constructed  from  USGS  maps. 

1  Introduction 

Images  produced  by  Synthetic  Aperture  Radar  are  diffi¬ 
cult  to  interpret  -  the  resolution  is  poor,  the  usual  cues 
(shading,  shadows,  etc.)  are  missing,  and  the  appear¬ 
ance  of  an  object  can  change  radically  with  small  changes 
in  pose.  Image  analysts  have  told  us  that  in  some  cases 
it  can  be  difficult  just  to  orient  the  image  correctly.  The 
experiment  described  here  is  a  “proof  of  concept”  study 
for  the  design  of  an  automatic  model-based  object  recog¬ 
nition  module  for  an  image  analyst’s  workstation. 

2  The  Approach 

The underlying design of the system combines two
model-based techniques for object recognition - deformable
templates [Lipson et al., 1990] and projective
invariants [Forsyth et al., ] [Coelho et al., 1991]. By using
deformable templates we avoid having to make a priori
assumptions about the correspondence of model features
and image features. By constraining the allowable configurations
of the deformable templates with projective
invariants, we are assured that the configuration of the
template is a valid projection of the model.

*Work at GE was supported in part by the DARPA
Strategic Computing Vision Program and the Air Force Office
of Scientific Research under Contract No. F49620-89-C-0033
and by the DARPA Strategic Computing Vision Program
under Contract No. MDA972-91-C-0053.

2.1  Projective  Invariants 

Shape characteristics measured in images depend not
only on the intrinsic properties of an object's shape,
but also on the position and orientation of the camera
with respect to the object and the intrinsic parameters of
the camera (e.g. focal length, frame aspect ratio, etc.).
Shape descriptors which are unaffected by these projective
transformations can be constructed. Such descriptors,
known as projective invariants, can be matched to
object properties regardless of camera viewpoint and parameters.
Essentially, the use of projective invariants
factors the object's pose and the camera's imaging parameters
out of the recognition process.

In the initial implementation we have employed the
five-coplanar points invariant. Other planar invariants
suitable for this technique are five-coplanar lines (the
dual of five-coplanar points), a pencil of four lines and
the joint invariants of two conics /refpami. It is also reasonable
to exploit the invariants available from multiple
views, e.g., the invariant of nine points observed from two
arbitrary perspective views.*

Five coplanar points give rise to two invariants which
have found important applications in machine vision
[Barrett et al., 1991, Mohr and Morin, 1990] and
photogrammetry [Moffitt and Mikhail, 1980]. These
invariants have been related to the familiar cross ratio that
is obtained when the image points fall on a line [Brill and
Barrett, 1983].

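To derive these projective invariants, consider five
points on a plane. We represent these points in homogeneous
coordinates as $p_i = (x_i, y_i, z_i)$. Clearly, a change in
scale factor $p_i \rightarrow \lambda_i p_i$ cannot be
observed, and this must be taken into account in deriving
the invariants. As in the case of the cross ratio, these
invariants are ratios of scalars. Consider a square matrix
defined by three of the five points,

$$
m_{ijk} = \begin{pmatrix} x_i & x_j & x_k \\ y_i & y_j & y_k \\ z_i & z_j & z_k \end{pmatrix}
$$

Under a general projective transformation $T$, with each
point also rescaled by its own factor,

$$
M_{ijk} = T \, m_{ijk} \, \mathrm{diag}(\lambda_i, \lambda_j, \lambda_k),
\qquad
\det(M_{ijk}) = |T| \, \lambda_i \lambda_j \lambda_k \, \det(m_{ijk})
$$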

* See a review of the nine point invariant and its
applications in “3D Model Alignment Without Computing Pose”,
by J.L. Mundy et al. in these proceedings.




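It can be shown that there are only two functionally
independent ratios of the determinants of the matrices
of the five points which are invariant to projective
transformations and to the homogeneous scale factors:

$$
I_1 = \frac{\det(M_{431}) \, \det(M_{521})}{\det(M_{421}) \, \det(M_{531})}
    = \frac{|T| (\lambda_4 \lambda_3 \lambda_1) \det(m_{431}) \; |T| (\lambda_5 \lambda_2 \lambda_1) \det(m_{521})}
           {|T| (\lambda_4 \lambda_2 \lambda_1) \det(m_{421}) \; |T| (\lambda_5 \lambda_3 \lambda_1) \det(m_{531})}
    = \frac{\det(m_{431}) \, \det(m_{521})}{\det(m_{421}) \, \det(m_{531})}
$$

In the same way,

$$
I_2 = \frac{\det(m_{421}) \, \det(m_{532})}{\det(m_{432}) \, \det(m_{521})}
$$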

Note that if any three of the points are collinear, the
corresponding point matrix $m_{ijk}$ becomes singular and the
corresponding invariant is undefined. Since points and lines
are dual in the projective plane, we have immediately that
these functions are also invariants for a system of five
coplanar lines, no three of which are concurrent.
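
The two invariants can be checked numerically. The following is a minimal sketch, not the authors' implementation: the function names, the sample points, the matrix `T` and the per-point scale factors are all illustrative assumptions. It evaluates $I_1$ and $I_2$ from the determinant ratios above, then applies a projective map and an arbitrary rescaling of each homogeneous point to confirm that both values are unchanged.

```python
def det3(a, b, c):
    """Determinant of the 3x3 matrix whose columns are the points a, b, c."""
    return (a[0] * (b[1] * c[2] - b[2] * c[1])
            - b[0] * (a[1] * c[2] - a[2] * c[1])
            + c[0] * (a[1] * b[2] - a[2] * b[1]))


def five_point_invariants(p):
    """Return (I1, I2) for five homogeneous points p[0..4] (points 1..5 in the text)."""
    def m(i, j, k):
        return det3(p[i - 1], p[j - 1], p[k - 1])
    i1 = (m(4, 3, 1) * m(5, 2, 1)) / (m(4, 2, 1) * m(5, 3, 1))
    i2 = (m(4, 2, 1) * m(5, 3, 2)) / (m(4, 3, 2) * m(5, 2, 1))
    return i1, i2


def project(T, p, scale):
    """Apply the 3x3 projective map T to p, then rescale the homogeneous vector."""
    return tuple(scale * sum(T[r][c] * p[c] for c in range(3)) for r in range(3))


# Illustrative data (assumed): five coplanar points, no three collinear.
model = [(0.0, 0.0, 1.0), (4.0, 0.5, 1.0), (5.0, 3.0, 1.0),
         (2.0, 5.0, 1.0), (-1.0, 2.5, 1.0)]
T = [[1.2, 0.3, 5.0], [-0.4, 0.9, 2.0], [0.001, 0.002, 1.0]]
scales = [1.0, 2.0, -0.5, 3.0, 0.7]  # one arbitrary scale factor per point
image = [project(T, p, s) for p, s in zip(model, scales)]

before = five_point_invariants(model)
after = five_point_invariants(image)
print(before, after)
```

Each point index appears equally often in the numerator and denominator of each ratio, which is why the factors $|T|$ and $\lambda_i$ cancel.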

2.2  Deformable  Templates 

Deformable templates can be used to detect features in
images [Lipson et al., 1990]. The templates are specified
by a set of parameters, in our case model points and the
projective invariant relationships among them. This
enables a priori knowledge about the expected relationship
of the image features to guide the detection process. The
deformable templates interact with the image data in a
dynamic manner. An objective function is defined which
“attracts” the template to the image features, which in
this experiment are vertices defined by points of high
curvature along edges in image intensity. The maximum
of the objective function corresponds to the best fit with
the image data. The objective function is a double sum
over the model and image features:

$$
\sum_{i \in M} \sum_{j \in D} \ldots
$$

where $M$ is the set of model features, $D$ is the set of
image features and $r_{ij}$ is the Euclidean distance from the
location of a given image feature, $j$, to a given template
feature, $i$. Note that:

•  No  assumption  is  made  about  the  correspondence 
between  image  and  model  features. 

•  Since the constraints on the template configuration
are projective invariants derived from the model,
the template can only take on configurations that
correspond to valid projections of the model.
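
To make the correspondence-free idea concrete, here is a minimal sketch of an objective that sums over every template/image feature pair. The Gaussian fall-off in the distance $r_{ij}$ and the width `sigma` are assumptions made purely for illustration; the source does not give the form of the summand, and the function name `attraction` is hypothetical.

```python
import math


def attraction(template, features, sigma=2.0):
    """Sum an assumed Gaussian kernel of r_ij over all template/feature pairs.

    No template point is assigned to a particular image feature; every pair
    contributes, and nearby pairs dominate the total.
    """
    total = 0.0
    for tx, ty in template:
        for fx, fy in features:
            r2 = (tx - fx) ** 2 + (ty - fy) ** 2
            total += math.exp(-r2 / (2.0 * sigma ** 2))
    return total


# Illustrative image vertices: five that match the template plus one clutter vertex.
features = [(0.0, 0.0), (10.0, 0.0), (10.0, 10.0), (0.0, 10.0), (5.0, 15.0), (30.0, 30.0)]
aligned = features[:5]
shifted = [(x + 4.0, y + 4.0) for x, y in aligned]

print(attraction(aligned, features), attraction(shifted, features))
```

With these assumed values the aligned template scores higher than the shifted one, so maximizing the sum pulls the template toward the image vertices without fixing which vertex matches which model point.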

2.3  Solving  Constraints 

The  role  of  the  constraint  solver  is  to  solve  the  prob¬ 
lem  of  finding  a  configuration  of  template  features  which 
satisfies  all  of  the  constraints  defined  by  the  model,  and 
agrees  as  well  as  possible  with  the  image  features. 

The two goals of finding the global maximum of the
objective function, $\nabla f(x) = 0$, and satisfying the
constraints, $h(x) = 0$, are combined to give a constrained
minimization problem. A linear approximation to this
optimization problem is:

$$
\begin{aligned}
\nabla^2 f(x) \, dx &= -\nabla f(x) \\
\nabla h(x) \, dx &= -h(x)
\end{aligned}
\qquad (1)
$$

Since the two goals cannot in general be simultaneously
satisfied, a least-square-error satisfaction of
$\nabla f(x) = 0$ is sought. The constraint equations are
multiplied by a factor $\sqrt{c}$, which determines the weight
given to satisfying the constraints versus minimizing the cost
function. Each iteration of (1) has a line search that
minimizes the least-square-error:

$$
m(x) = |\nabla f(x)|^2 + c \, |h(x)|^2 \qquad (2)
$$

which is a merit function similar to the objective of the
standard penalty method.

Starting with $c = 0$, system (1) converges to the
unconstrained global minimum first, and so avoids local
minima and singularities on the constraint surface. This
convergence is efficient, because the fitting error $f(x)$ is
well approximated by a quadratic, and so the Hessian
$\nabla^2 f$ is almost constant. When the Hessian $\nabla^2 f$
is constant, the best-fit surface is a linear subspace with
zero curvature and no singularity, a lot simpler than the
constraint surface.

The convex fitting function $f(x)$ can be viewed as a
regularizer in the solution of $dx$. It makes the constraint
problem well-posed by using empirical data whenever
additional constraints are needed to pin down free variables
in $h(x) = 0$. Levenberg and Marquardt have shown that
varying $\sqrt{c}$ by factors of 10 is an effective method to
force convergence for nonlinear systems [Marquardt, 1963,
Press et al., 1988].
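
The iteration described above can be sketched on a toy problem. Everything specific here is an assumption for illustration and not the authors' C++ solver: the fitting error is $f(x) = \frac{1}{2}|x - d|^2$ for a made-up data point `D`, the single constraint is the unit circle, and the weight schedule and iteration count are arbitrary. Each step solves the stacked system (1) in the least-squares sense through its normal equations, with the constraint row weighted by $\sqrt{c}$, and a backtracking line search guards the merit function (2).

```python
def solve(A, b):
    """Solve the square system A x = b by Gaussian elimination with pivoting."""
    n = len(b)
    M = [row[:] + [b[i]] for i, row in enumerate(A)]
    for col in range(n):
        piv = max(range(col, n), key=lambda r: abs(M[r][col]))
        M[col], M[piv] = M[piv], M[col]
        for r in range(col + 1, n):
            factor = M[r][col] / M[col][col]
            for k in range(col, n + 1):
                M[r][k] -= factor * M[col][k]
    x = [0.0] * n
    for r in range(n - 1, -1, -1):
        x[r] = (M[r][n] - sum(M[r][k] * x[k] for k in range(r + 1, n))) / M[r][r]
    return x


D = (0.9, 1.2)  # assumed data point attracting the template parameters

def grad_f(x): return [x[0] - D[0], x[1] - D[1]]      # gradient of 0.5*|x - D|^2
def hess_f(x): return [[1.0, 0.0], [0.0, 1.0]]        # constant Hessian
def h(x): return x[0] ** 2 + x[1] ** 2 - 1.0          # assumed constraint: unit circle
def grad_h(x): return [2.0 * x[0], 2.0 * x[1]]

def merit(x, c):
    g = grad_f(x)
    return g[0] ** 2 + g[1] ** 2 + c * h(x) ** 2      # equation (2)

def step(x, c):
    """Least-squares solution of system (1), constraint row weighted by sqrt(c)."""
    H, g, J, hv, n = hess_f(x), grad_f(x), grad_h(x), h(x), len(x)
    A = [[sum(H[k][i] * H[k][j] for k in range(n)) + c * J[i] * J[j]
          for j in range(n)] for i in range(n)]
    b = [-(sum(H[k][i] * g[k] for k in range(n)) + c * J[i] * hv) for i in range(n)]
    return solve(A, b)

def fit(x, weights=(0.0, 1.0, 10.0, 1e2, 1e3, 1e4, 1e5, 1e6), iters=20):
    for c in weights:                                  # c grows by factors of 10
        for _ in range(iters):
            dx, t = step(x, c), 1.0
            # Backtrack until the merit function does not increase.
            while t > 1e-8 and merit([xi + t * di for xi, di in zip(x, dx)], c) > merit(x, c):
                t *= 0.5
            x = [xi + t * di for xi, di in zip(x, dx)]
    return x

x = fit([0.0, 0.0])
print(x)
```

With $c = 0$ the first stage lands on the unconstrained minimum `D`; as the weight grows, the iterate is drawn onto the constraint and, for this toy problem, approaches the point of the circle nearest to `D`.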

3  Implementation 

This algorithm was implemented in C++ as an object
recognition  sub-system  of  an  image  analyst’s  workstation 
being  developed  by  General  Electric.  The  workstation 
provides  facilities  for  displaying  images,  overlaying  line 
drawings  and  annotations,  performing  geometric  manip¬ 
ulations  (scaling,  panning,  and  control  point  warping, 
etc.)  and  low-level  image  processing  functions  (convolu¬ 
tion,  contrast  adjustment,  edge  detection,  etc.). 

To  use  the  object  recognition  system,  the  operator  se¬ 
lects  a  model  from  the  library  and  selects  the  region  of 
interest  in  the  image.  The  system  segments  that  region 
of  the  image  if  necessary  (it  may  have  it  cached  from  a 
previous  run),  then  attempts  to  find  a  subset  of  the  ver¬ 
tices  in  the  region  that  are  consistent  with  a  projection  of 
the  model,  using  the  techniques  described  in  the  previous 
sections. A confidence value, or response, is calculated
as the value of the objective function for the final
configuration, expressed as a percentage of the maximum
attainable value of the objective function. The final configuration
of  the  template  and  the  confidence  value  are  displayed. 
The  flow  of  the  data  through  the  system  is  shown  in 
Figure  1. 

The following sections describe this process in more
detail.

Figure 1: The flow of data through the SAR image analysis system.


3.1  Area Selection and Image Segmentation

The  user  starts  the  processing  by  selecting  an  area  of  the 
image  with  the  mouse.  The  user  places  a  rubber-banding 
rectangle  around  the  area  in  the  image  under  consider¬ 
ation.  With  the  area  selected,  a  right-mouse  click  starts 
the  processing. 

The image is segmented, if necessary (it may be cached
from a previous run), into edges and vertices using a
modified form of the Canny edge detector followed by line
segmentation based on chain curvature extrema [Canny,
1983, Asada and Brady, 1984]. The program next selects
all the vertices from the segmentation of the image that
are contained within the region of interest created by the
user.

3.2  Determining  the  Initial  Configuration 

There are likely to be quite a few vertices contained within
the region of interest. The model, however, relates
exactly five vertices in the invariant relationship. This
creates the problem of determining the starting
configuration for the template features. Upon initial
consideration, the number of possible initial configurations
appears to make the problem nearly intractable; however,
a number of simple and fast filters can be applied which
reduce the number of candidates considerably.

1.  The  initial  configuration  does  not  need  to  be  very 
accurate.  This  allows  groups  of  adjacent  vertices  to 
be  merged  and  replaced  by  their  centroid. 

2.  The  positions  of  the  vertices  in  the  initial  config¬ 
uration  must  approximately  satisfy  the  projective 
invariants  defined  by  the  model. 


3.  The  topology  of  the  model  is  known,  therefore  as¬ 
signments  which  would  result  in  an  impossible  con¬ 
figuration  can  be  ruled  out. 

By  using  these  three  strategies,  we  are  typically  able 
to  reduce  the  possible  initial  configurations  by  a  factor 
of  over  10®. 
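
The first two filters can be sketched as follows. This is an illustrative reconstruction under stated assumptions, not the authors' code: the greedy merging rule, the merge radius, the relative tolerance `tol` and the brute-force enumeration of ordered five-vertex assignments are all hypothetical choices, and the topology filter (item 3) is omitted.

```python
from itertools import permutations
from math import hypot


def det3(a, b, c):
    """Determinant of the 3x3 matrix whose columns are the points a, b, c."""
    return (a[0] * (b[1] * c[2] - b[2] * c[1])
            - b[0] * (a[1] * c[2] - a[2] * c[1])
            + c[0] * (a[1] * b[2] - a[2] * b[1]))


def invariants(pts):
    """(I1, I2) of five image points given as (x, y); z = 1 is appended."""
    p = [(x, y, 1.0) for x, y in pts]
    def m(i, j, k):
        return det3(p[i - 1], p[j - 1], p[k - 1])
    return ((m(4, 3, 1) * m(5, 2, 1)) / (m(4, 2, 1) * m(5, 3, 1)),
            (m(4, 2, 1) * m(5, 3, 2)) / (m(4, 3, 2) * m(5, 2, 1)))


def merge_close(vertices, radius):
    """Filter 1: greedily replace each group of nearby vertices by its centroid."""
    remaining, merged = list(vertices), []
    while remaining:
        seed = remaining[0]
        group = [v for v in remaining
                 if hypot(v[0] - seed[0], v[1] - seed[1]) <= radius]
        remaining = [v for v in remaining if v not in group]
        merged.append((sum(v[0] for v in group) / len(group),
                       sum(v[1] for v in group) / len(group)))
    return merged


def candidate_assignments(vertices, model_invariants, tol=0.05):
    """Filter 2: keep ordered 5-tuples whose invariants are near the model's."""
    keep = []
    for combo in permutations(vertices, 5):
        try:
            values = invariants(combo)
        except ZeroDivisionError:
            continue  # a collinear triple leaves the invariant undefined
        if all(abs(v - m) <= tol * abs(m) for v, m in zip(values, model_invariants)):
            keep.append(combo)
    return keep
```

A short usage example with made-up coordinates: `candidate_assignments(merge_close(vertices, 1.5), invariants(model_points))` returns only those ordered assignments whose invariants agree with the model to within the chosen tolerance.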

3.3  Running  the  Constraint  Solver 

At  this  point  there  exist  a  number  of  possible  initial  con¬ 
figurations  (in  our  experiments  as  many  as  50)  that  could 
potentially  lead  to  the  correct  solution.  For  each  con¬ 
figuration,  a  separate  instance  of  the  constraint  solver 
process  is  spawned.  Since  each  instance  is  independent, 
this  allows  for  parallel  processing  by  spawning  shells  to 
various  available  machines.  As  each  instance  of  the  con¬ 
straint  solver  finishes,  it  outputs  the  final  configuration 
and  the  value  of  the  objective  function.  Higher  values 
of  the  objective  function  indicate  greater  agreement  be¬ 
tween  template  (and  therefore  model)  features  and  image 
features. All the resultant values are sorted, the largest
is selected and the corresponding assignments are
displayed. A response figure is calculated as a percentage
of the maximum attainable value of the objective
function.
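
The run-and-select step can be sketched as below. This is only an outline under assumptions: `run_solver` is a hypothetical stand-in with a toy score in place of the real constraint solver, and a local thread pool stands in for the shells spawned on other machines.

```python
from concurrent.futures import ThreadPoolExecutor


def run_solver(initial):
    """Hypothetical stand-in for one constraint-solver run.

    Returns (final_configuration, objective_value). The configuration is left
    unchanged and scored with a toy objective, purely for illustration.
    """
    score = sum(1.0 / (1.0 + abs(v)) for v in initial)
    return initial, score


def best_response(initial_configurations, max_attainable, workers=4):
    """Run every solver instance independently and keep the highest objective."""
    with ThreadPoolExecutor(max_workers=workers) as pool:
        results = list(pool.map(run_solver, initial_configurations))
    final, value = max(results, key=lambda r: r[1])
    return final, 100.0 * value / max_attainable  # response as a percentage


configs = [(3.0, 1.0), (0.0, 0.0), (0.5, 2.0)]
final, response = best_response(configs, max_attainable=2.0)
print(final, response)
```

Because the instances share no state, the same pattern applies whether the work is spread over threads, processes or separate machines.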

3.4  Map  Overlay  and  Warping 

Now the correspondence between the model vertices and
the image vertices which produces the greatest response
is known. If the response is greater than a prespecified
level (for example, 85%), we conclude that the user
did in fact select an area of the image that corresponds
to the model. Then, using the model/image vertex
correspondence, it is a simple matter to warp the map and
overlay it on top of the SAR image. This not only
positions and orients the instance of the selected model for
the user, but also reveals many other features in the image
that are not easily seen.
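
One way to realize such a warp from point correspondences is to fit a plane projective map by least squares. The sketch below is an assumption about how this could be done; the text says only that a control point warp is applied. The function names are hypothetical and the last matrix entry is fixed to 1.

```python
def solve(A, b):
    """Solve the square system A x = b by Gaussian elimination with pivoting."""
    n = len(b)
    M = [row[:] + [b[i]] for i, row in enumerate(A)]
    for col in range(n):
        piv = max(range(col, n), key=lambda r: abs(M[r][col]))
        M[col], M[piv] = M[piv], M[col]
        for r in range(col + 1, n):
            factor = M[r][col] / M[col][col]
            for k in range(col, n + 1):
                M[r][k] -= factor * M[col][k]
    x = [0.0] * n
    for r in range(n - 1, -1, -1):
        x[r] = (M[r][n] - sum(M[r][k] * x[k] for k in range(r + 1, n))) / M[r][r]
    return x


def fit_homography(src, dst):
    """Least-squares 3x3 projective map (last entry fixed to 1) with src[i] -> dst[i]."""
    rows, rhs = [], []
    for (x, y), (u, v) in zip(src, dst):
        rows.append([x, y, 1.0, 0.0, 0.0, 0.0, -u * x, -u * y])
        rhs.append(u)
        rows.append([0.0, 0.0, 0.0, x, y, 1.0, -v * x, -v * y])
        rhs.append(v)
    n = 8
    AtA = [[sum(r[i] * r[j] for r in rows) for j in range(n)] for i in range(n)]
    Atb = [sum(r[i] * t for r, t in zip(rows, rhs)) for i in range(n)]
    p = solve(AtA, Atb)  # normal equations of the linearized system
    return [p[0:3], p[3:6], [p[6], p[7], 1.0]]


def warp_point(H, point):
    """Map a map coordinate into the image with the projective map H."""
    x, y = point
    w = H[2][0] * x + H[2][1] * y + H[2][2]
    return ((H[0][0] * x + H[0][1] * y + H[0][2]) / w,
            (H[1][0] * x + H[1][1] * y + H[1][2]) / w)
```

Given the five model/image vertex pairs, `fit_homography(model_vertices, image_vertices)` yields a map that `warp_point` can apply to every other map feature, which is how features that are hard to see in the image can still be positioned on it.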

4  Description  of  the  Experiment 

A  sample  image  used  in  this  experiment  is  shown  in  Fig¬ 
ure  2.  The  image  encompasses  an  area  of  about  20  miles 
by  20  miles  in  northwestern  Pennsylvania.  The  most 
prominent  features  are  3  small  lakes.  Some  other  eas¬ 
ily  visible  features  include  roads  and  cleared  powerline 
corridors. 

The  United  States  Geological  Survey  (USGS)  map  of 
the  same  area  was  located  and  digitized.  A  section  of 
this  map  is  shown  in  Figure  3. 

The largest of the three lakes (Twelve Mile Pond) was
selected as the image feature we wished to locate. A
model of this lake was constructed using the five-point
invariant and data measured from the map. Five points
around the perimeter of the lake were selected for the
model. The points were chosen at locations of high
curvature around the lake, since contour points of high
curvature are the most likely to be detected regardless of
the viewing angle of the image.

Figure  4  shows  the  result  when  the  region  of  interest 
contains  an  instance  of  the  modeled  lake.  The  response 
is  99%.  This  means  that  after  the  constraint  solver  ad¬ 
justed  the  template,  nearly  perfect  agreement  of  model 
and  image  features  was  obtained. 

In Figure 5, we have performed a control point warp on
the map, using the resultant assignments of model and
image features, and overlaid it on the original image.

Figure 6 shows a negative result. The region of interest
does not contain an instance of the modeled lake and
consequently the response is 9%.

The  total  execution  time  for  all  of  these  examples  was 
less  than  two  minutes  on  a  single  Sun  SPARCstation  2. 
The  majority  of  the  execution  time  is  consumed  by  the 
multiple  constraint  solvers.  When  the  system  is  allowed 
to  run  constraint  solvers  on  multiple  machines  the  run¬ 
ning  time  drops  in  proportion  to  the  number  of  machines 
available. 

5  Conclusions  and  Future  Work 

In this paper we have described a technique for interpreting
images produced by Synthetic Aperture Radar that
draws upon work in the areas of projective invariants
and deformable templates, and have shown some sample
results of experiments with this technique. Work is
currently underway to expand the system to use a variety of
projective invariants and to allow the use of edge and area
data from images, as well as non-planar models.

Figure 4: The result when the region of interest contains the modeled lake.


Figure 5: The original image with the map warped and overlaid.

Figure 6: A sample result when the region of interest does not contain the modeled lake.

Abstract

It  is  generally  believed  that  the  detailed  analysis  of 
remotely  sensed  imagery  requires  the  extraction  of  a 
variety  of  partial  image  domain  cues  coupled  with  the 
use  of  a  priori  or  contextual  information.  In  some 
cases there are fundamental limits to the variety and
type of information that may be extracted from a single
image  or  stereo  pair.  However,  in  most  cases  a 
sufficient  variety  of  cues  can  be  extracted;  the  major 
issue  is  in  how  to  utilize  disparate  scene  cues  to 
achieve  a  more  complete  and  accurate  overall  scene 
interpretation. 

The  focus  of  this  paper  is  to  examine  how  estimates  of 
three-dimensional  scene  structure,  as  encoded  in  a 
scene  disparity  map,  can  be  improved  by  the  analysis 
of  the  original  monocular  imagery.  This  paper 
describes  the  utilization  of  surface  illumination 
information  provided  by  the  segmentation  of  the 
monocular image into fine surface patches of nearly
homogeneous  intensity  to  remove  mismatches 
generated  during  stereo  matching.  These  patches  are 
used  to  guide  a  statistical  analysis  of  the  disparity  map 
based  on  the  assumption  that  such  patches  correspond 
closely  with  physical  surfaces  in  the  scene.  We  present 
refinement  results  on  complex  urban  scenes  containing 
various  man-made  and  natural  features  and 
demonstrate  the  improvements  due  to  monocular 
fusion with a set of different region-based image
segmentations.

1.  Introduction

One  common  problem  for  systems  that  interpret 
multiple  sources  of  sensed  data  is  the  fusion  of  partial 
results  from  a  variety  of  sources.  This  problem 
appears under many guises. For example, given a set
of different scene descriptions generated from a single
image using a variety of image analysis techniques,
how does one intelligently combine such partial
information [Shufelt and McKeown 90]? The
introduction  of  additional  sensor  types,  temporal 
imagery,  and  multiple-look  imagery  create  dimensions 
along  which  information  fusion  must  be  performed;  as 
such,  the  complexity  of  the  problem  can  increase.  In 
some cases, increased amounts of data provide
improved information. This does not necessarily
follow, however: complex systems having different
sources of error may neither reinforce correct partial
interpretations nor refute incorrect ones.

Thus,  the  key  issue  is  the  integration  of  many  different 
sources  of  partial  information.  In  computer  vision  (and 
in  particular,  three-dimensional  scene  analysis),  the 
goal  is  to  generate  an  interpretation  of  the  scene  that  is 
as  close  as  possible  to  the  actual  scene  imaged.  Such 
an  interpretation  can  include  the  delineations  and 
heights  of  buildings,  a  digital  elevation  model,  and  the 
centerline and width of roads in a transportation
network.  Our  belief  is  that  no  individual  computer 
vision  technique  can  reliably  provide  a  complete  scene 
reconstruction.  To  achieve  good  performance,  we  need 
to  gather  a  variety  of  information,  extracted  by  various 
processes  from  the  imagery,  and  synthesize  this 
disparate  information  into  a  consistent  model.  Figure  1 
shows  a  possible  structure  for  such  a  scene 
interpretation  system. 

From the three-dimensional scene (G) we generally
acquire two-dimensional imagery generated by a
variety of different sensors. For example, a stereo pair
of intensity images would represent such imagery.
As is well understood, the problem of interpreting the
two-dimensional image (I) as a three-dimensional
scene is underconstrained. In certain cases, we may
have access to high-level knowledge about the contents
of the scene, or particular objects that can be found in
the scene. Such knowledge can loosely be called a
Model (M). For example, in the case of aerial imagery
we may have knowledge about the sensor resolution,
the general characteristics of the scene (airport, urban
area, rural area), etc. From the representation (I), we
try to extract features {Ai} that will allow us to interpret
the scene. These features are typically segmentations,
edge maps, disparity maps, intensity maps, and the
like. These can be thought of as a set of intrinsic
images and primitives for intermediate and high-level
vision [Barrow and Tenenbaum 78, Poggio 88]. In
order to fuse the information embodied in these
different "images", we need a common framework of
representations (formed by the {Ei}). This framework
needs to allow many, if not all, of the {Ai} features to
be represented. The utilization of a common
representation makes information fusion simpler and
allows the generation of an interpretation (F), which
then allows the generation of our scene model (G').
This model can be used to iterate through the fusion
process again in conjunction with extra knowledge
about the scene obtained from (M). This initial
interpretation of the scene can help in the extraction of
features {Ai}, the transformation of the features into the
common representation, the merging process, and even
the generation of the scene model.

Figure 1: Data fusion in image analysis

Depending  on  the  interpretation  of  the  scene  for  which 
we  are  looking,  we  may  need  a  varying  amount  of 
information;  in  most  cases,  more  information  is 
generally  desirable.  For  instance,  many  techniques 


extract  most  of  the  necessary  information  for  scene 
interpretation  from  a  single  intensity  image;  such 
techniques  are  said  to  apply  monocular  analysis.  It  is 
possible  to  take  advantage  of  stereo  disparity,  however, 
to  obtain  more  information  that  may  be  useful  for 
disambiguation  of  monocular  interpretations. 
Techniques  utilizing  stereo  imagery  are  said  to  apply 
binocular  analysis  or  stereo  analysis.  Other 
information  such  as  global  constraints  or  world  models 
can  be  useful  for  further  interpretation  and 
disambiguation,  but  we  believe  that  stereo  analysis  is  a 
necessary  step  towards  a  coherent  interpretation  of  the 
scene. 

In  this  paper  we  describe  a  technique  to  merge 
information  extracted  from  aerial  imagery  using  a 
common  region-based  representation  and  show  how 
disparate  scene  cues  can  be  integrated  to  achieve  a 
more  complete  and  accurate  overall  scene 
interpretation.  The  particular  task  at  hand  is  the 
refinement  of  stereo  disparity  data  using  monocular 
intensity  information.  In  Section  2  we  describe 
techniques  to  improve  the  accuracy  of  a  stereo 
disparity  map  using  a  single  segmentation  of  the  left 
intensity  image  of  a  stereo  pair.  Thus,  we  are  able  to 
recover  from  mismatches  generated  during  stereo 
matching  by  re-utilizing  the  intensity  image  that  was 
originally  used  in  the  matching  process.  In  Section  3 
we  discuss  some  experimental  results  on  disparity 
refinement  and  describe  techniques  that  allow  for  the 
integration  of  additional  scene  segmentations  to 
provide for a more robust refinement process. Finally,
in Section 5 we give some future directions of this
work  in  building  extraction  and  built-up  area  analysis 
and  speculate  on  how  these  techniques  could  be 
integrated  into  a  more  general  three-dimensional  scene 
interpretation  system. 

2.  One  approach  to  information  fusion 

In  our  research  we  utilize  scene  domain  cues  derived 
from  monocular  analysis  and  stereo  analysis  of 
left/right  stereo  image  pairs.  In  the  case  of  monocular 
analysis,  one  source  of  information  is  a  region  based 
segmentation  of  the  left  or  right  image.  In  the  case  of 
stereo analysis, our cues are primarily disparity maps
derived from area-based (S1) and feature-based (S2)
stereo matching systems [McKeown 86, Hsieh et al.
92].

These  image-based  cues  provide  different  information 
concerning properties of man-made structures and
terrain  surfaces  in  the  scene.  In  the  case  of  three- 
dimensional  reconstruction,  we  can  make  the 
assumption  that  the  scene  is  composed  of  surfaces 
whose  information  content  is  primarily  in  terms  of 
surface  orientation  and  radiometry.  Under  these 
assumptions,  we  will  see  how  estimates  of  three- 
dimensional  scene  structure  (as  encoded  in  a  scene 
disparity  map)  can  be  improved  by  the  analysis  of  the 
original  monocular  imagery. 

Previous work in the general area of surface fitting
includes the use of planar patches [Eastman and
Waxman 87] and both planar and quadratic patches
[Hoff & Ahuja 89] to fit sparse stereo data to object
surfaces. Work in the area of analysis and correction
of stereo matching as a post-processing step includes
[Mohan & Nevatia 89] and [Cochran & Medioni 89].
The  work  of  Luo  and  Maitre  [Luo  90]  used  a  single 
image  segmentation  composed  of  relatively  small 
patches  as  a  basis  for  fitting  planar  and  quadratic 
surfaces.  We  use  this  basic  idea,  but  simplify  the  patch 
fitting  to  one  based  solely  on  a  statistical  estimate  of 
the  disparity  values.  We  extend  this  idea  to  include 
multiple  segmentations,  each  providing  a  different 
estimate  of  the  location  of  the  partial  object  surface. 

We  have  two  sources  of  information  that  can  be 
viewed  as  different  representations  of  the  physical 
surfaces  found  in  the  scene:  disparity  maps  resulting 
from  different  stereo  matchers  providing  the  heights  of 
the  surfaces  in  the  scene  and  the  initial  intensity  images 
representing  the  radiometric  properties  of  the  surfaces 
in  the  scene.  Figures  2,  4,  and  5  are  examples  of 
"initial"  data  used  for  these  data  fusion  experiments. 
Figure  2  is  a  high  resolution  aerial  image  containing  a 
variety  of  buildings  with  complex  shapes,  typical  of  an 
industrial area. Figures 4 and 5 are disparity maps
computed using S1 area-based and S2 feature-based
methods, respectively. Figure 3 is a reference ground
truth disparity map derived using a manual
three-dimensional segmentation of the scene. These images
are  a  few  of  the  many  possible  intrinsic  images,  (Ai), 


in our general framework. It is important to note that,
as in the intrinsic image paradigm, these two sources of
information are "registered". That is, there is a
pixel-by-pixel correspondence between points in the
intensity image and points in the disparity map. In
many cases one issue complicating the use of
multi-source information is the accurate registration or
correspondence between the information sources
themselves.

An  intensity  image,  subject  to  sampling  and 
digitization  errors,  poses  difficulties  for  monocular 
analysis  techniques  such  as  segmentation.  On  the  other 
hand, most stereo matching algorithms are fooled by
different  variations  in  the  stereo  pairs,  which  cause 
mismatches  in  the  disparity  maps.  The  mismatches  in 
disparity  maps  primarily  result  from  geometric  and 
radiometric  differences  in  the  left  and  right  images, 
rather  than  local  digitization  or  sampling  errors  in  the 
intensity  images.  Thus,  it  is  possible  to  use 
information  from  the  intensity  images  to  reduce  the 
number  of  mismatches  introduced  by  stereo  matching 
processes. 


2.1.  Region  based  interpretation 

Our  approach  utilizes  surface  illumination  information, 
provided by the segmentation of the monocular images
into  fine  surface  patches  of  nearly  homogeneous 
intensity,  to  remove  mismatches  generated  during 
stereo  matching.  First,  we  segment  the  intensity  image 
into  uniform  intensity  regions.  These  regions 
correspond  to  approximately  planar  surfaces  in  the 
image.  We  assume  that  the  orientation  and  surface 
material  are  the  primary  factors  for  the  radiometry  of 
the  image.  Under  these  assumptions,  uniform  image 
radiometry  is  produced  by  a  planar  surface,  of  a  certain 
orientation  and  material,  in  the  scene. 

These  surfaces  should  have  continuous  linear  disparity 
values  (i.e.,  the  disparity  values  of  these  regions  are 
represented  by  continuous  linear  functions).  Since  the 
disparity  map  contains  some  noise,  however,  most  of 
the  regions  segmented  in  the  intensity  image  have 
disparity  functions  that  are  neither  linear  nor 
continuous.  Ideally,  we  would  like  to  approximate  the 
actual  disparity  functions  over  the  uniform  intensity 
regions  by  the  appropriate  linear  functions. 

Figure 2: DC38008 Industrial Scene

Figure 3: DC38008 Disparity Reference

Figure 4: DC38008 S1 Disparity

Figure 5: DC38008 S2 Disparity

The problem of approximating a surface in
three-dimensional space by a reasonable planar surface is a
difficult one; we approximate such surfaces by
horizontal surfaces. Then the disparity value is the same
for every pixel in a region, and the problem
is reduced to the selection of the best value for the
heights of these surfaces. The general problem is that
of locating the surface which satisfies the equation

ax + by + cz + d = 0

Given (x, y), we should be able to obtain

z = (-ax - by - d)/c

We assume here that z' = -d'/c' only. Then the problem
is to find (-d'/c') that best fits the surface so that

ax + by + c(-d'/c') + d ≈ 0

or to find z' so that z - z' would have a minimal value
over the region (this can be the weighted mean of the z
distribution or the most 'representative' value of the z
distribution). In other words, we need only select a
single disparity value for each region. Since we are
using an over-segmentation of the image, a piecewise
planar disparity map gives a good approximation of the
relief in the scene. Furthermore, since we are
interested in building extraction in aerial images, this
approximation will be adequate.
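
As a minimal sketch of selecting a single disparity value per region, the code below replaces every pixel's disparity with a representative value of its region. Using the median as the 'representative' value is an assumption made for illustration (the text also allows a weighted mean), and the function name and the tiny arrays are hypothetical.

```python
from collections import defaultdict
from statistics import median


def refine_disparity(disparity, labels):
    """Assign each pixel the median disparity of the region it belongs to.

    disparity and labels are equally sized lists of rows; labels holds the
    region identifier of each pixel from the intensity segmentation.
    """
    by_region = defaultdict(list)
    for d_row, l_row in zip(disparity, labels):
        for d, label in zip(d_row, l_row):
            by_region[label].append(d)
    representative = {label: median(values) for label, values in by_region.items()}
    return [[representative[label] for label in l_row] for l_row in labels]


# Illustrative 2x4 disparity map with two mismatches (9 and 40) in region 0.
disparity = [[5, 5, 9, 1],
             [5, 40, 1, 1]]
labels = [[0, 0, 0, 1],
          [0, 0, 1, 1]]
refined = refine_disparity(disparity, labels)
print(refined)
```

In this example the isolated mismatches inside region 0 are replaced by the value shared by most of the region, which is the effect the region-based refinement aims for.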


This  region-based  interpretation  has  been  developed 
for  two  different  applications.  We  show  how  this 
approach  can  support  information  fusion  from  different 
segmentations as well as across multiple disparity
estimates  based  upon  a  local  decision  making 
evaluation.  In  Section  3  we  describe  how  improved 
disparity  maps  may  be  obtained  by  correcting  the 
mismatches  produced  by  stereo  matchers  and  by 
refining  the  disparity  discontinuities. 

2.2.  Intensity  Segmentation  Techniques 

The general scene segmentation problem is, of course,
a very difficult one and has a long history in image
processing and computer vision. There are no
universal segmentation techniques that work well
across a variety of imagery and tasks. Such low level
algorithms typically differ in their approaches; they
may utilize intensity-based, area-based, or edge-based
techniques. Some systems combine these techniques
into hybrid algorithms. We have concentrated on those
segmentation methods that produce (nearly) uniform
intensity regions because we wish to detect those
image regions that correspond to oriented surface
patches in the scene. We utilize a region segmentation
algorithm based upon the histogram splitting
paradigm [Ohlander 78] and a region growing
algorithm [Yakimovsky 76] which takes into account
edge strength and shape criteria [McKeown 84].
Interestingly, while neither of these methods gives
completely satisfactory segmentation results, they
provide good over-segmentations that rarely merge
object/background boundaries. Both techniques will
also provide different segmentations based upon
modification of a small set of parameters.

In  our  initial  experiments  we  generated  three  scene 
segmentations;  two  by  using  different  parameters  for 
histogram  selection,  and  one  by  using  region  growing. 
These  segmentations  provided  the  basis  for  our  work  in 
intensity/disparity  fusion,  the  goal  of  which  was  to 
produce  an  improved  three-dimensional  scene 
interpretation. Figures 7, 8, and 9 show examples of
these segmentations on the DC38008 industrial intensity
image.  As  a  preprocessing  step,  we  smoothed  the 
original  image  (Figure  6)  using  a  Nagao  filter. 

2.2.1.  Machineseg 

One  of  the  major  difficulties  with  region  growing 
techniques  in  complex  scenes  is  the  difficulty  in 
determining  automatic  stopping  conditions  for  the 
merging  procedure.  MACHINESEG  [McKeown  84]  is  a 
region  growing  system  that  tries  to  preserve  edges 
between  regions  and  stops  the  growing  procedure  when 
certain  shape  or  spectral  criteria  are  not  satisfied  inside 
the region. It adds a decision procedure that evaluates the
effect of the next merge operation and either allows the
merge to proceed or rejects it. In the case of
disparity map refinement, we want the regions to be
sufficiently uniform that they could be treated as planar
(or at least "soft") surfaces. We also limited the size of
the generated regions so that very small regions could
not be generated, as these could be considered noise or
non-representative regions. Since we do not
consider regions smaller than our noise
threshold (20 pixels), our segmentation in Figure 7 is
not  a  complete  partition  of  the  image.  However,  it 
does  produce  regions  for  most  of  the  important 
surfaces  in  the  image. 


2.2.2.  Colorseg 

This histogram splitting technique is based on the
extraction of regions with limited intensity ranges (in
other words, regions of approximately uniform
intensity). The technique searches for the peaks in the
histogram of the image and segments the regions
whose intensity values fall in windows around these
peaks. The regions are then removed from the image
and the process continues until all the pixels in the
image have been removed. This process results in a
segmentation composed of connected regions, each
having an intensity range less than a certain threshold.
This technique does not guarantee preservation of the
edges (in particular, small edges) but it may ignore
local noise with strong edges that other techniques will
classify as regions. As in the previous technique, we
removed very small regions (less than 20 pixels) that
could be considered noise before further processing.
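
The peak-and-window idea can be illustrated on a one-dimensional list of intensities. The following is a minimal sketch and not the COLORSEG implementation: the function names `histogram_split` and `window_index`, the symmetric window around each peak, and the tie-breaking rule are all assumptions, and the connected-component labeling and small-region removal steps are omitted.

```python
from collections import Counter


def histogram_split(pixels, max_range=10):
    """Return intensity windows found by repeatedly removing the histogram peak.

    Hypothetical helper: the symmetric window of total width `max_range`
    centred on each peak is an assumption made for illustration.
    """
    remaining = Counter(pixels)          # intensity -> pixel count
    half = max_range // 2
    windows = []
    while remaining:
        # Highest peak; ties go to the lower intensity (arbitrary choice).
        peak = max(remaining, key=lambda v: (remaining[v], -v))
        lo, hi = peak - half, peak + half
        windows.append((lo, hi))
        # Remove every intensity covered by this window, then repeat.
        for value in [v for v in remaining if lo <= v <= hi]:
            del remaining[value]
    return windows


def window_index(value, windows):
    """Index of the first window containing `value`, or None if uncovered."""
    for index, (lo, hi) in enumerate(windows):
        if lo <= value <= hi:
            return index
    return None
```

Calling `histogram_split` with `max_range=10` and again with `max_range=20` mimics, in spirit, the two uniformity settings discussed in the text; a real segmentation would then split each intensity class into spatially connected regions.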

In  our  experiments,  we  generated  different 
segmentations  with  different  segmentation  techniques. 
For instance, using the COLORSEG technique we
generated two segmentations of the images, one with
"uniformity" defined as a maximum of 10 intensity
levels inside the region (to tolerate sensor noise and
allow  for  imperfect  planar  surfaces)  and  another  with 
"uniformity"  defined  as  a  maximum  of  20  intensity 
levels  (to  tolerate  more  noise).  An  estimation  of  the 
noise  or  the  average  intensity  range  for  the  surfaces  in 
the  image  is  a  delicate  problem,  and  the  use  of  different 
segmentations  to  estimate  the  intensity  range  inside  the 
regions  does  not  necessarily  increase  the  reliability  of 
the  process.  It  is  thus  important  that  we  obtain 
different  segmentations  of  the  scene  that  are  not 
consistent,  such  as  those  in  Figures  8  and  9.  The 
fusion  of  these  data  may  overcome  some  of  the 
inherent  problems  of  a  single  segmentation  since  they 
provide  different  local  evaluation  contexts  for  disparity 
estimates  in  the  scene.  In  the  following  sections  we 
show  how  we  can  merge  information  using  different 
intensity  segmentations. 


2.3.  Disparity  map  results 

Our  initial  height  information  for  the  industrial  scene 
was  derived  using  two  different  stereo  matching 
algorithms.  Given  these  sets  of  height  information, 
which  may  or  may  not  be  reliable  or  unique,  it 
becomes  necessary  to  use  a  data  fusion  process  in  order 
to  maximize  the  amount  of  useful  information  gained 
from  these  sets  of  height  estimates. 

We used two different matching techniques, one
area-based (S1) and the other feature-based (S2). S1 uses the
method of differences technique on neighborhoods of
the image in hierarchical fashion [Lucas 84, McKeown
86]. S2 performs a hierarchical matching of epipolar
intensity scanlines in the left and right image [Hsieh
90, Hsieh, et. al. 92]. The results of these stereo
matching algorithms are different; S1 gives us a dense
disparity map (i.e., a map containing a disparity value
for each pixel in the image), while S2 gives us a sparse
disparity map (i.e., a map containing a disparity value
for those pixels corresponding to peaks or valleys in
the intensity images). We interpolate the sparse S2
matches into a dense disparity map by step
interpolation. Our fusion mechanism will have to
correct mismatches by the S1 and S2 stereo systems as
well as those introduced by the interpolation method.
One can easily observe differences in the disparity
estimates produced by S1 and S2 shown in Figures 4
and 5, respectively.

Figure 8: COLORSEG SEG10          Figure 9: COLORSEG SEG20
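
The text does not spell out the step interpolation scheme. One plausible reading, sketched below purely as an assumption, is a zero-order hold along each scanline: every unmatched pixel takes the disparity of the nearest matched pixel to its left. The function name `step_interpolate` and the use of `None` for unmatched pixels are illustrative choices.

```python
def step_interpolate(scanline):
    """Fill None entries with the nearest known disparity to the left.

    Leading gaps take the first known value; a scanline with no
    matches at all is returned unchanged.
    """
    known = [d for d in scanline if d is not None]
    if not known:
        return list(scanline)
    filled = []
    current = known[0]                   # value used for any leading gap
    for d in scanline:
        if d is not None:
            current = d                  # a new match starts a new step
        filled.append(current)
    return filled
```

A scheme like this produces piecewise-constant disparity along the scanline, which is consistent with the later remark that step-interpolated S2 histograms contain only a few distinct values.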

3.  Fusion  Experiments 

The  goal  of  refinement  is  to  remove  mismatches, 
improve  the  location  of  disparity  discontinuities,  and  to 
obtain  the  best  height  estimate  for  each  point  in  the 
scene. The refinement process can be decomposed
into two stages. First, we perform monocular
segmentation on the left image of the stereo pair. We
generate a region representation containing intrinsic
attributes of region area, intensity distribution, and
adjacent neighbors. For each of the disparity maps we
obtain a histogram of height estimates for each region
in the segmentation. A set of disparity attributes is
computed for each region in each segmentation,
including frequency statistics for the disparity values,
an estimate of the 'best' disparity value, and a
confidence score for this value. This allows the
computation to proceed at a symbolic level on a
region-by-region basis.

In  the  second  stage  we  merge  this  information 
represented  in  the  region  data  structure  by  selecting  the 
'correct'  value  from  the  available  information,  and  by 
comparing scores based on the nature and quality of
the different pieces of information. The goal then is to
evaluate  the  quality  or  confidence  in  the  information  so 
as  to  maximize  the  amount  of  accurate  data  we  merge 
from  our  different  information  sources. 

In  general  there  are  several  approaches  to  disparity 
map  refinement  using  monocular  data.  The 
combinations  are  quite  obvious,  and  in  the  remainder 
of  this  section  we  describe  experiments  using  the  first 
three  techniques: 

1. Disparity refinement using one segmentation.

2. Disparity refinement using several segmentations.

3. Disparity refinement using one segmentation
and several disparity maps.

4. Disparity refinement using several
segmentations and several disparity maps.

3.0.1.  Simple  disparity  refinement 

In  this  first  approach,  a  histogram  is  constructed  for 
each  segmentation  region.  The  values  of  each 
histogram  are  the  disparity  values  in  each  region.  The 
most  representative  value  of  each  histogram  is  then 
selected.  In  our  case,  this  value  was  simply  that  of  the 
highest  peak  in  the  histogram.  We  chose  this  value  for 
two  reasons.  The  step-interpolated  S2  disparity  maps 
result  in  disparity  histograms  having  only  a  few  values, 
which  correspond  to  real  height  values  or  matching 
noise.  If  the  matching  is  reasonably  robust,  the  noise 
will  introduce  local  maxima  in  the  histogram  that  will 
be  smaller  in  magnitude  than  the  best  height  estimate. 
Further,  a  typical  region  histogram  for  an  S2  disparity 
map  exhibits  one  or  two  large  peaks  and  a  few  noise 
peaks that influence the average value of the
histogram,  making  it  less  reliable  as  a  representative 
value. 

For non-horizontal regions and S1 results, the average
disparity  may  suffice  for  a  reasonable  measure  of  the 
height  of  the  region.  A  confidence  score  can  be 
generated  for  these  disparity  values  based  on  the 
characteristics  of  the  histograms  (and,  conceivably,  on 
the  type  of  disparity  map  used  as  well  as  the  nature  of 
the  region  histograms).  Finally,  this  disparity  value  is 
assigned  to  the  entire  region,  under  the  assumption  that 
it  will  be  a  better  estimate  of  the  height  for  the  whole 
region.  In  most  cases,  this  removes  a  large  number  of 
the  mismatches,  but  whenever  our  initial  assumptions 
about  scene  radiometry  are  not  valid,  our  height 
estimates may differ from the correct height value.
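
The simple refinement step can be sketched as follows: build a disparity histogram per region, take the highest peak, and assign it to every pixel of the region. This is an illustration under stated assumptions, not the authors' code: the name `refine_by_region_mode`, the nested-list image representation, and the rule that ties go to the smaller disparity are all hypothetical.

```python
from collections import Counter, defaultdict


def refine_by_region_mode(labels, disparity):
    """Replace each pixel's disparity by the histogram peak of its region.

    labels, disparity: equally sized 2-D lists (region label and
    disparity value per pixel).
    """
    histograms = defaultdict(Counter)    # region label -> disparity histogram
    for label_row, disp_row in zip(labels, disparity):
        for label, d in zip(label_row, disp_row):
            histograms[label][d] += 1
    # Highest peak per region; ties go to the smaller disparity (assumption).
    best = {
        label: max(hist, key=lambda d: (hist[d], -d))
        for label, hist in histograms.items()
    }
    return [[best[label] for label in label_row] for label_row in labels]
```

Using the peak rather than the mean keeps a few outlying mismatches from shifting the regional estimate, which is the motivation given in the text for step-interpolated S2 maps.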

We implemented this experiment using COLORSEG at
10 (SEG10) and 20 (SEG20) intensity levels as
previously shown in Figures 8 and 9. A refined
disparity map based on these segmentation regions and
the S1 and S2 disparity values was produced. Regions
that did not generate a consistent disparity estimate are
set to black. Figures 10 and 11 show the results of the
disparity improvement process for the SEG10 and SEG20
segmentations using the S1 disparity map. Figures 12
and 13 show the results of the disparity improvement
process for the S2 disparity map.

A visual comparison between the original disparity
maps and the refined maps is quite striking,
particularly in the case of S1. The
refinement process seems to greatly improve the
sharpness of the buildings, particularly using the
coarser SEG20 segmentation. In the case of S2 the
improvement  is  not  as  dramatic,  but  the  building 
boundaries  are  sharpened  and  several  of  the  building 
surfaces  are  made  more  homogeneous. 


3.0.2.  Multi-segmentation  disparity  refinement 

In  this  second  approach  to  refinement,  we  merge 
different  height  estimates,  given  different  intensity 
segmentations (SEG10, SEG20), and then merge the
results  across  the  different  segmentations.  We  refine 
the  disparity  estimate  for  each  pixel  by  locating  the 
intensity  region  to  which  it  belongs,  for  each  of  the 
image  segmentations.  This  list  of  regions  can  then  be 
searched  to  obtain  the  disparity  estimate  attribute 
(computed  for  a  given  disparity  map)  as  well  as  a 
confidence score for this estimate. The confidence
score  is  then  used  to  select  the  best  disparity  value, 
which  is  then  assigned  to  the  pixel.  Currently  a  simple 
decision  is  made  to  select  the  disparity  value  having 
the  highest  confidence  score. 
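
The per-pixel selection rule can be sketched as below. This is a minimal illustration, assuming each segmentation supplies a label image plus a table of per-region `(disparity, confidence)` pairs; the name `fuse_segmentations` and these data structures are hypothetical, and ties are resolved in favor of the earlier segmentation in the list.

```python
def fuse_segmentations(label_maps, region_stats):
    """Pick, for every pixel, the regional disparity with the highest confidence.

    label_maps:   list of 2-D label images, one per segmentation.
    region_stats: list of dicts, one per segmentation, mapping a region
                  label to a (disparity, confidence) pair.
    """
    rows, cols = len(label_maps[0]), len(label_maps[0][0])
    fused = []
    for r in range(rows):
        fused_row = []
        for c in range(cols):
            # One candidate per segmentation: the region containing this pixel.
            candidates = [
                stats[labels[r][c]]
                for labels, stats in zip(label_maps, region_stats)
            ]
            disparity, _ = max(candidates, key=lambda pair: pair[1])
            fused_row.append(disparity)
        fused.append(fused_row)
    return fused
```

Because the choice is made independently at each pixel, the output is effectively defined on the overlay of all segmentations, so a boundary present in any one segmentation can appear in the fused map.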

An  attempt  is  made  to  maximize  the  score  for  each 
pixel  in  the  entire  image.  This  is  done  by  selecting  a 
disparity  value  in  all  of  the  regions  resulting  from  the 
union  of  the  segmentations.  In  other  words,  the 
segmentations  were  merged  and  the  best  height  value 
was  selected  for  each  of  these  regions,  by  utilizing  the 
confidence  scores  computed  for  each  region.  The 
scoring method currently in use takes into account
information  about  the  nature  of  the  segmentation  used. 

In  particular,  higher  confidences  can  be  assigned  to 
sufficiently  large  regions  in  a  constrained  segmentation 
such as SEG10 than to the equivalent regions in SEG20.
Information  of  this  nature  must  be  incorporated  in  the 
confidence  function  for  each  segmentation  region. 

Figures 15 and 14 show the results of merging the
SEG10 and the SEG20 segmentations for the S2 and the
S1 disparity maps, respectively. Depending on the
confidence  scores  of  the  disparity  values  selected  for 
each  segmentation,  we  were  able  to  obtain  improved 
disparity  estimates  for  some  of  the  regions.  Comparing 
these  results  to  Figures  12  and  13,  disparity  maps 
obtained  with  the  simple  method,  we  observe  some  of 
the  failings  of  both  approaches.  The  initial 
segmentations,  in  some  cases,  are  under-segmented 
instead  of  over-segmented,  resulting  in  the  grouping  of 
regions  that  should  have  been  assigned  different  height 
estimates.  Another  factor  is  the  confidence  evaluation 
function  for  the  regions  of  the  segmentation,  which 
only  takes  simple  properties  of  the  disparity  histograms 
of each region into account.

Figure 12: S2 refined SEG10

3.0.3. Multi-Disparity Disparity Refinement

In  this  approach,  several  different  disparity  maps  are 
merged  using  a  single  segmentation,  looking  for 
consistent  areas  across  disparity  maps.  This  approach 
is  similar  to  the  simple  disparity  improvement 
approach,  except  that  we  now  attempt  to  select  the  best 
disparity  value  based  on  a  set  of  differing  confidence 
scores.  The  score  established  for  each  disparity  map  at 
each  pixel  should  be  dependent  on  the  stereo  matching 
algorithm  used  to  generate  the  map,  and  should  also 
take  into  account  the  nature  of  the  possible  mismatches 
resulting  from  each  stereo  matching  technique. 


Figure 13: S2 refined SEG20

The major problem with all of the refinement
approaches discussed in this paper is the development
of  a  reasonable  confidence  evaluation  function  for  each 
set  of  data.  Currently,  confidence  is  evaluated  by  a 
scoring  function  that  utilizes  the  standard  deviation  and 
the  disparity  range  of  the  histogram  for  each  region,  as 
well  as  the  size  of  the  region.  Ideally,  this  scoring 
function  would  also  take  into  account  the  nature  of  the 
disparity  map.  As  an  initial  experiment,  we  defined  a 
similar  scoring  function  for  each  disparity  map  and 
checked  for  disparity  consistency  across  segmentation 
regions. Figure 16 shows the result of this experiment
using the MACHINESEG segmentation previously shown
in Figure 7. Again, areas where disparity values differ
significantly between S1 and S2 are marked in black.
These areas often correspond to regions of occlusion
and shadow. Future work needs to integrate other
monocular cues, such as illumination and shadow
analysis, into this process [Irvin 89].

Figure 10: S1 refined SEG10

Figure 11: S1 refined SEG20

Figure 14: S1 refined with Merge SEG10 / SEG20

Figure 15: S2 refined with Merge SEG10 / SEG20

Figure 16: Merge S1 and S2 using MACHINESEG
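
The text names the ingredients of the confidence score (standard deviation, disparity range, and region size) but not its formula. The sketch below therefore uses an invented combination of those three quantities, purely for illustration, together with a consistency check between the S1 and S2 estimates for a region. The names `confidence` and `merge_region`, the `size_scale` and `tolerance` parameters, and the use of `None` for regions marked black are all assumptions.

```python
import statistics


def confidence(values, size_scale=20.0):
    """Hypothetical score: larger, tighter disparity histograms score higher."""
    n = len(values)
    if n == 0:
        return 0.0
    spread = statistics.pstdev(values)           # standard deviation
    value_range = max(values) - min(values)      # disparity range
    size_weight = n / (n + size_scale)           # grows with region size
    return size_weight / (1.0 + spread + value_range)


def merge_region(d1, c1, d2, c2, tolerance=1):
    """Merge the S1 and S2 estimates for one region.

    Returns None (region marked black) when the two estimates differ by
    more than `tolerance`; otherwise returns the more confident estimate.
    """
    if abs(d1 - d2) > tolerance:
        return None
    return d1 if c1 >= c2 else d2
```

A score that also depends on the type of disparity map, as the text suggests would be ideal, could be obtained by giving each stereo matcher its own `size_scale` or an additional weighting term.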

4.  Performance  Evaluation 

In this section we give a quantitative performance
evaluation of the improvement in disparity map
accuracy when compared to a ground-truth
three-dimensional segmentation of the scene. We also show
results for two additional complex urban scenes, CIVIL1
and GAO1. Figures 17 and 18 show the left intensity
image and reference disparity map for the CIVIL1 site.


Figures  19  and  20  show  the  COLORSEG  segmentations 
using  10  and  20  levels,  respectively.  Figures  21  and  22 
show  the  original  Si  and  S2  disparity  maps.  Figures  23 
and  24  show  the  result  of  multi-segmentation  disparity 
refinement using SEG10 and SEG20. A parallel set of
results for the GAO1 site is presented in Figures 25 to 32.
The error analysis methodology has been described in
detail in [Hsieh, et. al. 92].


4.1.  Global  Scene  Error 

Tables 1, 2, and 3 give detailed statistics for each of the
stereo matching systems, S1 and S2, and show the
improvement due to our fusion technique. These
statistics are based on the global error between the
reference disparity map and the disparity results
produced by the stereo systems and the refinement
process. As a global measure of accuracy we present
the average pixel disparity error for the entire scene
and the percentage of points having an estimate within
+/- one pixel disparity from the reference for the entire
scene. The use of +/- one pixel disparity reflects some
of the accuracy limitations in the reference disparity
map. These simple parameters give us an idea of the
magnitude of the errors in the scene, but do not give
much insight into their distribution. Other error
metrics such as min/max error are not very reliable
since they can be caused by single point errors that
may occur in either the calculated or reference
disparity map.

It is clear from these tables that the refinement process
quantitatively improved both the average error and the
percentage of points within +/- one pixel of the
reference disparity. Overall average pixel error on the
three scenes ranged from 1.59 pixels for S2 on GAO1
down to .02 pixels for S1 refined in CIVIL1. The
percentage improvement for refinement with respect to
the within +/- one pixel disparity metric was
consistently nearly 8-10% on all tests.

Figure 19: CIVIL1 COLORSEG 10          Figure 20: CIVIL1 COLORSEG 20

Figure 23: CIVIL1 S1 Refined Disparity

Figure 24: CIVIL1 S2 Refined Disparity

An interesting issue in interpreting the global statistics
is knowing the frequency distribution of the disparities
in the scene. Obviously error metrics based on small
populations  are  not  robust.  Figure  33  plots  the 
distribution  of  disparity  values  in  the  ground-truth 
reference  disparity  map  for  DC38008.  One  can  see  that 
the  majority  of  the  pixels  are  at,  or  close  to,  the  ground. 
Figure  34  shows  the  distribution  of  disparity  values  for 
each  of  the  stereo  systems  and  their  refinements. 
While  the  curves  coincide  quite  well,  there  are 
noticeable areas, particularly in the disparity range
between 5 and 15, where none of the stereo results
significantly approach the ground truth segmentation.
Data for the other scenes, GAO1 and CIVIL1, exhibit a
similar  pattern.  This  distribution  analysis  is  useful  for 
understanding  the  structure  of  the  scene,  and  in 
interpreting  the  meaning  of  the  statistics  presented  in 
the  following  section. 
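
The two global measures used in this section can be computed as in the following sketch. The text does not say whether the average error is signed or absolute; the absolute error is assumed here. The function name `global_error` and the nested-list image representation are illustrative.

```python
def global_error(estimate, reference, tolerance=1):
    """Return (mean absolute disparity error, % of pixels within tolerance).

    estimate, reference: equally sized, non-empty 2-D lists of disparities.
    """
    errors = [
        abs(e - r)
        for est_row, ref_row in zip(estimate, reference)
        for e, r in zip(est_row, ref_row)
    ]
    mean_error = sum(errors) / len(errors)
    within = sum(1 for err in errors if err <= tolerance)
    return mean_error, 100.0 * within / len(errors)
```

A single large outlier raises the mean noticeably while barely changing the percentage, which is one reason to report both numbers side by side.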


4.2.  Effects  of  Disparity  Jumps  and  Structures 

One  way  to  address  some  of  the  issues  that  are  hidden 
by  the  global  statistics  discussed  in  the  previous 
section  is  to  measure  the  influence  of  the  actual 
disparity  value  on  matching  accuracy  for  each  of  the 
methods.  That  is,  as  we  attempt  to  recover  larger  and 
larger  disparity  values,  does  our  matching  error 
increase?  Alternatively,  is  the  matching  error 
independent of the disparity range? The plots in
Figures 35, 37, and 39 show the average error in pixel
disparity at each disparity level for each of the test
examples. Each contains four curves showing the
results for S1, S2, and the refinement results for S1 and
S2. In general, these graphs indicate that the greater the
actual disparity, the more likely the various matching
algorithms will make a mistake. This is true for both
larger positive and negative disparity. These errors are
reflected in both a higher average error and a lower
percentage of points within +/- one pixel of the actual
disparity. However, as we observed in Figure 34, the
number of pixels at higher disparity is significantly
smaller, making the error estimates less robust than
those for disparities within +/- 2 of zero.

Figure 25: GAO1 Scene          Figure 26: GAO1 Disparity Reference

Figure 27: GAO1 COLORSEG 10
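
An error-versus-disparity curve of the kind discussed above can be tabulated by grouping pixels on their reference disparity. The sketch below is an illustration only: the name `error_by_disparity` is hypothetical and absolute error is assumed. It also returns the pixel count at each level, since the text notes that estimates based on few pixels are less robust.

```python
from collections import defaultdict


def error_by_disparity(estimate, reference):
    """Map each reference disparity level to (mean absolute error, pixel count)."""
    totals = defaultdict(lambda: [0.0, 0])   # level -> [error sum, pixel count]
    for est_row, ref_row in zip(estimate, reference):
        for e, r in zip(est_row, ref_row):
            totals[r][0] += abs(e - r)
            totals[r][1] += 1
    return {
        level: (error_sum / count, count)
        for level, (error_sum, count) in sorted(totals.items())
    }
```

Replacing the reference disparity with a building ID as the grouping key would give the per-building view used later in this section.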

In  areas  with  man-made  structures,  global  accuracy 
statistics  such  as  those  described  in  the  previous 
section  do  not  adequately  convey  the  quality  of  the 
stereo  matching  system  with  respect  to  the  buildings  in 
the  scene.  In  most  cases,  buildings  may  cover  only  a 
small  portion  of  the  scene,  and  the  background  terrain 
will  statistically  dominate  the  scene-wide  estimate  of 
disparity quality. Thus, we require a method that
allows buildings to be evaluated independently or as a
class of objects in the scene. Additionally, there are
several metrics that can be used to evaluate both the
disparity estimate and the quality of the depth
jumps [Hsieh, et. al. 92]. For the purpose of this paper
we show average disparity error in pixels for each of
the buildings in the three scenes. We have assigned
building IDs based upon a manually compiled
ground-truth disparity map so that taller buildings have larger
numeric IDs. Figures 36, 38, and 40 plot the average
error in pixel disparity against region IDs, sorted in
increasing order of disparity. From these graphs one
can see that there are several cases where the
refinement value for a particular building is worse than
the original stereo estimate. Given the good global
error improvements shown in Tables 1, 2, and 3, we
are currently looking at the specific effect of
refinement in these buildings. Without such
performance analysis tools it would be impossible to
locate and understand these problems.

Figure 28: GAO1 COLORSEG 20

Figure 29: GAO1 S1 Disparity

Figure 31: GAO1 S1 Refined Disparity


These statistics allow us to pinpoint problems at a
much finer grain of detail than can be accomplished
with global analysis. Thus we can identify specific
buildings in the scene and try to understand, at the
algorithmic level, whether there are specific situations
where matching could be improved. Once identified,
these improvements should have an overall positive
effect on the rest of the scene. The result, of course,
can be subjected to the same rigorous performance
analysis. Once we commit to working on complex
scenes, as opposed to synthetic controlled images, the
visual inspection of disparity results to discover small
variations in performance becomes very unsatisfactory,
except possibly at the earliest stages of
experimentation. Such manual inspection greatly limits
our ability to detect subtle conceptual bugs or
recognize possibilities for algorithmic improvement.
We can also perform systematic analysis across multiple
scenes. For example, in applying statistics that take
into account the disparity jump for individual
buildings, we can aggregate performance information
for all buildings across all scenes to achieve a larger
statistical sample.

Figure 30: GAO1 S2 Disparity

Figure 32: GAO1 S2 Refined Disparity

Global Error Estimate for Stereo Matching
Using Figure 3 as ground truth

Stereo    Min/Max     Average Error   % of points within    Ground Truth
Method    Disparity   in pixels       +-1 pixel disparity   Disparity Range
S1        -12/13      .11             61%                   -2/14
S2        -5/14       .30             75%                   -2/14
S1ref     -5/13       .08             72%                   -2/14
S2ref     -4/14       .03             83%                   -2/14

Table 1: Statistics for different stereo matching methods on DC38008


Global Error Estimate for Stereo Matching
Using Figure 18 as ground truth

Stereo    Min/Max     Average Error   % of points within    Ground Truth
Method    Disparity   in pixels       +-1 pixel disparity   Disparity Range
S1        -15/17      .16             50%                   -9/20
S2        -9/20       .38             58%                   -9/20
S1ref     -10/17      .02             59%                   -9/20
S2ref     -7/20       .40             68%                   -9/20

Table 2: Statistics for different stereo matching methods on CIVIL1


Global Error Estimate for Stereo Matching
Using Figure 26 as ground truth

Stereo    Min/Max     Average Error   % of points within    Ground Truth
Method    Disparity   in pixels       +-1 pixel disparity   Disparity Range
S1        -20/21      .16             35%                   -2/32
S2        -1/27       1.59            45%                   -2/32
S1ref     -20/21      .46             47%                   -2/32
S2ref     -1/27       .96             55%                   -2/32

Table 3: Statistics for different stereo matching methods on GAO1


5. Conclusions

We have described a set of fusion processes that allow
us to improve the quality of disparity maps, and we
have demonstrated the use of information fusion to
improve disparity map analysis. The major feature of
the information fusion technique described here is the
definition of a common frame for information fusion.
The representation framework (an intensity
segmentation) can be used in conjunction with
different types of intrinsic images. The approach
developed here treats homogeneous intensity regions
as surfaces, which allows three-dimensional
information to be extracted readily.

Figure 35: DC38: Average Disparity Error in Pixel Disparity

Many  research  issues  remain  to  be  explored.  The  new 
disparity  maps  generated  by  the  information  fusion 
process  contain  regions  which  each  have  only  one 
disparity  value.  In  many  cases,  these  unique  values  are 
not  the  best  possible  disparity  estimates  for  the  regions, 
and  a  refinement  process  may  need  to  be  invoked  to 
correct these estimates. One approach might be to use
the new disparity map itself as input to a verification
process which could refine disparity estimates for each
pixel or for those regions with low confidence scores.

Figure 36: DC38: Average Disparity Error by Building

Other  sources  of  information  could  be  utilized  at  the 
refinement  stage  to  further  enhance  the  disparity  map. 
One  promising  approach  would  be  the  use  of  left/right 
consistency,  such  as  left/right  matching  of  low 
confidence  regions  or  local  correlation  for  these 
regions.  Again,  it  would  be  important  to  use  as  much 
information  as  possible,  while  conservatively  adjusting 
or refining data based on its confidence scores. In the
ideal situation, no additional information would refine
the disparity estimates; it would merely verify the truth
of the disparity map.

Figure 37: CIVIL1: Average Disparity Error in Pixel Disparity

Figure 38: CIVIL1: Average Disparity Error by Building (x-axis: region ID, sorted by disparity)

Figure 39: GAO1: Average Disparity Error in Pixel Disparity (x-axis: disparity)


Figure 40: GAO1: Average Disparity Error by Building (x-axis: region ID, sorted by disparity)

Many improvements can be obtained by the use of
better segmentations and scoring functions, by
addressing the assumption that only flat horizontal
surfaces are responsible for the imaged radiometry, and
by using a more sophisticated surface model such as
non-horizontal planar surfaces or quadratic surfaces.
Finally, it seems feasible that multispectral data could
be integrated by similar techniques. The information
fusion approaches described here provide a means for
data integration that may prove useful in other aspects
of scene interpretation.


Abstract

In order for rule-based interpretation systems to be
practical it is important that there are tools available
to create, maintain, and evaluate the system's
knowledge base. This paper describes a suite of tools
for the SPAM image interpretation system that extends
our previous research in this area [McKeown, et. al.
89]. These tools implement methods for interactive
and automatic knowledge acquisition and knowledge-base
evaluation. In addition, recent interpretation
results from the SPAM system will be discussed.^

1.  Introduction 

SPAM  is  a  production  system  architecture  for  the 
interpretation  of  aerial  imagery  with  applications  to 
automated  cartography  and  digital  mapping 
[McKeown,  et  al.  85,  CSReview  89].  It  tests  the 
hypothesis  that  the  interpretation  of  aerial  imagery 
requires  substantial  knowledge  about  the  scene  under 
consideration.  Knowledge  about  the  type  of  scene, 
whether  airport,  suburban  housing  development,  or 
urban  city,  aids  in  low-level  and  intermediate  level 
image  analysis,  and  will  drive  high-level  interpretation 
by  constraining  search  for  plausible  consistent  scene 
models. SPAM has been applied in two task areas:
airport and suburban house scene analysis. In this
section we describe the SPAM architecture, then briefly
discuss  related  work  in  knowledge  acquisition  and 
knowledge-based  vision. 

1.1.  Background:  The  SPAM  Architecture 

As with many computer vision systems, SPAM attempts
to interpret the 2-dimensional image of a
3-dimensional scene. A typical input image is shown in
Figure 1. The particular goal of the SPAM system is to
interpret an image segmentation, composed of image
regions, as a collection of real-world objects. For
example, the output for the image in Figure 1 would be
a model of the airport scene, describing where the
runway, taxiways, terminal-building(s), etc., are
individually located. SPAM uses four basic types of
scene interpretation primitives: regions, fragments,
functional-areas, and models. SPAM performs scene
interpretation by transforming image regions into
scene fragment interpretations. It then aggregates
these fragments into consistent and compatible
collections called functional-areas. Finally, it selects
sets of functional-areas to form models of the scene.

^This research was sponsored by the Air Force Office of
Scientific Research under Contract AFOSR-89-0199. The views
and conclusions contained in this document are those of the authors
and should not be interpreted as representing the official policies,
either expressed or implied, of the Air Force Office of Scientific
Research, or of the United States Government.

As shown in Figure 2, each interpretation phase is
executed in the order given. SPAM drives from a local,
low-level set of interpretations to a more global,
high-level scene interpretation. There is a set of hard-wired
productions  for  each  phase  that  control  the  order  of 
rule  executions,  the  forking  of  processes,  and  other 
domain-independent  tasks.  However,  this  "bottom-up" 
organization  does  not  preclude  interactions  between 
phases.  For  example,  prediction  of  a  fragment 
interpretation  in  functional-area  (FA)  phase  will 
automatically  cause  SPAM  to  reenter  local-consistency 
check (LCC) phase for that fragment. Other forms of
top-down  activity  include  stereo  verification  to 
disambiguate  conflicting  hypotheses  in 
model-generation  (MODEL)  phase  and  to  perform  linear 
alignment  in  region-to-fragment  (RTF)  phase. 

The first phase of SPAM is called region-to-fragment
(RTF). This is a traditional heuristic classification
process, using knowledge about the classes of features
that occur in the scene to map a segmentation to a set
of interpretations. Local properties of the
segmentation, such as shape, texture, or height, are
used to decide which interpretations to generate.
Examples of the type of knowledge used in
region-to-fragment are that runways are typically 50 to
80 meters wide, or that houses are 8 to 10 meters high.
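
The range tests described above can be sketched as follows. This is a minimal illustration in Python, not SPAM's OPS5 implementation: the ShapeSchema class, the attribute names width_m and height_m, and the function region_to_fragments are assumptions made for the example. Only the two numeric ranges come from the text.

```python
from dataclasses import dataclass, field


@dataclass
class ShapeSchema:
    """Hypothetical RTF schema: an object class plus inclusive attribute ranges."""
    object_class: str
    ranges: dict = field(default_factory=dict)  # attribute -> (minimum, maximum)

    def accepts(self, region):
        # A region passes only if every constrained attribute is present and in range.
        return all(
            attr in region and low <= region[attr] <= high
            for attr, (low, high) in self.ranges.items()
        )


# The two ranges are the examples quoted in the text; attribute names are assumed.
SCHEMAS = [
    ShapeSchema("runway", {"width_m": (50.0, 80.0)}),
    ShapeSchema("house", {"height_m": (8.0, 10.0)}),
]


def region_to_fragments(region, schemas=SCHEMAS):
    """Return every class whose schema the region satisfies (possibly several)."""
    return [s.object_class for s in schemas if s.accepts(region)]


print(region_to_fragments({"width_m": 60.0, "height_m": 0.0}))
```

A region may satisfy several schemas at once, which is why the function returns a list; the later phases are responsible for resolving such competing interpretations.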

The  second  phase  of  SPAM  is  called  local-consistency 
Figure 2: Interpretation phases in SPAM.


check  (LCC).  This  phase  performs  a  modified 
constraint  satisfaction  between  the  interpretations 
generated  in  region-to-fragment.  The  knowledge  in 
this  phase  consists  of  the  constraints  between  the 
different  classes  of  objects.  An  example  constraint 
would  be  runways  have  perpendicular  taxiways.  There 
are  many  such  constraints  in  the  system,  with  each 
constraint  providing  weak  support  to  the  participating 
hypotheses.  Thus,  it  is  not  the  action  of  a  single 
constraint,  but  the  collective  action  of  several 
constraints  which  causes  two  interpretations  to  be 
found  consistent  with  one  another. 
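
The idea that consistency emerges from several weak constraints rather than one strong one can be sketched as a simple vote count. This is an illustrative assumption, not SPAM's actual "modified constraint satisfaction": the predicates, tolerances, record fields, and the min_votes threshold are all hypothetical.

```python
import math


def perpendicular(a, b, tol_deg=15.0):
    """Weak constraint: orientations differ by roughly 90 degrees (tolerance assumed)."""
    diff = abs(a["orientation_deg"] - b["orientation_deg"]) % 180.0
    return abs(diff - 90.0) <= tol_deg


def close(a, b, max_dist=200.0):
    """Weak constraint: centers lie within an assumed maximum distance."""
    return math.dist(a["center"], b["center"]) <= max_dist


# Each entry: (class of first hypothesis, class of second hypothesis, predicate).
CONSTRAINTS = [
    ("runway", "taxiway", perpendicular),
    ("runway", "taxiway", close),
]


def pairwise_support(hypotheses, constraints=CONSTRAINTS):
    """Count, for each ordered pair, how many applicable constraints are satisfied."""
    support = {}
    for a in hypotheses:
        for b in hypotheses:
            if a["id"] == b["id"]:
                continue
            votes = sum(
                1 for cls_a, cls_b, test in constraints
                if a["cls"] == cls_a and b["cls"] == cls_b and test(a, b)
            )
            if votes:
                support[(a["id"], b["id"])] = votes
    return support


def consistent_pairs(support, min_votes=2):
    """A pair is called consistent only when several constraints agree."""
    return {pair for pair, votes in support.items() if votes >= min_votes}
```

With this scheme a taxiway that is perpendicular to a runway but far from it collects one vote and is not accepted, whereas a nearby perpendicular taxiway collects two.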

The  functional-area  (FA)  phase  groups  together  those 
interpretations  that  support  one  another,  where  support 
is  computed  from  the  results  of  the  previous  phase.  A 
functional-area  is  defined  as  a  group  of  interpretations 
that  are  similar  in  function  and  are  in  close  physical 
proximity  to  one  another.  For  instance,  a  terminal 
functional-area is defined to contain only terminal-building,
parking-lot, road, and parking-apron
interpretations.  It  is  physically  represented  by  the 
convex  hull  of  the  features  in  the  functional-area. 
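
As a sketch of this representation, the following Python fragment filters fragments by the classes allowed in a terminal functional-area and computes the convex hull of their outline points with Andrew's monotone chain algorithm. The choice of hull algorithm, the fragment record layout, and the function names are assumptions; only the list of allowed classes comes from the text.

```python
# Allowed member classes for a terminal functional-area, as listed in the text.
ALLOWED = {
    "terminal": {"terminal-building", "parking-lot", "road", "parking-apron"},
}


def cross(o, a, b):
    """Z component of the cross product of vectors OA and OB."""
    return (a[0] - o[0]) * (b[1] - o[1]) - (a[1] - o[1]) * (b[0] - o[0])


def convex_hull(points):
    """Andrew's monotone chain; returns hull vertices in counter-clockwise order."""
    pts = sorted(set(points))
    if len(pts) <= 2:
        return pts
    lower, upper = [], []
    for p in pts:
        while len(lower) >= 2 and cross(lower[-2], lower[-1], p) <= 0:
            lower.pop()
        lower.append(p)
    for p in reversed(pts):
        while len(upper) >= 2 and cross(upper[-2], upper[-1], p) <= 0:
            upper.pop()
        upper.append(p)
    return lower[:-1] + upper[:-1]


def functional_area_hull(fa_type, fragments):
    """Hull of all outline points of fragments whose class is allowed in fa_type."""
    points = [
        pt
        for frag in fragments
        if frag["cls"] in ALLOWED[fa_type]
        for pt in frag["outline"]
    ]
    return convex_hull(points)
```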

Finally,  the  model-generation  (MODEL)  phase  uses  the 
functional-areas  and  combines  them  based  on  a  number 
of  heuristics,  including  number  of  conflicts,  number  of 
supported  interpretations,  and  area  of  coverage. 
Conflicts  are  identified  and  resolved,  either  using 
support  within  the  context  of  the  model,  or  by  invoking 
some  process  which  would  provide  additional 
knowledge.  For  instance,  if  in  the  context  of  a  model  a 
region  of  the  image  was  interpreted  as  both  a  taxiway 
and  a  hangar-building,  a  stereo  process  could  be 
invoked  to  help  resolve  the  conflict  based  on  the 
region’s  height  estimate.  Multiple  models  are 


commonly  generated  and  these  models  can  be  used  as 
contexts  for  further  processing. 

1.2.  Related  Work 

There  has  been  a  large  body  of  work  in  traditional 
knowledge-based  systems  to  support  knowledge 
acquisition  and  validation  [Gupta  91].  Typical  issues 
include  the  validation  of  knowledge-based  systems, 
checking  for  completeness  and  consistency  in 
knowledge  bases,  and  system  performance  evaluation. 
The  application  areas  are  broad,  spanning  medical 
diagnosis,  design,  process  control,  and  configuration 
analysis.  However,  little  work  in  this  area  has  been 
focused  within  the  context  of  computer  vision  tasks. 
One  explanation  is  that  there  are  few  end-to-end 
knowledge-based  systems  for  computer  vision  in  the 
literature.  A  corollary  is  that  those  knowledge-based 
systems  that  have  been  built  are  "one-person"  systems 
that  do  not  survive  much  past  the  thesis  defense.  Some 
of  the  earliest  "high-level"  vision  systems  applied  to 
aerial  image  analysis  include  Matsuyama's 
system  [Matsuyama  80],  the  Sigma  system  [Hwang 
84], and SPAM [McKeown, et al. 85]. More recently,
researchers have explored the use of generic
knowledge [Huertas, et al. 89] and investigated the
formalization  of  knowledge  to  support  high-level 
vision  [Strat  &  Smith  88]. 

Our previous work in this area [McKeown, et al. 89]
focused on the use of compilation tools to translate a
high-level schema-based constraint representation into
OPS5 productions. We also developed static analysis
tools to aid in the display and debugging of the model
descriptions generated by SPAM. As a result of the
long  term  use  of  this  system,  our  research  focus  has 


shifted  toward  issues  in  automating  portions  of  the 
knowledge  acquisition  process  and  in  the  development 
of  tools  to  aid  in  the  diagnostic  analysis  of  finer  levels 
of  system  behavior.  In  the  following  section  we 
discuss  recent  work  in  these  areas. 

This paper describes recent work with the SPAM image
interpretation  system.  Sections  2  and  3  discuss 
revisions  and  additions  to  our  knowledge  acquisition 
and  analysis  tools.  Experimental  results  from  each 
phase  of  processing  in  SPAM  are  presented  and 
discussed  in  Section  4.  Finally,  Section  5  discusses 
current  and  future  research. 

2. Issues in Knowledge Acquisition for SPAM

In our previous work in knowledge compilation, SPAM
was  separated  into  domain-dependent  and  independent 
parts  to  facilitate  generalization,  as  well  as  to  aid  in  our 
ability  to  maintain  the  system.  The  system  is  generated 
from  a  knowledge  representation  that  is  not  dependent 
on  the  implementation  language  (which,  in  the  current 
SPAM  system,  is  OPS5).  With  the  knowledge-base 
decoupled  from  OPS5,  it  is  easier  to  create  and  maintain 
the  large  knowledge-bases  needed  to  analyze  complex 
scenes. 

Tables  1  and  2  summarize  the  types  of  knowledge  used 
in  SPAM,  the  problems  encountered  in  acquiring  and 
using  this  knowledge,  and  the  tools  developed  to 
address  these  problems.  Several  tools  appear  to  be 
missing  (e.g.,  FACHK).  These  may  be  added  as 
necessary  as  the  development  of  this  line  of  research 
continues.  Each  of  the  currently  implemented  tools 
will  be  discussed  in  detail  in  the  following  sections. 

2.1.  Knowledge  Acquisition  of  Spatial  and 
Structural  Constraints 

RTFCHK,  a  knowledge  acquisition  tool  for  the  first 
phase  of  SPAM,  allows  the  user  to  interactively  study 
and modify schemas encoding SPAM's shape
knowledge. Correctness of individual rules is
determined by comparing their results against the
ground-truth interpretations of a scene. The results are
sorted by correctness and by pass/fail into four
displays. Figure 3 shows a confusion matrix generated
from  applying  a  runway  rule  to  data  from  Washington 
National  Airport.  The  matrix  display  shows  true 
positives  in  the  upper  left,  false  negatives  in  the  upper 
right,  false  positives  in  the  lower  left,  and  true 
negatives  in  the  lower  right.  A  rule  containing 
constraints  that  are  too  restrictive  will  result  in  many 
false  negatives,  while  loose  constraints  will  cause 
many  false  positives.  A  "perfect  rule"  will  have  no 
false  positives  or  negatives.  Upon  viewing  these 
results  the  user  may  edit  the  schema  and  test  it  again. 
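
The four-way sorting that RTFCHK displays can be sketched as below. This is an assumed reconstruction of the bookkeeping only; the region record layout (a "truth" label plus attributes) and the example width rule are hypothetical. The returned layout follows the text: true positives upper left, false negatives upper right, false positives lower left, true negatives lower right.

```python
def rule_confusion(rule, target_class, regions):
    """Return [[TP, FN], [FP, TN]] for one rule against ground-truth labels."""
    tp = fn = fp = tn = 0
    for region in regions:
        passed = rule(region)
        is_target = region["truth"] == target_class
        if is_target and passed:
            tp += 1
        elif is_target:
            fn += 1  # correct class rejected: bounds too restrictive
        elif passed:
            fp += 1  # wrong class accepted: bounds too loose
        else:
            tn += 1
    return [[tp, fn], [fp, tn]]


def runway_rule(region):
    # Hypothetical rule using the 50 to 80 meter width range quoted earlier.
    return 50.0 <= region["width_m"] <= 80.0


regions = [
    {"truth": "runway", "width_m": 60.0},
    {"truth": "runway", "width_m": 45.0},
    {"truth": "taxiway", "width_m": 55.0},
    {"truth": "road", "width_m": 10.0},
]
print(rule_confusion(runway_rule, "runway", regions))
```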

RTFCHK  was  employed  to  improve  a  set  of  RTF  schema 
that  had  been  generated  automatically  from  ground- 
truth  files.  The  result  was  a  set  of  RTF  schema  that 
exhibited  improved  performance  over  all  of  our  airport 
test  scenes  (see  discussion  in  Section  4.1).  It  is 


accepted  that  shape-classification  alone  cannot 
accurately  classify  complex  scenes,  but  the  RTF 
schemas  produced  using  this  tool  provide  better  sets  of 
interpretations for the later phases of SPAM.

A tool very similar to RTFCHK, called LCCCHK, aids the
user in the knowledge acquisition process for the
second phase of SPAM. For this phase, pairs of regions
are tested against one another for attributes like
closeness and perpendicularity. LCCCHK allows single
constraints to be applied to classes of objects, and the
results are compared against the ground-truth. Again,
the user is presented with a confusion matrix display
from which they may interactively inspect the results
and/or make modifications to the knowledge base.

2.2. Automatic Shape Constraint Generation

One way for a user to develop new RTF schema is to
allow  them  to  measure  many  example  regions.  As 
more  examples  are  viewed,  the  user  can  confidently 
assign  reasonable  values  for  the  shape-attribute 
constraints.  Given  that  we  have  generated  several 
ground-truth  datasets  containing  many  examples  of 
every  class,  this  process  can  be  automated. 

RTFANALYZE  is  a  tool  that  creates  new  RTF  schema 
when  given  one  or  more  scenes  with  ground-truth.  It 
statistically  analyzes  the  associated  ground-truth 
segmentation  to  produce  values,  by  class,  for  the 
various  shape  attributes.  Currently,  the  process  is 
exhaustive  in  that  every  attribute  is  computed  for  every 
class. 
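
A minimal sketch of this kind of exhaustive per-class analysis is given below. The text does not say which statistics RTFANALYZE keeps; recording the observed minimum, maximum, and average of each attribute is an assumption suggested by the fields shown in Figure 5, and the record layout and function name are hypothetical.

```python
from collections import defaultdict
from statistics import mean


def analyze_ground_truth(regions):
    """Compute min/max/average of every attribute for every ground-truth class."""
    observed = defaultdict(lambda: defaultdict(list))
    for region in regions:
        for attr, value in region["attributes"].items():
            observed[region["truth"]][attr].append(value)
    return {
        cls: {
            attr: {
                "minimum": min(values),
                "maximum": max(values),
                "average": mean(values),
            }
            for attr, values in attrs.items()
        }
        for cls, attrs in observed.items()
    }
```

The resulting bounds are only as good as the examples: a single atypical region widens the range for its class, which is one reason a manual adjustment pass can still help.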

Given  sufficient  examples,  RTFANALYZE  produces 
schema  that  perform  well  across  a  range  of  airports^. 
The  automatic  schema  generator  produces  effective 
rules  much  faster  than  a  user  can  interactively,  but 
often manual adjustments will improve a rule's
performance.  The  combination  of  automatic 

generation  followed  by  manual  adjustment  is  currently 
our  normal  mode  of  operation  for  producing  RTF 
schema. 

2.3.  Semi-Automatic  Structural  Constraint 
Generation  via  Functional  Ground  Truth 

Much of SPAM's effort in hypothesis refinement is
performed  in  the  LCC  phase.  Because  the  space  of 
possible  constraints  is  large,  manual  methods  of 
knowledge  acquisition  are  inadequate.  Automatic  or 
semi-automatic  methods  for  augmenting  LCC 
knowledge  are  not  only  useful,  but  they  become 
necessary  to  ensure  that  the  complexity  of  the  scene  to 
be  interpreted  is  captured. 

The  combinatorics  of  the  second  phase  make 


^Because  the  process  is  statistics  based,  this  qualifier  depends  on 
how  representative  the  examples  are  of  the  target  class.  Of  course, 
other  machine  learning  techniques,  such  as  decision  trees  or 
clustering,  can  be  used  to  help  solve  this  problem. 


Phase  Tool(s)        Kind(s) of         Difficulties Encountered
                      Knowledge          Using this Knowledge

RTF    RTFCHK,        Shape              Difficult to visualize using
       RTFANALYZE,                       attribute/value representation; Large
       FAANALYZE                         variations across scenes; Can require
                                         many examples; Shapes of object
                                         classes aren't unique.

LCC    LCCCHK,        Local geometric    Context dependent; Many weak
       FAANALYZE                         constraints required.

FA     FAANALYZE,     Grouping           Subjective functional definitions
       SPAMEVALUATE                      somewhat ambiguous; Very dependent
                                         on knowledge in LCC.

MODEL  SPAMEVALUATE   Global geometric,  Many knowledge sources (from
                      Conflict           different levels of processing) are
                                         required.

Table 1: Types of Knowledge Used in SPAM.


Tool           Phase(s)      Interaction       Problems Addressed

RTFCHK         RTF           Textual,          Allows graphical evaluation of shape
                             Graphical         constraints.

RTFANALYZE     RTF           Textual           Compiles shape constraints from
                                               examples.

LCCCHK         LCC           Textual,          Allows graphical evaluation of local
                             Graphical         geometric constraints.

FAANALYZE      RTF, LCC, FA  Graphical,        Generates shape and local geometric
                             Textual           constraints from examples using
                                               contexts.

SPATS          All           Textual           Provides statistical feedback on SPAM's
                                               performance.

SPAMEVALUATE   All           Graphical,        Allows interactive graphical and textual
                             Textual,          access to all phases of SPAM results.
                             Static graphical

Table 2: Summary of Tools for Knowledge Acquisition in SPAM.


exhaustive observation of example scenes a very
lengthy process, as every pair of regions must be tested
using every constraint across several data sets.
Furthermore, the LCC constraints would be
extrapolated based on the observed relationships
between arbitrary pairs of regions. This makes little
sense when regions are unrelated; for example, a
taxiway is usually perpendicular to an adjacent
runway, but has no consistent relationship to distant
runways.

In  order  to  reduce  the  combinatorics  and  to  be  more 
precise  about  what  context  is  used  in  the  analysis,  we 
decided to analyze only the structural relationships
between elements of each functional-area type. This
assumes that interactions between objects of different
functional-areas are weak. The schema produced in this
manner  make  more  sense  because  they  are  derived 
from  local  relationships.  Generating  LCC  schema 
automatically  requires  example  (ground-truth) 
functional-areas. 

FAANALYZE  is  a  tool  for  semi-automatically  acquiring 
and  analyzing  ground-truth  functional-areas.  It  has  an 
interactive  mode  that  allows  the  user  to  specify  a 
functional-area  by  clicking  on  its  constituent  regions. 
The  system  only  allows  the  user  to  add  regions  which 
belong  to  the  correct  functional-area  type.  For 
Figure 3: RTFCHK display after a single schema has fired on ground-truth segmentation from Washington
National Airport.


example, when entering a hangar functional-area, the
user may choose only hangar buildings, roads,
grassy-areas, and parking-aprons. Figure 4 is a snapshot of
the  display  as  a  user  adds  regions  to  an  example  hangar 
functional-area. 

Once ground-truth functional-areas have been created,
FAANALYZE  performs  a  statistical  analysis,  recording 
the  observed  values  of  the  LCC  constraints.  Choosing 
which  constraint  types  (distance,  orientation)  to  use  is  a 
difficult  problem,  and  is  currently  performed  by  the 
user.  We  are  investigating  the  use  of  observed 
geometric  distributions  to  decide  which  constraints 
perform  best.  As  discovered  with  the  RTF  phase,  a 


manual post-processing phase helps to improve the
quality  of  the  constraints  generated  automatically. 

2.4.  Acquiring  Relative  Shape  Constraints 

The  program  FAANALYZE  addresses  two  issues.  First, 
we  wanted  to  generate  LCC  schema  automatically  by 
observing  the  relationships  between  region-types 
within  functional-areas,  as  discussed  previously. 
Second,  we  wanted  to  explore  relative  shape 
constraints. 

We  define  relative  shape  attributes  as  the  ratio  between 
an attribute of a functional-area seed-region and the
same  attribute  for  an  element  of  the  functional-area. 
Figure 4: FAANALYZE: Selecting regions to build an example functional-area.


For  instance,  we  hypothesized  that  if  the  seed  region 
was  larger  than  average,  then  one  might  expect  the 
elements  around  it  to  be  larger,  as  well.  Several 
working-memory  elements  representing  relative  shape 
constraints  for  hangar-buildings  and  runways  are 
shown in Figure 5. Unlike the first phase of SPAM,
which  tests  the  absolute  shape-attributes  of  regions,  the 
relative  constraints  take  into  account  the  relative  size 
of  the  airport  or  subsets  of  it.  The  numbers,  therefore, 
are  unitless. 
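
A relative shape test of this kind might be sketched as follows. The direction of the ratio (seed value divided by element value) is an assumption; it is suggested, but not confirmed, by the runway-to-taxiway area constraint in Figure 5, whose minimum of 1 fits a seed runway at least as large as its taxiways. The function names and record layout are hypothetical.

```python
def relative_attribute(seed, element, attribute):
    """Unitless ratio of the seed region's attribute to the element's (assumed direction)."""
    if element[attribute] == 0:
        raise ValueError("element attribute is zero; ratio undefined")
    return seed[attribute] / element[attribute]


def satisfies(constraint, seed, element):
    """Check the ratio against the constraint's minimum and maximum."""
    ratio = relative_attribute(seed, element, constraint["attribute"])
    return constraint["minimum"] <= ratio <= constraint["maximum"]


# Bounds copied from the runway/taxiway area constraint in Figure 5.
RUNWAY_TAXIWAY_AREA = {
    "fa_context": "runway",
    "from_class": "runway",
    "to_class": "taxiway",
    "attribute": "area",
    "minimum": 1,
    "maximum": 157,
}

runway = {"area": 90000.0}
print(satisfies(RUNWAY_TAXIWAY_AREA, runway, {"area": 10000.0}))
```

Because only ratios are compared, the same constraint applies unchanged to a small regional airport and a large international one.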

Generating  relative  shape  constraints  and  LCC  schema 
automatically  requires  example  functional-areas.  Once 
sample  functional-areas  have  been  created, 
FAANALYZE  analyzes  them,  recording  the  observed 
values  of  both  the  relative  shape  constraints  and  the 
LCC  geometries.  The  output  is  in  the  form  of  working- 
memory  elements  and  LCC  schema.  These  are  used  to 
generate  a  new  SPAM  system. 

The  relative  shape  constraints  can  be  used  in  the 


following  three  ways: 

•  To  filter  incorrect  hypotheses  from  each 
functional-area; 

•  As  a  measure  of  confidence  for  entire  functional- 
areas;  if  the  ratios  are  outside  the  expected 
ranges,  then  either  the  seed  region  of  the 
functional-area  is  incorrect  or  many  constituents 
of  the  functional-area  are  wrong; 

•  To  filter  out  unlikely  hypotheses  before  the  LCC 
phase  fires,  reducing  the  number  of  competing 
hypotheses  in  the  LCC  phase. 

The  relative  shape  constraints  do,  in  fact,  filter  out 
hypotheses  from  functional-areas.  However,  we  found 
that  the  local-consistency  phase  also  filters  many  of  the 
same hypotheses. We investigated using the relative
shape-constraints  to  evaluate  functional-areas  as  a 
whole,  but  the  difference  between  a  correct  and 
incorrect  functional-area  was  found  to  be  difficult  to 
quantify.  This  leaves  only  the  third  use,  for  filtering 
(make fa-constraint ^fa-context hangar-building
    ^from-class hangar-building ^to-class road
    ^attribute area ^minimum 0 ^maximum 1818693 ^average 442)

(make fa-constraint ^fa-context hangar-building
    ^from-class hangar-building ^to-class tarmac
    ^attribute perimeter ^minimum 0 ^maximum 73371 ^average 728)

(make fa-constraint ^fa-context runway
    ^from-class runway ^to-class taxiway
    ^attribute area ^minimum 1 ^maximum 157 ^average 9)

(make fa-constraint ^fa-context runway
    ^from-class runway ^to-class taxiway
    ^attribute perimeter ^minimum 4 ^maximum 170 ^average 16)

Figure  5:  Example  relative  shape  constraints. 


out  hypotheses  before  the  LCC  phase  fires.  Because 
these  constraints  almost  duplicate  a  subset  of  the  LCC 
phase’s  results,  we  could  use  the  relative  shape- 
constraints  without  changing  the  final  results.  Since 
relative  shape-constraints  are  faster  than  the  LCC 
geometries, this should reduce SPAM's total running
time. 

3. Analysis and Evaluation Tools

Post-run analysis and evaluation provide important
feedback about the correctness of the knowledge-base
and ways in which it can be improved. Some of our
older tools, such as SPATS [McKeown, et al. 89],
produce statistical summaries of the performance of the
SPAM system. Static run analysis can provide good
summaries of system performance, allowing a user to
easily gauge the aggregate effects of modifying a
system's inputs. However, it can be difficult to reason
from this analysis about the specifics of why a result
came to be. For example, knowing that a certain
number of taxiways failed to pass a particular rule
provides no information as to which taxiways failed, or
why. While there is currently no automatic diagnostic
tool for a SPAM user to consult, it seemed most natural
to provide a graphical tool that would allow the user to
interact with the results from a SPAM run. Our
experience has found SPAMEVALUATE to be an
effective diagnostic tool.

3.1.  Interactive  Performance  Analysis 

SPAMEVALUATE provides an interactive medium for
browsing SPAM results and evaluating proposed
modifications. It generates run statistics that augment
those already generated by SPATS. It produces both
exhaustive output (dumps) and summary output. Dumps
are a human-readable reproduction of the results of each
phase; summaries are statistical synopses of the results.
For example, the dump of the RTF phase lists each region,
along with every RTF schema it passed. The summary,
on the other hand, is a chart showing the breakdown of
correct interpretations versus incorrect, allowing the
user to infer which interpretations tend to be mistaken
for one another. Statistics are available for the first 3
phases of SPAM; a sampling is shown in Figure 6.

Using the graphical mode of SPAMEVALUATE, the user
can select regions by clicking with a mouse and then
display very specific information culled from the SPAM
run. The user can jump from phase to phase, observing
how the results from each of the phases interact.
Figure 7 shows, in overlapping windows, the various
phases of results as the user displays each in turn.
Since  the  results  are  displayed  graphically,  the  user 
gets  a  more  intuitive  feel  for  the  meaning  of  the  results. 
For  example,  a  statistical  summary  might  show  that 
many  taxiways  are  failing,  but  the  graphical  display 
would  show  that  the  failing  taxiways  tend  to  be  longer 
than  average,  implying  that  the  problem  might  lie  in 
the  RTF  phase. 

Another feature of SPAMEVALUATE is the ability to
invoke schema of the first and second phases on
individual regions, similar to the functionality provided
by RTFCHK and LCCCHK. Since tracing through the
results  often  suggests  possible  deficiencies  in  the 
current  ruleset,  the  user  can  test  such  conclusions  by 
invoking  individual  schema  on  any  chosen  region.  The 
rule  fires  in  a  verbose  mode,  showing  which 
constraints  passed  or  failed,  and  by  how  much,  as  seen 
in  Figure  8. 
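
The verbose firing described above can be sketched as a per-constraint report with a signed margin. The margin definition (distance to the nearest bound, negative when the value falls outside) and the function name are assumptions; the two bounds and values in the example are taken from the grassy-area listing in Figure 8.

```python
def fire_verbose(ranges, region):
    """Return (passed, report); report holds (attribute, ok, margin) per constraint.

    margin is the distance to the nearest bound: positive inside the range,
    negative by the amount the value misses the range.
    """
    report = []
    for attr, (low, high) in ranges.items():
        value = region[attr]
        margin = min(value - low, high - value)
        report.append((attr, margin >= 0, margin))
    passed = all(ok for _, ok, _ in report)
    return passed, report


# Two of the grassy-area bounds and region values shown in Figure 8.
GRASSY_AREA = {"curvature": (0.0, 0.1930), "fractional-fill": (0.2540, 1.0)}
region_48 = {"curvature": 0.0540, "fractional-fill": 0.3269}

passed, report = fire_verbose(GRASSY_AREA, region_48)
for attr, ok, margin in report:
    print(f"{attr}: {'pass' if ok else 'FAIL'} (margin {margin:.4f})")
print("Region PASSED." if passed else "Region FAILED.")
```

Reporting the margin rather than a bare pass/fail makes it easier to see which bound to loosen when a correct region narrowly fails.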

In practice, SPAMEVALUATE has been invaluable in our
evaluation of the SPAM system. Sometimes a problem
is solved by adjusting a constraint, or by adding a new
schema to express a new piece of knowledge.
SPAMEVALUATE also facilitates access to information
necessary to study new methods of processing in the
(more complex) functional-area and model phases. It
is useful to be able to reason backwards from an
interpretation result containing an error to the specific
piece of knowledge causing that error. For instance,
we recently changed the FA phase to remove
competing hypotheses from within a functional-area,
resulting in unique labels for the constituent objects
(see Section 4.2.2). SPAMEVALUATE aided the
algorithm development by permitting us to
interactively trace back relationships from sets of
Generic Image: moffett1
Scenetype: airport

[RTF DUMP: interpretations generated by ground-truth region id]

moffett1_0: navigational-aid parking-apron parking-lot grassy-area
            tarmac maintenance-building taxiway terminal-building
moffett1_1: parking-lot taxiway road terminal-building
moffett1_2: runway

[RTF STATS: class confusion matrix]

G.T. INTERPRETATION
|   SPAM INTERPRETATION -->
V      RW   TW  MRD  ARD HHRD   CB  CHB   HB   MB   TB   CT   PA   PL   GA   TM  PMR  NVA  APJ
RW      2    0    0    0    0    0    0    0    0    0    0    0    0    0    0    0    0    0
TW      0   36    0   15    0    0    0   10   15   27    0   16   28   15   17    0    8    0
ARD     0    1    0    2    0    0    0    0    0    0    0    0    0    0    0    0    0    0
HB      0    5    0    1    0    1    0    9    7    7    0    6    7    7    7    0    6    0
MB      0    1    0    0    0    0    0    0    1    1    0    1    1    1    1    0    1    0
TB      0    0    0    0    0    0    0    0    0    0    0    0    0    0    0    0    0    0
PA      0    0    0    0    0    0    0    0    1    1    0    3    0    1    2    0    1    0
PL      0    3    0    1    0    0    0    0    2    2    0    2    3    2    2    0    0    0
GA      0    1    0    0    0    0    0    0    1    1    0    4    1   11    8    0    2    0

[LCC STATS: pairwise consistency table]

HYPOTHESIS-PAIR                (t-t)c  (t-t)i  (t-f)c  (t-f)i  (f-f)c  (f-f)i  Total#
taxiway--runway                   92%       -       -      1%       -      0%      64
road--road                         2%       -       -     12%       -     72%      50
hangar-buildi--road                4%       -       -     10%       -     63%     108
hangar-buildi--hangar-buildi      21%       -       -      2%       -     69%      92
terminal-buil--road                0%       -       -     10%       -     85%     203
parking-apron--hangar-buildi       7%       -       -     13%       -     67%     188

[FA DUMP: functional-area constituents by type]

FUNCAREA_77 (terminal)-- moffett1_1[TW]
    road: moffett1_32*
    parking-apr: moffett1_48[GA] moffett1_47[GA]
    parking-lot: moffett1_0[TW]

FUNCAREA_76 (road)-- moffett1_1[TW]
    grassy-area: moffett1_0[TW] moffett1_48* moffett1_47*

[FA STATS: comparison of generated interpretations to ground-truth by FA type]

runway
    runway-- 2 of 2 correct
    taxiway-- 21 of 21 correct
    grassy-area-- 11 of 11 correct
    tarmac-- 0 of 1 correct (1 GA)

hangar
    road-- 4 of 26 correct (18 TW) (4 PL)
    hangar-building-- 28 of 97 correct (68 TW) (1 TM)
    parking-apron-- 8 of 49 correct (25 TW) (8 HB) (7 PL) (1 GA)
    tarmac-- 0 of 21 correct (3 MB) (18 GA)

Figure 6: Selected statistics generated by SPAMEVALUATE.

functional-areas to LCC constraints between specific
pairs of objects.

4. Experiments and Results

One tangible result of an image interpretation system
should be the generation of a labeled set of
segmentations. For SPAM, this labeled set includes the
context in which the interpretations were decided upon
(the model(s) and their constituent functional-areas), as
well as the set of interpretations and related
consistency information. However, a task as difficult
and complex as aerial image interpretation allows
many opportunities to contribute to the body of
research knowledge, in addition to simply generating a
final result. Evaluation of the effectiveness of different
types of knowledge is an interesting as well as
important element of our research goals. In this
section, we present some experimental results from
each of the phases of processing in SPAM. We discuss
these results and evaluate the effects of the knowledge
acquisition tools presented in the previous section.

4.1. Improving Shape and Spatial Constraints

As seen in Section 2, many of the tools developed for
use with SPAM were created specifically for the
purpose of improving the results emerging from a
single processing phase. For example, the program
RTFANALYZE allowed us to automatically generate sets
of RTF schemas from statistics produced from our
airport ground-truth database. Because the RTF phase
RTF_REG_AGAINST_RULE for region 'moffett1_48' and rule 'grassy-area'
    curvature (Min: 0.0000, Max: 0.1930): 0.0540
    area (Min: 0.0000, Max: 228774.9700): 61407.9933
    perimeter (Min: 0.0000, Max: 2343.5700): 1543.0383
    fractional-fill (Min: 0.2540, Max: 1.0000): 0.3269
    compactness (Min: 0.0200, Max: 0.0740): 0.0258
    orientation (Min: 0.0000, Max: 3.7290): 2.0438
    ellipse-length (Min: 0.0000, Max: 832.0300): 534.2343
    ellipse-width (Min: 0.0000, Max: 343.7000): 126.4943
    ellipse-linearity (Min: 0.0000, Max: 5.3000): 4.2234
    mbr-length (Min: 0.0000, Max: 911.3100): 568.6322
    mbr-width (Min: 0.0000, Max: 440.4200): 325.5702
    mbr-linearity (Min: 0.0000, Max: 8.0000): 6.5886
Region PASSED.

Firing GAs-border-RWs
Firing between moffett1_2 and moffett1_48
    precond-- distance, least: [0.0000,100.0000] 0.4503 true.
    * distance, least:
        consistent: [0.0000,30.0000] 0.4503 true
        inconsistent: (75.0000,1000000000.0000) 0.4503 false
    CONCLUSION: consistent

Firing RWs-are-parallel-to-GAs
Firing between moffett1_2 and moffett1_48
    precond-- distance, least: [0.0000,100.0000] 0.4503 true.
    * orientation, parallel:
        consistent: [0.0000,0.5000] 0.0842 true
        inconsistent: (0.9000,1000000000.0000) 0.0842 false
    CONCLUSION: consistent

Figure 8: SPAMEVALUATE: Invoking RTF and LCC schema.
deals only with local properties of each segmentation, a
brute-force approach is tractable. These automatically
generated schema were found to improve the
classification of certain object classes (e.g., taxiways)
when compared with the hand-generated schema.
Figure 3 shows these differences. For example, the
number of true positives for taxiways increased from
11 to 28. However, the number of false positives also
increased from 11 to 46. This trend is true, in general,
across all object classes. The issue is that most of the
false positives will not be found to be consistent with
other hypotheses during SPAM's LCC phase. Therefore,
it is more important for a larger number of true
positives to proceed past RTF.
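
The trade-off in the taxiway counts can be made concrete with a short calculation. Precision is not a measure the text itself reports; it is used here only as an assumed, conventional way to summarize the counts quoted above.

```python
def precision(true_positives, false_positives):
    """Fraction of accepted hypotheses that are correct."""
    return true_positives / (true_positives + false_positives)


hand_generated = precision(11, 11)  # 11 true positives, 11 false positives
automatic = precision(28, 46)       # 28 true positives, 46 false positives

print(f"hand-generated: {hand_generated:.3f}, automatic: {automatic:.3f}")
print(f"additional correct taxiway hypotheses passed on: {28 - 11}")
```

Under this measure the automatically generated schema are less precise at the RTF stage, but they pass 17 more correct taxiway hypotheses to LCC, which is the quantity the argument above treats as more important.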

From evaluating the results of running SPAM on several
different airport scenes, it became clear that knowledge
in LCC was incomplete. This was most evident when
examining functional-area results, as it was easy to see
that certain object classes were incorrectly supporting
(or not supporting) one another. For the LCC phase,
quantitative result evaluation is more complicated
because constraints are not universally true. For
example, though distance constraints exist between
hangar-buildings and roads (as classes), there may be
no such relationship between a specific
hangar-building/road pair. A ground-truth database of
constraints to permit such an evaluation would be
tedious and time-consuming to generate (though not
impossible). In lieu of generating such a database, we
designed several experiments to assist us in evaluating
the effectiveness of the LCC rules. Two of these
experiments were:

•  Replacing  the  output  from  RTF  with  a  set  of 
perfect  interpretations; 

•  Supplementing the output from RTF with a set of
perfect  interpretations. 

The  set  of  perfect  interpretations  can  be  easily 
generated  from  the  ground-truth.  We  can  qualitatively 
evaluate  the  results  of  these  experiments  using  our 
knowledge  acquisition  tools  while  also  refining  the 
existing knowledge base. We can then quantitatively
evaluate the results of rerunning the SPAM system.

The  first  experiment  helped  to  detect  problems  with 
lack  of  support  between  correct  hypotheses  (false 
negatives).  Missing  support  would  indicate  either  a 
lack  of  constraints  or  incorrect  (too  narrow)  bounds  on 
the existing constraints. SPAMEVALUATE and LCCCHK
were  used  to  evaluate  proposed  constraints  as  well  as 
to  locate  and  fix  problems  with  existing  constraint 
bounds. 

The second experiment aided in identifying incorrect
hypothesis support (false positives). As stated
previously, SPAM assumes that consistency between
two interpretations is determined by the action of
multiple constraints. Therefore, it is expected that
there will be many constraints falsely satisfied, and that
they will be distributed randomly among the available
hypothesis classes. However, if a rule has a very
permissive bound, then many false positives occur, and
the constraint poorly discriminates one class from
another. Again, both SPAMEVALUATE and LCCCHK
were used to identify useless constraints and to tighten
bounds on permissive constraints.

In addition to manual modifications to the knowledge
base, we used FAANALYZE to semi-automatically
generate constraints for the LCC phase. FAANALYZE
generates all possible geometric constraints and
performs only simple pruning of the results. A
significant portion of these constraints provide very
little discrimination between consistent and
inconsistent hypotheses, though several missing
constraints were found. Therefore, creating LCC
schema is still a task for a knowledgeable user, but
FAANALYZE is still useful as an advisory tool,
supplying the range of expected values once a
geometric constraint has been selected.
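
One plausible reading of "supplying the range of expected values" is sketched below. The internals of FAANALYZE are not described here, so everything in this block is an assumption: the function names, the optional trimming of outliers, and the distance measurements (invented purely for illustration). The idea is to derive bounds for a chosen geometric constraint from measurements taken over ground-truth pairs, then test a hypothesized pair against those bounds.

```python
def constraint_bounds(samples, trim=0.0):
    """Return (low, high) bounds covering the observed measurements.

    `trim` (hypothetical parameter) drops that fraction of samples from
    each tail so that a few outliers do not make the bound too permissive.
    """
    if not samples:
        raise ValueError("need at least one measurement")
    if not 0.0 <= trim < 0.5:
        raise ValueError("trim must be in [0, 0.5)")
    values = sorted(samples)
    k = int(len(values) * trim)          # samples dropped per tail
    kept = values[k:len(values) - k]
    return kept[0], kept[-1]


def satisfies(value, bounds):
    """True if a measured value falls inside the constraint bounds."""
    low, high = bounds
    return low <= value <= high


# Invented hangar-building/road distances from ground-truth pairs.
distances = [12.0, 18.5, 25.0, 31.0, 40.5, 300.0]

loose = constraint_bounds(distances)            # covers every sample
tight = constraint_bounds(distances, trim=0.2)  # ignores one sample per tail
print(loose, tight)
print(satisfies(150.0, loose), satisfies(150.0, tight))
```

The untrimmed bound accepts a distance of 150 because of a single outlier, which is the kind of permissive bound that the second experiment above was designed to expose.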

4.2.  Functional  Groupings 

Viewing  the  results  of  FA  from  a  number  of  airports 
using  SPAMEVALUATE  revealed  several  problems,  the 
most  apparent  of  which  were: 

•  Large  functional-areas  —  The  generated 
functional-areas  covered  large  portions  of  the 
scene  and  included  many  regions. 

•  Multiple  interpretations  —  Within  a  functional- 
area,  a  given  region  could  have  multiple 
conflicting  interpretations. 

Discussion  of  and  solutions  to  these  problems  are 
reviewed  below. 

4.2.1.  Focusing  Functional-Areas 
The ideal functional-area should include only the seed
(or generating) region and those regions consistent
with the seed that surround it. SPAM requires small
functional-areas  for  several  reasons.  One  purpose  of 
the  functional-area  is  to  provide  an  object  whose  set  of 
constituents  are  relevant  to  one  another,  i.e.,  they  exert 
some  constraint  on  most  of  the  other  members  of  that 
set.  If  a  functional-area  is  too  large  then  bounds  on 
constraints  between  members  of  that  functional-area 
will  tend  to  be  less  precise.  Additionally,  a  large 
number  of  functional-area  elements  makes  the  process 
of  deciding  ambiguous  interpretations  computationally 
more  difficult  (see  Section  4.2.2). 

Using  knowledge  that  was  collected  before  the  creation 
of  our  knowledge  acquisition  tools,  the  functional- 
areas  generated  by  SPAM  were  too  large,  both  in  size 
and  in  number  of  elements.  Some  of  this  was  corrected 
as  a  result  of  using  these  tools  to  improve  both  RTF  and 
LCC  results.  However,  enough  problems  remained  that 
we  felt  it  necessary  to  develop  a  method  for  focusing 
the  application  of  constraints  to  local  regions  around 
our  current  area  of  interest. 

                    # Ground  -------- Auto --------  -------- Hand --------
Class                  Truth    TP    FN    FP    TN    TP    FN    FP    TN
runway                     4     3     1     0   163     0     4     0   163
taxiway                   39    28    11    46    82    11    28    11   117
road                       2     1     1    20   145     0     2     3   162
terminal-building          9     4     5    54   104     2     7    23   135
hangar-building           17    11     6    56    94     5    12    12   138
parking-apron              0     0     0    55   112     0     0    59   108
parking-lot                0     0     0    74    93     0     0    98    69
grassy-area               45    43     2    57    65    35    10    14   108
tarmac                     6     3     3    82    79     1     5    36   125

(TP, FN, FP, TN: true positives, false negatives, false positives, true negatives.)

Table 3: Comparison of RTF confusion matrices for hand versus auto generation for San Francisco.

A natural solution to this problem was to allow
preconditions in local-consistency rules in order to
force each rule to focus on smaller portions of the
scene. For example, we use a maximum distance test
to prune away regions that are being considered for
orientation tests. Because the functional-area is
computed from the local-consistency information, this
addition reduces the number of interpretations
considered and, therefore, the size of the
functional-area.
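
A minimal sketch of such a precondition follows. It is not SPAM's rule syntax: the region representation (a dictionary with a centroid and an axis orientation in degrees), the function names, and the thresholds are all assumptions made for illustration. The point is only that the distance test runs first, so distant regions are never passed to the orientation test and contribute no support either way.

```python
import math


def centroid_distance(a, b):
    """Euclidean distance between two region centroids."""
    return math.hypot(a["x"] - b["x"], a["y"] - b["y"])


def orientation_difference(a, b):
    """Smallest angle in degrees between two undirected axes."""
    d = abs(a["theta"] - b["theta"]) % 180.0
    return min(d, 180.0 - d)


def parallel_with_precondition(a, b, max_dist, max_angle):
    """Orientation constraint guarded by a maximum-distance precondition.

    Returns None when the precondition fails (the rule does not apply),
    otherwise True or False for the orientation test itself.
    """
    if centroid_distance(a, b) > max_dist:
        return None
    return orientation_difference(a, b) <= max_angle


runway = {"x": 0.0, "y": 0.0, "theta": 10.0}
near_taxiway = {"x": 100.0, "y": 50.0, "theta": 12.0}
far_region = {"x": 5000.0, "y": 0.0, "theta": 10.0}

print(parallel_with_precondition(runway, near_taxiway, 500.0, 5.0))
print(parallel_with_precondition(runway, far_region, 500.0, 5.0))
```

Returning a third value for "not applicable" keeps a pruned pair from being counted as either positive or negative support.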

Example  functional-areas,  before  and  after  addition  of 
preconditions,  are  shown  in  Figures  9  and  10, 
respectively.  The  seed  for  these  two  functional-areas  is 
the  same  (the  runway  region  highlighted  in  bold),  and 
the  interpretations  that  are  incorrect  or  ambiguous  are 
shown  in  grey.  The  change  in  size  is  obvious,  and  one 
can  see  that  not  only  is  the  new  functional-area  more 
focused,  but  it  contains  no  incorrect  or  ambiguous 
interpretations when compared to the ground-truth. As
expected, the preconditions are providing a context
around  each  interpretation  within  which  we  can  apply 
the  spatial  constraints. 

4.2.2.  Ambiguous  Interpretations 

Another  problem  we  addressed  was  the  resolution  of 
ambiguous  interpretations  within  a  functional-area. 
Previously,  we  would  delay  resolution  of  such  conflicts 
until the MODEL phase, making the model-generation
process  more  complicated.  Making  these  decisions  in 
FA  seems  more  intuitive,  as  functional-areas  emerging 
from  this  phase  are,  as  a  result,  internally  consistent. 

The context provided by the functional-area is used to
decide between two or more competing interpretations.
Again, the solution that seemed most natural was to
use more complicated LCC constraints. An example of
such a constraint is “tarmacs are close to and located at
the ends of taxiways.” These types of constraints could
be implemented as functions of the binary constraints
already used in LCC. This implies that a function
involving a count of the number of positive and
negative links could be an effective method of
determining which of several hypotheses is most
supported. This function should have the property that,
when evaluated over all the supporting hypotheses,
fewer multi-constraint successes receive a
higher confidence than many single-constraint
successes.

We  chose  a  function  that  totaled  all  positive  links  from 
unambiguous  hypotheses  (regions  with  a  single 
interpretation  in  the  context  of  the  functional-area)  for 
each  ambiguous  interpretation  (regions  with  multiple 
interpretations  within  the  context  of  the  functional- 
area).  Those  unambiguous  hypotheses  supporting  the 
conflict  with  more  than  a  single  satisfied  constraint 
receive  extra  weight  in  the  confidence  computation. 
The interpretation with the most weight remained in
the functional-area; any competing interpretations were
filtered out.
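
The filtering rule just described can be sketched as follows. The actual weighting used in SPAM is not given, so the `multi_weight` factor, the function names, and the example link counts are assumptions. Each competing interpretation of an ambiguous region is described by the number of satisfied constraints received from each unambiguous supporter.

```python
def support_score(link_counts, multi_weight=2.0):
    """Total positive links, with extra weight for multi-constraint support.

    `link_counts` holds, for each unambiguous supporting hypothesis, the
    number of constraints it satisfies with the candidate interpretation.
    `multi_weight` is a hypothetical bonus factor.
    """
    score = 0.0
    for satisfied in link_counts:
        if satisfied <= 0:
            continue                      # no positive link from this supporter
        weight = multi_weight if satisfied > 1 else 1.0
        score += satisfied * weight
    return score


def resolve(candidates, multi_weight=2.0):
    """Return the interpretation with the most weighted support."""
    return max(candidates,
               key=lambda name: support_score(candidates[name], multi_weight))


# Invented example: one region with two competing interpretations.
candidates = {
    "tarmac": [2, 3],                # two supporters, several constraints each
    "parking-apron": [1, 1, 1, 1],   # four supporters, one constraint each
}
print(resolve(candidates))
```

In this example the two multi-constraint supporters outweigh the four single-constraint ones, which is the property asked of the scoring function in the preceding paragraphs.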

Table 4 shows statistics compiled across all the filtered
functional-areas generated from a single run. The table
shows the four airport functional-area types, broken
down into their respective elements. Each element-type
is followed by four columns showing, in order, the
number of correct hypotheses of that type found in
functional-areas, the number incorrect, and then the
results of the filtering (number of correct/incorrect
hypotheses filtered). The numbers indicate that, in
most cases, the incorrect hypotheses are being filtered
out more often than correct ones. For instance, every
incorrect hypothesis of the runway functional-areas is
filtered, and only two correct hypotheses are removed.
This means that the filtering algorithm and the
consistency information it uses are generally working
well. A specific observation should be made with
regard to terminal-building functional-areas. From the
table it can be seen that parking-aprons within
terminal-building functional-areas seem to be filtered
more or less randomly. This is due in part to there
being such a small ratio of correct to incorrect
terminal-building functional-area seeds. This is the
fault of the consistency information for
terminal-building functional-areas, not of the filtering algorithm.


Figure 9: Functional-Area from Dulles
International.


Figure 10: New Functional Area after
Preconditions.


FA Type   Element Type           Correct   Incorrect   # Correct # Incorrect
                              Hypotheses  Hypotheses    Filtered    Filtered
runway    runway                       3           0           0           0
          taxiway                     24           1           0           1
          grassy-area                 20           6           2           6
          tarmac                       0          19           0          19
hangar    hangar-building             75         319           0           0
          road                        79          65           3          21
          parking-apron               11         219           1         117
          tarmac                      19         236           0         140
terminal  terminal                     2         105           0           2
          road                       102         104          10          29
          parking-apron               16         335          14         281
          parking-lot                 43         541           0          61
road      road                       128          47           0           0
          grassy-area                 94         126           0          29

Table 4: Dulles functional-area summary.

4.3. Results for Model Generation

The final phase of SPAM must produce a consistent
scene model by merging functional-areas while
resolving any ensuing conflicts. Multiple models
should be generated when competing models of the
same scene are radically different in layout. Model
generation  is  a  hard  problem,  conceptually  and 
combinatorially;  the  strictly  algorithmic  approach, 
generating  all  possible  scene  models,  can  be  shown  to 
be  NP-complete  (see  Section  4.3.2).  This  section 
outlines  three  different  methods  that  we  have  used  for 
generating  final  scene  models  from  functional-areas. 
These  are: 

1.  Generating  a  tiling  of  the  scene; 

2. Generating maximal cliques from a graph;

3.  Using  heuristic  search. 

We  are  experimenting  with  each  of  these  methods, 
though the heuristic search method currently shows the
most  promise. 

4.3.1.  Tiling 

One  of  our  approaches  to  producing  a  scene  model 
involves  generating  a  tiling  of  the  scene  in  question. 
Such  an  algorithm  tries  to  choose  those  functional- 
areas  that  maximize  the  amount  of  the  scene  that  is 
covered. This would be a trivial problem if it weren’t
for the fact that functional-areas overlap, and
interpretations for those regions in the areas of overlap
can conflict. These conflicts must either be avoided, or
resolved intelligently, as they determine the overall
goodness of the generated model. Therefore,
constraints are applied which allow the system to
intelligently choose which functional-areas can coexist.
The  algorithm  works  as  follows.  Initially,  the  model  is 
empty.  A  model  seed  (an  initial  functional-area)  is 
chosen,  either  automatically  or  by  the  user,  and 
consecutive  functional-areas  are  added  to  this  seed 
based on maximizing the number of new regions,
minimizing the number of hypothesis conflicts, and
maximizing  support  from  existing  functional-areas. 
The  process  terminates  when  a  given  percentage  of  the 
input  regions  have  been  used  in  the  model. 
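
The procedure above can be sketched as a greedy loop. This is an illustration under stated assumptions, not SPAM's implementation: a functional-area is represented as a mapping from region identifier to label, the three criteria are combined into one hypothetical score (new regions minus conflicts plus half a point per agreeing region), and conflicts are left unresolved by keeping the label already in the model.

```python
def tile_scene(functional_areas, seed, all_regions, coverage=0.8):
    """Greedily add functional-areas to a seed until coverage is reached.

    functional_areas: {name: {region_id: label}}
    Returns (names chosen in order, model as {region_id: label}).
    """
    model = dict(functional_areas[seed])
    chosen = [seed]
    remaining = set(functional_areas) - {seed}

    def gain(name):
        fa = functional_areas[name]
        new = sum(1 for r in fa if r not in model)
        conflicts = sum(1 for r, lab in fa.items()
                        if r in model and model[r] != lab)
        support = sum(1 for r, lab in fa.items() if model.get(r) == lab)
        return new - conflicts + 0.5 * support   # hypothetical weighting

    while remaining and len(model) / len(all_regions) < coverage:
        best = max(sorted(remaining), key=gain)
        remaining.discard(best)
        if gain(best) <= 0:
            break                                 # nothing useful left to add
        for region, label in functional_areas[best].items():
            model.setdefault(region, label)       # keep existing label on conflict
        chosen.append(best)
    return chosen, model


# Invented example with six regions and three functional-areas.
fas = {
    "runway-1": {1: "runway", 2: "taxiway", 3: "grassy-area"},
    "hangar-1": {2: "taxiway", 4: "hangar-building", 5: "road"},
    "terminal-1": {3: "tarmac", 6: "terminal-building"},
}
chosen, model = tile_scene(fas, "runway-1", {1, 2, 3, 4, 5, 6})
print(chosen, model)
```

Because only one functional-area is added per step and no choice is revisited, the result depends heavily on the seed, which matches the limitation noted in the next paragraph.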

This  procedure  efficiently  generates  a  single  model 
which  is  very  dependent  on  the  quality  of  the  initial 
seed  functional-area.  We  believe  this  dependence 
could  be  lessened  by  creating  better  functional-area 
constraints. Further development would emphasize
embedding this algorithm in a framework that can
generate and rank multiple models.

4.3.2.  Maximal  Cliques 

Another  approach  to  model  generation  views  each 
functional-area  as  a  node  in  a  graph,  with  arcs  existing 
based  on  the  compatibility  of  two  functional-areas. 
Compatibility  can  be  computed  by  applying 
constraints,  such  as  those  used  in  Section  4.3.1,  to 
functional-area  pairs.  Once  the  arcs  have  been 
computed, models can be extracted from the graph by
searching for maximal cliques, or by looking for
cliques  of  a  particular  size.  Algorithms  for  these 
problems  exist,  but  their  worst-case  performance  is 
known  to  be  exponential  in  the  number  of  inputs.  In 
this  case,  the  number  of  models  generated  can  be 
exponential  in  the  number  of  functional-areas.  For  our 
experiments,  only  the  smallest  sets  of  functional-area 
results  could  be  used  to  generate  models  (no  more  than 
40  or  50  functional-areas).  Modifications  to  this 
method,  such  as  adding  domain  knowledge  to  constrain 
the  search  for  cliques,  or  reducing  the  numbers  of 
nodes  in  the  graph  by  merging  functional-areas  of  the 
same  type,  could  help  make  the  process  tractable. 
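
The text does not say which clique algorithm was used. As an illustration only, the sketch below enumerates maximal cliques with the basic Bron-Kerbosch recursion (no pivoting), a standard technique whose worst-case running time is exponential, as noted above. The compatibility graph is invented, and each node stands for one functional-area.

```python
def maximal_cliques(adj):
    """Yield every maximal clique of an undirected graph.

    adj: {node: set of neighbouring nodes}; an arc means the two
    functional-areas are compatible. Basic Bron-Kerbosch recursion.
    """
    def expand(r, p, x):
        if not p and not x:
            yield frozenset(r)            # r cannot be extended: maximal
            return
        for v in list(p):
            yield from expand(r | {v}, p & adj[v], x & adj[v])
            p = p - {v}                   # v has been fully explored
            x = x | {v}

    yield from expand(set(), set(adj), set())


# Invented compatibility graph over four functional-areas.
adj = {
    "A": {"B", "C"},
    "B": {"A", "C"},
    "C": {"A", "B", "D"},
    "D": {"C"},
}
for clique in maximal_cliques(adj):
    print(sorted(clique))
```

Each clique is a set of mutually compatible functional-areas and therefore a candidate scene model. Here the candidates are {A, B, C} and {C, D}.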

4.3.3.  Heuristic  Search 

Our  most  recent  research  in  model  generation  uses 
heuristics  to  constrain  which  sets  of  functional-areas 
can  coexist  in  the  same  model.  A  best-first  search  is 
performed  in  the  space  of  possible  functional-area 
groups.  Conflicts  between  competing  hypotheses  from 
different  functional-areas  are  enumerated,  but  are 
currently  not  resolved. 

The  key  to  directing  the  search  for  models  efficiently 
and  intelligently  is  the  set  of  heuristics  used.  No 
attempt  is  made  to  rank  or  order  the  heuristics. 
Instead, all heuristics contribute equally to the
functional-area’s score, some in a positive manner and
others negatively. The score improves if the
functional-area adds new regions and/or support to the
current model, is very compact, or covers a lot of area.
The score is reduced if conflicts are added to the
model. The score is normalized according to the
number of regions in the functional-area in an attempt
to allow functional-areas of varying sizes to participate
equally in model-generation. Other heuristics include
the density of the entire model, and the density of the
conflicts. A tight group of conflicts is believed to be
better than one with conflicts spread out all over the
scene.
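
A reduced sketch of such a search is given below. It is not the authors' implementation: only three of the heuristics are included (new regions, support, and conflicts, normalized by functional-area size), compactness and density are omitted, and the data structures and names are assumptions. Conflicts are counted but not resolved, as in the text. A group that no remaining functional-area can improve is reported as a model.

```python
import heapq
import itertools


def fa_score(fa, model):
    """Equal-weight heuristics, normalized by functional-area size."""
    new = sum(1 for r in fa if r not in model)
    support = sum(1 for r, lab in fa.items() if model.get(r) == lab)
    conflicts = sum(1 for r, lab in fa.items()
                    if r in model and model[r] != lab)
    return (new + support - conflicts) / len(fa)


def best_first_models(fas, max_models=2):
    """Best-first search over groups of functional-areas.

    fas: {name: {region_id: label}}. Returns a list of (score, names).
    """
    tie = itertools.count()               # tie-breaker for equal scores
    frontier, seen, models = [], set(), []
    for name in sorted(fas):
        group = frozenset([name])
        seen.add(group)
        heapq.heappush(frontier, (0.0, next(tie), group, dict(fas[name])))
    while frontier and len(models) < max_models:
        neg_total, _, group, model = heapq.heappop(frontier)
        expanded = False
        for name in sorted(set(fas) - group):
            gain = fa_score(fas[name], model)
            if gain <= 0:
                continue                  # this functional-area does not help
            expanded = True
            child = group | {name}
            if child in seen:
                continue
            seen.add(child)
            merged = dict(model)
            for region, label in fas[name].items():
                merged.setdefault(region, label)   # conflicts left unresolved
            heapq.heappush(frontier,
                           (neg_total - gain, next(tie), child, merged))
        if not expanded:
            models.append((-neg_total, sorted(group)))
    return models


fas = {
    "runway-1": {1: "runway", 2: "taxiway", 3: "grassy-area"},
    "hangar-1": {2: "taxiway", 4: "hangar-building", 5: "road"},
    "terminal-1": {3: "tarmac", 6: "terminal-building"},
}
models = best_first_models(fas)
for score, names in models:
    print(round(score, 2), names)
```

Returning several groups rather than one is what allows a later, separate ranking step over the generated models.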

All the model heuristics discussed thus far are
domain-independent. There are instances where one would like
to  bring  domain-dependent  constraints  to  bear,  such  as 
constraints  between  functional-areas,  much  like  the 
LCC  phase  evaluates  consistency  between  pairs  of 
regions.  Examples  of  such  constraints  are 
terminal-buildings  are  centrally  located  with  respect  to 
the  runways  (in  fact,  there  are  functional  reasons  for 
the  control-tower  or  terminal-building  area  to  be  more 
or  less  equidistant  from  each  runway)  and 
hangar-buildings  are  not  centrally  located.  Our 
experiments  with  these  constraints  indicate  that  they 
are  not  good  predictors  and  so  cannot  be  used  to 
intelligently  generate  and/or  fix  models,  though  they 
can  be  used  to  evaluate  existing  models.  Therefore, 
our  current  scheme  generates  multiple  models  (without 
using  scene  dependent  knowledge),  then  allows 
domain-dependent  constraints  to  be  used  to  rank  the 
models  produced.  Further  exploration  of  these  types  of 
constraints  is  material  for  future  work. 
Figure 11: Model of Moffett AFB.


Figure  12:  Functional-Areas  in  the  Model. 


5.  Current  and  Future  Work 

SPAM  supports  a  variety  of  research  activity  within  the 
context  of  image  understanding.  It  is  a  research 
vehicle  for  our  knowledge  acquisition  work, 
experience  with  systems  integration,  and  further 
research  in  task-level  parallelism  [Harvey,  et.  al. 
90,  Harvey,  et.  al.  91].  As  such,  there  are  many 
promising  areas  of  future  work.  Several  of  these  are 
outlined  below. 

As  with  any  system,  testing  on  many  cases  aids  in 
finding  implicit  assumptions.  More  airport  examples 
would not only help in refining SPAM’s knowledge
base,  but  this  would  allow  exploration  of  categories 
within  the  airport  domain.  For  example,  it  may  not  be 
as  effective  to  have  a  "general"  airport  knowledge  base 
as  it  would  to  have  a  separate  knowledge  base  for  each 
one  of  military,  international,  and  regional  airports. 

In addition to other airports, different segmentation
methods will likely provide SPAM with new challenges.
We have, at various times, experimented with using
segmentations obtained from automatic methods, such
as some of the feature extraction systems being
developed in our group [McKeown 88, Irvin
89, Shufelt & McKeown 90, Hsieh, et. al. 92]. Issues
include  how  much  of  the  knowledge  has  to  be  tailored 
according  to  the  source  of  the  segmentation,  as  well  as 
what  effects  are  produced  by  errors  in  the 
segmentation.  Such  feature  extraction  systems  could 
also  be  used  in  the  later  phases  of  SPAM  as  domain 
experts,  assisting  in  the  disambiguation  of  conflicting 
hypotheses. 

Expanding  some  of  our  work  in  automatic  constraint 
generation  for  SPAM,  such  as  was  done  with 
RTFANALYZE  and  FAANALYZE,  is  another  promising 
area  of  future  research.  Applying  clustering  techniques 
may  allow  the  system  to  automatically  filter  out 
constraints  that  result  in  little  or  no  discrimination 
among  competing  hypotheses.  One  could  augment 
such  a  system  with  interactive  capabilities  so  that  it 
could  suggest  rules  to  a  user  and  allow  that  user  to 
reject  or  refine  them. 

In addition to improving the quality of the
functional-areas produced by SPAM, we need to intelligently
evaluate the coherence of each functional-area, as well
as to compare them to one another. One method for
doing  this  is  to  examine  the  support  for  each 
hypothesis  and  compare  it  to  a  "randomly"  supported 
hypothesis.  If  this  or  other  methods  are  found  to  be 
effective,  functional-areas  can  be  ranked,  unpromising 
functional-areas  can  be  excised,  and  model-generation 
can  proceed  based  on  this  ranking. 

Finally, our work with the new methods for
model-generation is still somewhat preliminary. In addition to
more  experimentation,  developing  a  schema-based 
abstraction  of  the  types  of  knowledge  required  for 
effective  model  generation  would  aid  in  the 
development  of  knowledge  acquisition  tools  for  this 
process. 

6.  Conclusions 

Our  ongoing  research  in  the  automated  analysis  of 
complex  aerial  imagery  relies  heavily  on  our  ability  to 
provide  spatial  and  structural  constraints,  encoded  as 
rules,  to  SPAM,  our  knowledge  based  interpretation 
system. 

Large-scale  knowledge-based  vision  systems  require 
specialized  tools  for  both  knowledge  acquisition  and 
result  evaluation.  As  we  have  seen,  this  research  has 
focused  on  the  interactive  acquisition  of  spatial  and 
functional knowledge as well as on fully automated
techniques.  We  have  shown  preliminary  results  based 
on  our  experiments  with  airport  scenes  and  also 
demonstrated  the  importance  of  manually  compiled 
ground  truth  scene  segmentations  to  support  rigorous 
performance  analysis  tools.  The  ability  to  generate 
accurate  evaluations  of  system  performance  guides  our 
research  along  paths  that  are  likely  to  prove  most 
productive. 

Further  research  is  needed  in  basic  computer  vision 
and  image  understanding,  the  architecture  of 
knowledge-based  systems,  and  their  integration  with 
spatial  databases.  We  believe  that  our  research 
addresses  some  of  the  most  important  issues  in  these 
areas  and  that  we  are  progressing  toward  the 
development  of  competent  automated  scene  analysis 
systems. 
Abstract

The task of shape recovery from a motion
sequence requires the establishment of
correspondence between image points. The two
processes, the matching process and the shape
recovery one, are traditionally viewed as
independent. Yet, information obtained during the
process of shape recovery can be used to guide
the matching process. In this paper we review
the constraints imposed on the correspondence
by rigid transformations and extend them to
objects that undergo general affine (non-rigid)
transformation (including stretch and shear),
as well as to rigid objects with smooth
surfaces. In all these cases corresponding points
lie along epipolar lines, and these lines can be
recovered from a small set of corresponding
points. We discuss the potential use of epipolar
lines in the matching process, and present
an algorithm that recovers the correspondence
from three contour images. The algorithm
was implemented and used to construct object
models for recognition.

1  Introduction 

Correspondence is a process of relating information in
one image to its equivalent in others. Using
correspondence a vision system can infer the 3-D structure of the
observed scene, an inference that is significantly more
difficult to make from a single 2-D image. The
establishment of correspondence is itself a difficult task.
Various methods were developed in recent years to achieve
correspondence for stereo vision [Baker & Binford 1981,
Marr & Poggio 1979, Grimson 1980], motion analysis
[Koenderink & Van Doorn 1975, Ullman 1979,
Longuet-Higgins 1981, Hildreth 1984], and object recognition
[Fischler & Bolles 1981, Grimson & Lozano-Pérez 1984,
Lowe 1987, Huttenlocher & Ullman 1987].

In motion analysis the stage of establishing
correspondence is usually viewed as independent of the stage of
shape recovery [Ullman 1978]. According to this view,
the correspondence is determined so as to minimise the
observed 2-D motion along the image sequence. No
assumptions are made at this stage with respect to the
shape of the moving objects or to the transformations
they undergo. In this way correspondence can be found
even for non-rigid objects and when the images contain
a number of objects moving differently.

The  distinction  between  the  two  processes  of  corre¬ 
spondence  and  shape  recovery  is  useful  when  the  motion 
between  the  frames  is  relatively  small,  in  which  case 
a  minimisation  process  can  resolve  the  correspondence 
correctly.  When,  however,  “long  range  motion”  is  con¬ 
sidered,  minimisation  techniques  often  fail  to  find  the 
correct  correspondence.  Information  about  the  transfor¬ 
mation  may  be  used  in  these  cases  to  guide  the  process 
of  establishing  correspondence. 

An  important  application  that  requires  correspon¬ 
dence  under  “long  range  motion”  conditions  is  the  con¬ 
struction  of  3-D  representations  for  object  recognition. 
In  this  process  shape  information  is  accumulated  over 
time  until  a  complete  model  is  constructed  for  the  ob¬ 
ject.  During  this  period  the  object  may  be  observed  in 
positions that significantly differ from one another. Yet
the process is expected to tolerate such differences.

The process of model construction is not only useful
for building object-centered models, but also for
constructing viewer-centered ones. A recognition scheme
that uses viewer-centered representations was recently
developed [Ullman & Basri 1991]. In this method an
object is represented by a small number of its 2-D images
together with the correspondence between the images.
The appearance of an object from different viewpoints
is predicted by the linear combinations of its model
images. These predictions are exact for rigid objects. In
both object-centered as well as viewer-centered cases the
representations obtained are often more stable as the
images used to construct the models are taken from viewing
angles that are relatively distant from one another.

One assumption that is generally used in different
vision applications such as motion and object recognition
is that the objects observed are rigid. Huang & Lee
[1990] have recently addressed the question of how
rigidity affects the solution to the correspondence problem.
They showed that under an orthographic projection the
correspondence of points can be determined only up
to straight lines (known as “the epipolar lines of the
points”), and that four corresponding points determine
the position of these lines. They did not specify any
method to resolve the correspondence within the lines.

The epipolar line idea is not new. It is extensively used
in stereopsis, but rarely used in establishing
correspondence in motion analysis. Bolles & Baker [1985] used
epipolar lines to analyze motion sequences obtained by a
translation along a straight line. Yachida [1986] and
Ayache & Lustman [1987] used it in developing their
trinocular stereovision algorithm. In this paper we examine the
use of the epipolar lines in establishing correspondence
for depth reconstruction. In the first part of this paper
(Section 2) we review the theory behind the epipolar
lines and the way to compute them from a small
number of corresponding points. The formulation we use is
somewhat different from that presented by Huang & Lee
[1990], and we analyze the similarity and the differences
between orthographic and perspective projection
models. We show that epipolar lines exist even in more
complicated situations, such as when an object undergoes
a general linear transformation (including stretch and
shear), and when objects with smooth surfaces are
considered. In the second part of this paper (Section 3) we
show that the correspondence is not determined uniquely
even when three or more images are given. Additional
images can, however, be used to heuristically resolve
the correspondence [Yachida 1986]. Together with
additional constraints such a heuristic can, in practice, solve
the correspondence problem. We have implemented this
algorithm and the results were used in constructing
object models for recognition.

2  Correspondence  from  Two  Images 

The correspondence problem discussed below is defined
as follows. Given a pair of 2-D images, for every point in
space that is projected to both images find its location
in the two images. Often only feature points (such as
contour points) are considered. We examine this
problem assuming the images differ by a rigid
transformation. We consider two projection models, orthographic
projection (with a uniform scale factor to compensate
for depth changes) and perspective projection. We begin
our discussion by introducing general properties for both
projection models, and later prove these properties for
each of the models separately. Finally, we extend these
properties to more complicated cases, such as objects
that undergo general affine transformations (rather than
rigid ones) and objects with smooth surface boundaries.

Our analysis consists of three steps:

1. We show that rigidity divides the images into sets of
epipolar lines. The correspondence between these lines
is determined by the transformation that separates the
images, but the correspondence of points along the lines
cannot be determined.

2.  The  epipolar  lines  can  be  recovered  from  a  small  set 
of  corresponding  points,  four  in  the  orthographic 
case  and  seven  in  the  perspective  case. 

3. These results apply also to objects that undergo
general affine transformation and to objects with
smooth surfaces.

Proposition 1 establishes that in a pair of images
related by a rigid transformation a point in one image can
potentially match in the second image any point that lies
along a straight line (which is referred to as “the epipolar
line of that point”).

Let $P_1$ and $P_2$ be two projections (either orthographic
or perspective) of a rigid object from two given
viewpoints. Let $(x, y)$ be the projection of some object point
in $P_1$.

Proposition 1: The corresponding point to $(x, y)$ in
$P_2$ lies along a straight line given by:

$$(x', y') = u + a(z)\,v$$

where $u, v \in \mathbb{R}^2$ are constants (namely, independent of
$z$), and $a$ is a scalar function of $z$.

Following Proposition 1, given the transformation
that relates the two images, the correspondence is
determined up to a straight line. The vectors $u$ and $v$ are
determined both by the transformation and by the 2-D
position of the point $(x, y)$, while $a$ is the only
component that depends on $z$, the depth value of the point
in 3-D. There is a one-to-one mapping between the
position of $p$ along the epipolar line in $P_2$ and its depth
value. Every different depth value corresponds to a
different location of $p$ along the epipolar line, and every
different location along the epipolar line determines a
different depth value.

In some cases point correspondence can be uniquely determined. This occurs in the degenerate case when $\mathbf{v} = 0$. In this case the position $(x', y')$ does not depend on the depth value of the point. Under orthographic projection this occurs when the object is rotated around the line of sight and then translated arbitrarily. Under perspective projection $\mathbf{v}$ vanishes when the object is rotated around the camera.

The epipolar lines are parallel in the orthographic case, since in this case $\mathbf{v}$ depends solely on the transformation and is therefore common to all object points. This is not always true in the perspective case. In this case the epipolar lines are parallel only if the transformation includes no translation in depth. If, however, $t_z \neq 0$ the epipolar lines coincide at a single point known as the focus of expansion.

A rigid transformation divides the image into epipolar lines within which correspondence cannot be determined. Every epipolar line in one image has its corresponding epipolar line in the second image, in the sense that all the points that lie along some epipolar line in the first image share the same epipolar line in the second image and vice versa. This is established in Proposition 2.

Proposition 2: Let $p_1, p_2 \in P_1$ be two points that lie along some common epipolar line. The epipolar line of $p_1$ and the epipolar line of $p_2$ in $P_2$ coincide.

Since rigidity alone determines the correspondence only up to epipolar lines, it may in some cases be sufficient to recover the epipolar lines rather than all the parameters of the transformation. Interestingly, under orthographic projection the epipolar lines are determined from two images while the transformation is not. Four non-coplanar points are required for this task. The transformation breaks up into its planar parts and non-planar parts. The planar parts of the transformation are determined by the epipolar lines, while the non-planar parts, the rotation in depth, cannot be recovered. In the perspective case both the epipolar lines and the transformation are determined from two images. In this case seven points are required.

The results above apply in two additional cases that extend beyond the set of rigid transformations. Epipolar lines exist when the objects considered undergo general 3-D affine transformation, which includes stretch and shear. The same applies under orthographic projection to objects with smooth bounding surfaces. In this case the contours change their position on the object with the viewpoint. (See a discussion in [Basri & Ullman 1988].) This motion is projected along epipolar lines (see Section 2.3 below). In both cases, corresponding points lie along epipolar lines, and these epipolar lines can be recovered from a small set of corresponding points.

2.1  Orthographic  Projection 

In this section we repeat the results presented in the beginning of Section 2 and prove them for the orthographic case. Let $P_1$ and $P_2$ be two images of a rigid object from two arbitrary viewpoints. Let $p = (x, y, z)$ be an object point. Its position in $P_1$ is given by $(x, y)$, and its position in $P_2$ is given by $(x', y')$, which is the orthographic projection of $sRp + \mathbf{t}$, where $s$ is a scale factor, $R$ is a $3 \times 3$ rotation matrix, and $\mathbf{t}$ is a translation vector.

In the following analysis we assume that the transformation between the images (namely, $s$, $R$, and $\mathbf{t}$) is known. We select a point $(x, y)$ in the first image and compute its possible positions in the second image. We show that the set of these positions forms a straight line, and that the exact position along this line is determined by its depth value.

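Proposition 1a: Given a rigid transformation defined by $\{s, R, \mathbf{t}\}$ and a point $(x, y) \in P_1$, its corresponding point in $P_2$ lies along the epipolar line given by:

$$(x', y') = \mathbf{u} + z\,\mathbf{v}$$

where $\mathbf{u}, \mathbf{v} \in \mathbb{R}^2$ are constants.

Proof: Since

$$\begin{pmatrix} x' \\ y' \end{pmatrix} = \begin{pmatrix} s(r_{11}x + r_{12}y + r_{13}z) + t_x \\ s(r_{21}x + r_{22}y + r_{23}z) + t_y \end{pmatrix}$$

we define

$$\mathbf{u} = \begin{pmatrix} s r_{11}x + s r_{12}y + t_x \\ s r_{21}x + s r_{22}y + t_y \end{pmatrix}, \qquad \mathbf{v} = \begin{pmatrix} s r_{13} \\ s r_{23} \end{pmatrix}$$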

Notice that since the transformation is given, $\mathbf{u}$ and $\mathbf{v}$ are determined, and the corresponding point for $(x, y)$ lies along the straight line $\mathbf{u} + z\mathbf{v}$. When the depth value, $z$, of the point is given, the location of the corresponding point along the line is determined, and vice versa: selecting a corresponding point along the line determines its depth value. When $\mathbf{v} = 0$ the epipolar line degenerates into a point. In this case the images are separated by a rotation about the line of sight (plus some arbitrary translation). For symmetry reasons we obtain the same results for points in the second image, namely, that their corresponding points in $P_1$ lie along straight lines.
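As a numerical illustration of Proposition 1a, the following minimal sketch builds $\mathbf{u}$ and $\mathbf{v}$ for one image point and checks that the projected point moves along $\mathbf{u} + z\mathbf{v}$ as the depth varies. The helper names (`rot_z`, `rot_y`, `project_orthographic`, `epipolar_line_orthographic`) and all numeric values are assumptions made for this example only; they do not come from the paper.

```python
import numpy as np

def rot_z(a):
    """Rotation about the Z-axis (line of sight) by angle a."""
    c, s = np.cos(a), np.sin(a)
    return np.array([[c, -s, 0.0], [s, c, 0.0], [0.0, 0.0, 1.0]])

def rot_y(b):
    """Rotation about the Y-axis (a rotation in depth) by angle b."""
    c, s = np.cos(b), np.sin(b)
    return np.array([[c, 0.0, s], [0.0, 1.0, 0.0], [-s, 0.0, c]])

def project_orthographic(p, s, R, t):
    """Orthographic projection of s*R*p + t: keep the first two coordinates."""
    return (s * (R @ p) + t)[:2]

def epipolar_line_orthographic(x, y, s, R, t):
    """Return (u, v) so that the match of (x, y) in the second image is u + z*v."""
    u = s * (R[:2, 0] * x + R[:2, 1] * y) + t[:2]
    v = s * R[:2, 2]  # depends only on the transformation, not on (x, y)
    return u, v

# Illustrative transformation and image point (arbitrary values).
R = rot_z(-0.4) @ rot_y(0.5) @ rot_z(0.3)
s, t = 1.2, np.array([0.5, -1.0, 2.0])
x, y = 0.7, -0.2
u, v = epipolar_line_orthographic(x, y, s, R, t)

# Distance between the true projection and u + z*v for several depths.
residuals = [
    np.linalg.norm(project_orthographic(np.array([x, y, z]), s, R, t) - (u + z * v))
    for z in (-1.0, 0.0, 2.5)
]
```

Because `v` is built from the third column of `R` only, a second call with a different `(x, y)` returns the same `v`, which is the reason the orthographic epipolar lines are parallel.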


The epipolar lines in each of the images are parallel. This follows from the fact that $\mathbf{v}$ depends solely on the transformation, and therefore has a common value for all image points. All the points in $P_1$ that lie along a single epipolar line share the same epipolar line in $P_2$. This is established in the following proposition.

Proposition 2a: Let $p_1, p_2 \in P_1$ be two points that lie along some common epipolar line. The epipolar line of $p_1$ and the epipolar line of $p_2$ in $P_2$ coincide.

Proof: All the epipolar lines are parallel. According to the definition of an epipolar line, since $p_1$ and $p_2$ lie along a single epipolar line, both are possible matches of a single point, $q$, in $P_2$. Therefore, the epipolar lines of $p_1$ and $p_2$ intersect in $q$, and since epipolar lines are parallel they must coincide. Consequently, rigidity determines the correspondence between epipolar lines, but does not resolve the correspondence within these lines.

When only two images are given the transformation cannot be fully recovered. The epipolar lines, however, can be recovered using a correspondence set of four non-coplanar points. A linear equation from which the epipolar lines can be computed is given below. We shall use the following notation. Let $(x_i, y_i) \in P_1$ and $(x'_i, y'_i) \in P_2$ be a pair of corresponding points, namely, they are the projections of a common point in 3-D space, $p_i = (x_i, y_i, z_i)$. We shall have $n$ such correspondences. (To solve this equation $n$ must be $\geq 4$.) Denote $\mathbf{x} = (x_1, ..., x_n)$, $\mathbf{y} = (y_1, ..., y_n)$, $\mathbf{z} = (z_1, ..., z_n)$, $\mathbf{x}' = (x'_1, ..., x'_n)$, $\mathbf{y}' = (y'_1, ..., y'_n)$, and $\mathbf{1} = (1, ..., 1) \in \mathbb{R}^n$. According to [Ullman & Basri 1991], $\mathbf{x}$, $\mathbf{y}$, $\mathbf{z}$, $\mathbf{x}'$, $\mathbf{y}'$, and $\mathbf{1}$ are all embedded in a 4-D linear space. This follows from the identities below

$$\mathbf{x}' = s r_{11}\mathbf{x} + s r_{12}\mathbf{y} + s r_{13}\mathbf{z} + t_x \mathbf{1}$$
$$\mathbf{y}' = s r_{21}\mathbf{x} + s r_{22}\mathbf{y} + s r_{23}\mathbf{z} + t_y \mathbf{1}$$

Consequently, $\{\mathbf{x}, \mathbf{y}, \mathbf{z}, \mathbf{1}\}$ span a 4-D linear space to which $\mathbf{x}'$ and $\mathbf{y}'$ also belong. Therefore, there exist scalars $a_1$, $a_2$, $b_1$, $b_2$, and $c$, not all zero, such that:

$$a_1 \mathbf{x} + a_2 \mathbf{y} + b_1 \mathbf{x}' + b_2 \mathbf{y}' + c\,\mathbf{1} = 0$$

These coefficients are determined (up to a scale factor) by four non-coplanar points. The epipolar lines are immediately derived from this equation. (This result is proved somewhat differently in [Huang & Lee 1989, Lee & Huang 1990].)
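The linear relation among $\mathbf{x}$, $\mathbf{y}$, $\mathbf{x}'$, $\mathbf{y}'$ and $\mathbf{1}$ can be solved numerically as a null-space problem. The sketch below is one possible way to do so, using a singular value decomposition. The function name `fit_epipolar_coefficients`, the synthetic object and the transformation values are assumptions made for illustration, not part of the original method description.

```python
import numpy as np

def rot_z(a):
    c, s = np.cos(a), np.sin(a)
    return np.array([[c, -s, 0.0], [s, c, 0.0], [0.0, 0.0, 1.0]])

def rot_y(b):
    c, s = np.cos(b), np.sin(b)
    return np.array([[c, 0.0, s], [0.0, 1.0, 0.0], [-s, 0.0, c]])

def fit_epipolar_coefficients(pts1, pts2):
    """Solve a1*x + a2*y + b1*x' + b2*y' + c = 0 for (a1, a2, b1, b2, c).

    pts1, pts2: (n, 2) arrays of corresponding image points, n >= 4,
    coming from 3-D points that are not coplanar.
    The solution is defined up to scale and is returned with unit norm.
    """
    M = np.column_stack([pts1, pts2, np.ones(len(pts1))])  # rows [x, y, x', y', 1]
    _, _, vt = np.linalg.svd(M)
    return vt[-1]  # right singular vector of the smallest singular value

# Synthetic data: arbitrary object points and an arbitrary scaled rigid motion.
rng = np.random.default_rng(0)
obj = rng.normal(size=(6, 3))
R = rot_z(-0.4) @ rot_y(0.5) @ rot_z(0.3)
s, t = 1.3, np.array([0.4, -0.7, 0.0])
pts1 = obj[:, :2]                       # first image: (x, y)
pts2 = (s * obj @ R.T + t)[:, :2]       # second image: projection of s*R*p + t
coeffs = fit_epipolar_coefficients(pts1, pts2)

# Coefficients predicted by eliminating z from the two projection equations.
expected = np.array([
    s * R[2, 1], -s * R[2, 0], R[1, 2], -R[0, 2],
    -(R[1, 2] * t[0] - R[0, 2] * t[1]),
])
```

Given the coefficients, the epipolar line in the second image of a point $(x, y)$ in the first image is the set of $(x', y')$ with $b_1 x' + b_2 y' + (a_1 x + a_2 y + c) = 0$. With noisy correspondences the same call returns a least-squares estimate when more than four points are supplied.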

The epipolar lines break the transformation that relates the images into its planar components and its non-planar ones. The planar components can be recovered from the epipolar lines, while the non-planar ones cannot be determined from two images. The translation component perpendicular to the epipolar line is given by $c$. (The translation components can be discarded altogether if we consider differences between points rather than the points themselves.) The values of the other coefficients are given below.

$$a_1 = s r_{32}, \qquad b_1 = r_{23}$$
$$a_2 = -s r_{31}, \qquad b_2 = -r_{13}$$

The scale factor is therefore given by the ratio

$$s = \sqrt{\frac{a_1^2 + a_2^2}{b_1^2 + b_2^2}}$$

The relative angle between the epipolar lines determines the planar parts of the rotation, as explained below. A 3-D rotation can be decomposed into a sequence of three successive rotations: a rotation about the Z-axis by an angle $\alpha$, a second rotation about the Y-axis by an angle $\beta$, and a third rotation about the Z-axis by an angle $\gamma$. Under this decomposition the following identities hold

$$r_{32} = \sin\alpha \sin\beta, \qquad r_{23} = \sin\beta \sin\gamma$$
$$r_{31} = -\cos\alpha \sin\beta, \qquad r_{13} = \sin\beta \cos\gamma$$

We therefore obtain that

$$\alpha = \tan^{-1}\frac{a_1}{a_2}, \qquad \gamma = -\tan^{-1}\frac{b_1}{b_2}$$

while $\beta$ cannot be determined.
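The recovery of the planar parameters from the coefficients can be sketched as follows. This is an illustration under stated assumptions: the function name `planar_parameters` is hypothetical, the coefficients are taken as $a_1 = s r_{32}$, $a_2 = -s r_{31}$, $b_1 = r_{23}$, $b_2 = -r_{13}$ up to a common scale, and $a_2$ and $b_2$ are assumed to be nonzero. Since the coefficients are known only up to scale, the angles are recovered modulo $\pi$.

```python
import numpy as np

def planar_parameters(a1, a2, b1, b2):
    """Recover scale s and image-plane rotation angles (alpha, gamma).

    Only ratios of the coefficients are used, so a common scale factor
    (including a sign flip) on (a1, a2, b1, b2) does not change the result.
    Angles are returned in (-pi/2, pi/2).
    """
    s = np.sqrt((a1**2 + a2**2) / (b1**2 + b2**2))
    alpha = np.arctan(a1 / a2)
    gamma = -np.arctan(b1 / b2)
    return s, alpha, gamma

# Build coefficients from known parameters (arbitrary illustrative values).
s_true, alpha_true, beta_true, gamma_true = 1.7, 0.3, 0.5, -0.4
a1 = s_true * np.sin(alpha_true) * np.sin(beta_true)    # s * r32
a2 = s_true * np.cos(alpha_true) * np.sin(beta_true)    # -s * r31
b1 = np.sin(beta_true) * np.sin(gamma_true)             # r23
b2 = -np.sin(beta_true) * np.cos(gamma_true)            # -r13

k = -2.5  # unknown common scale of the estimated coefficients
s_est, alpha_est, gamma_est = planar_parameters(k * a1, k * a2, k * b1, k * b2)
```

Note that `beta_true` cancels out of every ratio used above, which mirrors the statement that the rotation in depth cannot be determined from two orthographic images.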

We can visualize this decomposition in the following way. After compensating for the translation and scale changes, we first rotate the image $P_1$ by $\alpha$. Consequently, the epipolar lines in $P_1$ point in a horizontal direction. We then rotate the second image, $P_2$, by $-\gamma$. As a result, the epipolar lines in $P_2$ also point horizontally. The images obtained are related by a rotation about the vertical axis, which is a rotation in depth. Following such a rotation the points move horizontally, that is, along the (rotated) epipolar lines. This motion cannot be recovered since it depends both on the angle of rotation, $\beta$, and on the depth of the points.


2.2  Perspective  Projection 

In this section we repeat the results presented in the beginning of Section 2 and prove them for the perspective case. We use the following notation. An object point $p$ is denoted by $z(x, y, 1)$. It is projected in $P_1$ to the position $(x, y)$ and in $P_2$ to $(x', y')$. (There the actual 3-D position of the point is denoted by $z'(x', y', 1)$.)

Proposition 1b: Given a rigid transformation that includes a rotation $R$ and a translation $\mathbf{t}$, and given a point $(x, y) \in P_1$, its corresponding point in $P_2$ lies along the epipolar line given by

$$(x', y') = \mathbf{u} + \alpha(z)\,\mathbf{v}$$

where $\mathbf{u}, \mathbf{v} \in \mathbb{R}^2$ are constants, and $\alpha(z)$ is scalar.
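Proof: Denote

$$\begin{pmatrix} x_r \\ y_r \\ z_r \end{pmatrix} = R \begin{pmatrix} x \\ y \\ 1 \end{pmatrix}$$

Note that $x_r$, $y_r$, and $z_r$ are all independent of $z$. Since

$$z' \begin{pmatrix} x' \\ y' \\ 1 \end{pmatrix} = z \begin{pmatrix} x_r \\ y_r \\ z_r \end{pmatrix} + \begin{pmatrix} t_x \\ t_y \\ t_z \end{pmatrix}$$

we obtain that

$$\begin{pmatrix} x' \\ y' \end{pmatrix} = \frac{1}{z z_r + t_z} \begin{pmatrix} z x_r + t_x \\ z y_r + t_y \end{pmatrix}$$

And so we define

$$\mathbf{u} = \begin{pmatrix} x_r / z_r \\ y_r / z_r \end{pmatrix}, \qquad \mathbf{v} = \begin{pmatrix} t_x - t_z x_r / z_r \\ t_y - t_z y_r / z_r \end{pmatrix}, \qquad \alpha(z) = \frac{1}{z z_r + t_z}$$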

Parallel epipolar lines are obtained when $t_z = 0$. In this case $\mathbf{v}$ is independent of the position of the point and depends solely on the transformation. If, however, $t_z \neq 0$ the epipolar lines intersect in one point, called the focus of expansion. This point stands for $z = 0$, and its location in $P_2$ is given by

$$(x', y') = \left(\frac{t_x}{t_z}, \frac{t_y}{t_z}\right)$$

The location of the focus of expansion in $P_1$ corresponds to the case when $\mathbf{v} = 0$. This condition implies the following linear equation system

$$t_z x_r = t_x z_r$$
$$t_z y_r = t_y z_r$$

from which this location can be retrieved. (Recall that $x_r$, $y_r$, and $z_r$ are linear functions of $x$ and $y$.)
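The perspective case can be checked numerically in the same spirit. The sketch below uses one decomposition that is consistent with Proposition 1b and with the conditions stated above, namely $\mathbf{u} = (x_r/z_r, y_r/z_r)$, $\mathbf{v} = (t_x - t_z x_r/z_r,\ t_y - t_z y_r/z_r)$ and $\alpha(z) = 1/(z z_r + t_z)$. This particular choice, the function names and the numeric values are assumptions made for illustration.

```python
import numpy as np

def rot_x(a):
    c, s = np.cos(a), np.sin(a)
    return np.array([[1.0, 0.0, 0.0], [0.0, c, -s], [0.0, s, c]])

def rot_y(b):
    c, s = np.cos(b), np.sin(b)
    return np.array([[c, 0.0, s], [0.0, 1.0, 0.0], [-s, 0.0, c]])

def project_perspective(x, y, z, R, t):
    """Project the 3-D point z*(x, y, 1) after the rigid motion (R, t)."""
    q = z * (R @ np.array([x, y, 1.0])) + t
    return q[:2] / q[2]

def epipolar_line_perspective(x, y, R, t):
    """Return u, v and the scalar function alpha with (x', y') = u + alpha(z) * v."""
    xr, yr, zr = R @ np.array([x, y, 1.0])
    u = np.array([xr / zr, yr / zr])
    v = np.array([t[0] - t[2] * xr / zr, t[1] - t[2] * yr / zr])
    alpha = lambda z: 1.0 / (z * zr + t[2])
    return u, v, alpha

# Arbitrary illustrative motion with a nonzero translation in depth.
R = rot_y(0.2) @ rot_x(-0.1)
t = np.array([0.3, -0.2, 0.5])
x, y = 0.4, -0.3
u, v, alpha = epipolar_line_perspective(x, y, R, t)

residuals = [
    np.linalg.norm(project_perspective(x, y, z, R, t) - (u + alpha(z) * v))
    for z in (1.0, 2.5, 10.0)
]
# At z = 0 every epipolar line passes through the focus of expansion.
focus_of_expansion = project_perspective(x, y, 0.0, R, t)
```

Setting `t[2]` to zero in this sketch makes `v` equal to `(t[0], t[1])` for every image point, which is the parallel case described in the text.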

Similar to the orthographic case, points that lie on a common epipolar line in one image share the same epipolar line in the other.

Proposition 2b: Let $p_1, p_2 \in P_1$ be two points that lie along some common epipolar line, and assume that neither $p_1$ nor $p_2$ is the focus of expansion. The epipolar line of $p_1$ and the epipolar line of $p_2$ in $P_2$ coincide.

Proof: If $t_z = 0$ the epipolar lines are parallel and the proof is identical to that of the orthographic case. If $t_z \neq 0$ the epipolar lines in each image intersect in the focus of expansion. Since the points lie along a common epipolar line in $P_1$ there exists a point $q$ in $P_2$ that is a possible match to both points, and $q$ is not the focus of expansion. Therefore, the epipolar line of $p_1$ and that of $p_2$ intersect in $q$, and since both lines also intersect in the focus of expansion they must coincide.

In the perspective case the transformation can be determined (up to a scale factor) by a correspondence set of seven points [Longuet-Higgins 1981, Tsai & Huang 1984]. For the sake of completeness we review in Appendix A one method to recover the transformation from eight corresponding points using essentially linear operations. This method appeared in Tsai & Huang [1984].

It is worth noting that although in the perspective case the transformation can be recovered from two images the computation may in many cases be unstable. This happens when the object is relatively distant from the camera, in which case depth differences are relatively small and perspective distortions are negligible, and when the depth translation component vanishes, in which case the epipolar lines are parallel. These cases are essentially similar to the orthographic case. In both cases the transformation obtained is unstable, but the epipolar lines are still stable.


2.3  Extensions 

In the previous discussion we showed that rigidity determines the correspondence up to epipolar lines and that the position of points along these lines is determined by their depth values. We also showed that the epipolar lines can be recovered from a small set of corresponding points. In this section we extend these results to two additional cases: images of objects that undergo general affine transformation and contour images of rigid objects with smooth bounding surfaces.

An affine transformation in 3-D space is composed of a general linear transformation followed by a translation. The set of affine transformations contains, in addition to all the rigid transformations, also stretch and shear. Similar to the rigid case, in a pair of images of an object that undergoes an affine transformation, corresponding points lie along epipolar lines. This is true under both orthographic and perspective projections. It follows from the fact that in proving the results above we never used the special properties of the rotation matrix.
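A short numerical check of the orthographic affine case is sketched below. It replaces the scaled rotation by an arbitrary $3 \times 3$ matrix and verifies that the vectors $\mathbf{x}$, $\mathbf{y}$, $\mathbf{x}'$, $\mathbf{y}'$ and $\mathbf{1}$ remain linearly dependent, so an epipolar relation still exists. The function name `smallest_to_largest_singular_ratio` and the random test data are assumptions for this example.

```python
import numpy as np

def smallest_to_largest_singular_ratio(obj, A, t):
    """Ratio of smallest to largest singular value of [x, y, x', y', 1].

    obj: (n, 3) object points; A: general 3x3 linear map; t: translation.
    A ratio near zero means the five columns are linearly dependent.
    """
    img1 = obj[:, :2]
    img2 = (obj @ A.T + t)[:, :2]  # orthographic projection of A*p + t
    M = np.column_stack([img1, img2, np.ones(len(obj))])
    sing = np.linalg.svd(M, compute_uv=False)
    return sing[-1] / sing[0]

rng = np.random.default_rng(1)
A = rng.normal(size=(3, 3))   # general linear map: may include stretch and shear
t = rng.normal(size=3)
obj = rng.normal(size=(10, 3))
ratio = smallest_to_largest_singular_ratio(obj, A, t)
```

The dependency holds because the second image coordinates are still linear combinations of $\mathbf{x}$, $\mathbf{y}$, $\mathbf{z}$ and $\mathbf{1}$, whatever the entries of `A` are.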

When only a pair of images is given, it is impossible to tell whether the objects in these images are moving rigidly or whether they undergo an affine (non-rigid) transformation. Basri & Ullman [1991] showed that under orthographic projection the set of images of a rigid object is contained in a 4-D linear space, and additional (quadratic) constraints distinguish between these images and other vectors in this space. These other vectors are, in fact, images obtained by applying a general 3-D affine transformation to the object. The quadratic constraints cannot be recovered from two images. Hence, it is impossible to distinguish between the two cases when only two images are given. A similar ambiguity holds under perspective projection. It is worth noting that general affine transformations approximate the way moving objects are observed in movies from different viewpoints. This effect has been known since 1859 as the La Gournerie Paradox and was discussed recently by Jacobs [1991].

A second interesting case is that of rigid objects with smooth surfaces. The bounding contours of such an object are generated by surface patches that are tangent to the line of sight. These patches are usually referred to as the rim [Koenderink & Van Doorn 1979] or the contour generator [Marr 1977] of the object. Since the surface of the object is smooth, when the object rotates in depth a new set of surface patches that are now tangent to the new line of sight replaces the original rim, generating a new set of bounding contours. Establishing correspondence between the original and the new bounding contours of the object is therefore problematic, since the contours undergo, in addition to the rigid transformation, some arbitrary motion that depends on the exact shape of the object.

Tracing the positions of these contours is useful for any shape reconstruction and object recognition scheme that is based on contour matching. A method to predict the appearance of objects with smooth surfaces for recognition was recently developed [Basri & Ullman 1988]. The method assumes an orthographic projection and uses the 3-D curvature of points along the contours to follow their change in position with viewpoint. The curvature values were computed from a few images of the object by matching the contours in these images.

The next observation demonstrates that in the case of objects with smooth surfaces under orthographic projection corresponding points lie along epipolar lines. We first look at the simpler case of an object that rotates about the vertical axis. Let $p$ be a rim point, and let us take a horizontal slice of the object that contains $p$. (Namely, if $p = (x_0, y_0, z_0)$ we consider the plane $y = y_0$.) The intersection of the surface of the object with this plane forms a space curve, $C$. When the object rotates, the rim point $p$ changes its position on the object along $C$. Denote the new rim point by $p'$. Since this is a rotation about the Y-axis, the epipolar lines in both images are horizontal. Therefore, all the points on $C$, including $p$ and $p'$, are projected to a common epipolar line in both images.

We now extend this observation to general rigid transformations. Rotation is the only component that affects the rim. Translation and scaling do not change the rim and therefore can be disregarded. A 3-D rotation can be decomposed into three successive rotations, around the Z-, Y-, and Z-axes. (The same decomposition was used in Section 2.1.) As we did in Section 2.1, we apply the first rotation to the first image, and (the inverse of) the last rotation to the second image. Both rotations are image rotations, and they do not change the rim. After applying these rotations we obtain that the two images are related by a rotation about the vertical axis, and hence their epipolar lines are horizontal. (See Section 2.1.) Therefore, the observed positions of the rim points change along epipolar lines.

Figure 1 shows the epipolar lines in two orthographic projections of a VW car. Notice that the matching between silhouette points along epipolar lines is good although they are generated by smooth surfaces.

3  Resolving  Point  Correspondence 

In the previous section we have shown that rigidity alone is insufficient to solve the correspondence problem uniquely from two images. It divides the images into epipolar lines. The matching of these lines is determined by the transformation that separates the images, but the correspondence of points within the lines cannot be resolved. In this section we examine the problem of establishing correspondence in three or more images. We show that, as in the case of two images, the correspondence is not determined uniquely. Additional images, however, provide constraints that can be used to solve the problem heuristically (e.g., the trinocular stereovision algorithm [Yachida 1986]). We discuss several additional constraints that can be used together with epipolar lines to find the correspondence between images. These methods were implemented and the results are presented below.

It should be noted that the use of epipolar lines to determine correspondence is limited to those regions in the images that are consistent with a rigid (or affine) transformation. When the images contain a number of rigid objects moving independently, each of the objects may determine a different set of epipolar lines. A segmentation process must be applied to separate these objects and divide the images into regions with consistent sets of epipolar lines. We shall not address the segmentation problem in this paper.

Figure 1: Epipolar lines in two orthographic projections of a VW car. Note that corresponding points lie along epipolar lines. The silhouette contours deserve special attention for being generated from smooth surfaces.


3.1  Correspondence  from  Three  Images 

We have so far explored the establishment of correspondence from two images. We showed that the correspondence between points in the images cannot be uniquely resolved even if the transformation is known. We now address the following question: can the correspondence be resolved when three or more images are considered? Structure from motion theory demonstrates that the answer to this question is not trivial. When correspondence is given, under orthographic projection two images are not sufficient to recover the transformation, but three are [Ullman 1979, Huang & Lee 1989]. The correspondence problem is nevertheless different from the structure from motion problem. Point correspondence cannot be resolved by using any number of additional images. Yet, additional images provide information that can be used to filter out less likely solutions.

Proposition 3 establishes that point correspondence cannot be resolved from any number of images. Let $P_1, P_2, ..., P_k$ be $k$ images. Let $(x_i, y_i)$, $1 \leq i \leq k$, be the locations of a point $p = (x, y, z)$ in $P_i$. (Assume without loss of generality that $x_1 = x$ and $y_1 = y$.) Let $T_i$, $2 \leq i \leq k$, be the rigid transformation applied to $p$ in $P_i$, assuming orthographic projection.

Proposition 3: Given $T_2, ..., T_k$, the set of possible locations of $p$ in $P_2, ..., P_k$ is a straight line in $\mathbb{R}^{2(k-1)}$ given by:

$$(x_2, y_2, ..., x_k, y_k) = \mathbf{u} + z\,\mathbf{v}$$

where $\mathbf{u}, \mathbf{v} \in \mathbb{R}^{2(k-1)}$ are constants.

Proof: This is obtained simply by defining $\mathbf{u} = (\mathbf{u}_2, ..., \mathbf{u}_k)$ and $\mathbf{v} = (\mathbf{v}_2, ..., \mathbf{v}_k)$, where $\mathbf{u}_i, \mathbf{v}_i \in \mathbb{R}^2$ are the corresponding vectors $\mathbf{u}$ and $\mathbf{v}$ from Proposition 1a.

This proposition implies that the number of possible correspondences for each point is infinite. Every possible assignment of $z$ yields a different location of the point in all of the images. An equivalent claim can be made in the case of perspective projection.

There is, however, one additional consequence to this proposition. Determining the correspondence between two of the images immediately implies the correspondence in all other images. This property suggests a hypothesis-verification heuristic to recover correspondence. The algorithm first selects a point in the first image, hypothesizes its correspondence in the second image, computes accordingly its position in the third, and then verifies its appearance in the predicted location. This algorithm is used in trinocular stereopsis [Yachida 1986]. The algorithm can be defined in two versions. The first requires the transformation between the images. It predicts the position of points in the third image by explicitly computing their depth values. The second requires the epipolar lines between all pairs of images. It predicts the position of points in the third image by intersecting epipolar lines.

Version 1.

1. Select a point $p = (x, y) \in P_1$ and find its epipolar line $A$ in $P_2$.

2. For all candidates $q_1, ..., q_n$ along $A$ compute the corresponding depth values $z_1, ..., z_n$.

3. For every possible depth value, $z_1, ..., z_n$, compute the position of the point $(x, y, z_i)$ in $P_3$ and verify its actual appearance at this location.
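The three steps of Version 1 can be sketched as follows for the orthographic case. This is a minimal illustration under stated assumptions, not the implementation used for the experiments: the function names, the tolerance `tol`, the use of a nearest-point test as the "verification" step, and the tiny synthetic object are all hypothetical.

```python
import numpy as np

def rot_x(a):
    c, s = np.cos(a), np.sin(a)
    return np.array([[1.0, 0.0, 0.0], [0.0, c, -s], [0.0, s, c]])

def rot_y(b):
    c, s = np.cos(b), np.sin(b)
    return np.array([[c, 0.0, s], [0.0, 1.0, 0.0], [-s, 0.0, c]])

def project(p, T):
    """Orthographic projection of s*R*p + t for T = (s, R, t)."""
    s, R, t = T
    return (s * (R @ p) + t)[:2]

def depth_from_candidate(xy, q, T2):
    """Step 2: depth z for which (x, y, z) projects onto candidate q in P2."""
    s, R, t = T2
    u = s * (R[:2, 0] * xy[0] + R[:2, 1] * xy[1]) + t[:2]
    v = s * R[:2, 2]
    return float(np.dot(q - u, v) / np.dot(v, v))  # position of q along u + z*v

def verify_candidates(xy, candidates, T2, T3, points3, tol=1e-6):
    """Step 3: keep candidates whose predicted P3 position hits an observed point."""
    accepted = []
    for i, q in enumerate(candidates):
        z = depth_from_candidate(xy, q, T2)
        predicted = project(np.array([xy[0], xy[1], z]), T3)
        if np.min(np.linalg.norm(points3 - predicted, axis=1)) < tol:
            accepted.append(i)
    return accepted

# Synthetic example: three object points and two known transformations.
T2 = (1.0, rot_y(0.4), np.array([0.2, -0.1, 0.0]))
T3 = (1.1, rot_x(-0.5) @ rot_y(0.2), np.array([-0.3, 0.4, 0.0]))
obj = np.array([[0.7, -0.2, 1.0], [-0.5, 0.4, -0.3], [0.1, 0.9, 0.6]])
points3 = np.array([project(p, T3) for p in obj])   # features observed in P3
xy = obj[0, :2]                                     # step 1: selected point in P1
# Candidates on the epipolar line in P2: the true match first, then two false ones.
candidates = [project(np.array([xy[0], xy[1], z]), T2) for z in (1.0, -2.0, 3.0)]
accepted = verify_candidates(xy, candidates, T2, T3, points3)
```

In real images the exact-coincidence test would be replaced by a tolerance suited to the feature localization error, and several candidates may survive, as discussed below.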

Version 2.

1. Select a point $p = (x, y) \in P_1$, and find its epipolar lines $A$ in $P_2$ and $B$ in $P_3$.

2. For all candidates $q_1, ..., q_n$ along $A$ compute their epipolar lines $C_1, ..., C_n$ in $P_3$.

3. Intersect each of the lines, $C_1, ..., C_n$, with $B$ and verify the actual appearance of $p$ in these locations.
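Version 2 can be sketched without any explicit depth computation. In the illustration below the pairwise epipolar coefficients $(a_1, a_2, b_1, b_2, c)$ are estimated from anchor correspondences by a null-space computation, and the position in $P_3$ is predicted by intersecting two lines. The function names, the SVD-based fit and the synthetic views are assumptions made for this example only.

```python
import numpy as np

def rot_x(a):
    c, s = np.cos(a), np.sin(a)
    return np.array([[1.0, 0.0, 0.0], [0.0, c, -s], [0.0, s, c]])

def rot_y(b):
    c, s = np.cos(b), np.sin(b)
    return np.array([[c, 0.0, s], [0.0, 1.0, 0.0], [-s, 0.0, c]])

def fit_pair(pts_a, pts_b):
    """Coefficients (a1, a2, b1, b2, c) with a1*x + a2*y + b1*x' + b2*y' + c = 0."""
    M = np.column_stack([pts_a, pts_b, np.ones(len(pts_a))])
    return np.linalg.svd(M)[2][-1]

def epipolar_line(coeffs, p_a):
    """Line in image B of point p_a from image A, as (n, d) with n . q + d = 0."""
    a1, a2, b1, b2, c = coeffs
    return np.array([b1, b2]), a1 * p_a[0] + a2 * p_a[1] + c

def predict_in_third(p1, q2, coeffs13, coeffs23):
    """Intersect line B (from p1) with line C (from candidate q2) in P3.

    np.linalg.solve raises LinAlgError when the two lines are parallel.
    """
    n_b, d_b = epipolar_line(coeffs13, p1)
    n_c, d_c = epipolar_line(coeffs23, q2)
    return np.linalg.solve(np.vstack([n_b, n_c]), -np.array([d_b, d_c]))

# Synthetic object seen in three orthographic views (arbitrary values).
rng = np.random.default_rng(0)
obj = rng.normal(size=(7, 3))
img1 = obj[:, :2]
img2 = (1.2 * obj @ rot_y(0.5).T + np.array([0.3, -0.2, 0.0]))[:, :2]
img3 = (0.9 * obj @ rot_x(0.6).T + np.array([-0.1, 0.4, 0.0]))[:, :2]

# Anchor points: the first six correspondences; the seventh is held out.
coeffs13 = fit_pair(img1[:6], img3[:6])
coeffs23 = fit_pair(img2[:6], img3[:6])
predicted = predict_in_third(img1[6], img2[6], coeffs13, coeffs23)
```

The two pairs are fitted independently here, so different anchor sets could be passed to the two `fit_pair` calls.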

The two versions of the algorithm are essentially similar. The first version uses the transformation between the images to compute depth values. The second version replaces this computation by intersecting epipolar lines. Note that the transformation can be computed from three images using four non-coplanar points [Ullman 1979]. The second version can be used only if the epipolar lines $C_i$ intersect with $B$. This requirement means that, for every image, its epipolar lines with respect to the other images should all be non-parallel. This requirement is equivalent to requiring the transformations to be independent. Unless this condition is met, structure-from-motion algorithms cannot recover the transformation from correspondence [Huang & Lee 1989].

One observation following this algorithm is that, since epipolar lines are defined for pairs of images, one can use different sets of anchor points to recover the epipolar lines in each of the pairs. This is different from most existing structure from motion algorithms, which require the set of anchor points to be identical in all three images.

Note that the use of three images rather than two is reasonable since three images are required to recover structure from motion under orthographic projection [Ullman 1979] and to form a viewer-centered representation for a rigid object [Ullman & Basri 1991].

The algorithm handles both rigid objects and objects that undergo general 3-D affine transformations. There is, however, some difference between the two cases. When four or more images are considered certain configurations of epipolar lines may be consistent with some affine transformations but with no rigid ones. This is concluded from [Basri & Ullman 1991], since three images are necessary to determine the functional constraints that distinguish rigid transformations from affine ones. These constraints then restrict the possible configuration of the epipolar lines in larger sets of images.

It should be stressed that neither version guarantees uniqueness. From time to time several candidates may be found consistent with all three images. Further pruning between these candidates is required. In general the algorithm gives better results for sparse images than for dense ones and for images with arbitrarily distributed texture than for images with uniform texture. (Density refers here to the number of points actually considered by the algorithm relative to the total area of the images.) A common way to reduce the density of an image is to consider its edge map. Edge images are in general still too dense, and a naive implementation of the algorithm would fail to provide a unique solution for many of the points. To avoid this problem we suggest applying this matching procedure to edges rather than to points, using the assumption that continuous edges tend to remain continuous in all images. Unlike Ayache & Lustman [1987], our implementation is not confined to straight line segments, but is applied to arbitrarily curved ones. We exploit the shape variance of image contours to discriminate between correct and false matches.

The modified algorithm was implemented and run on natural images. An example is given in Figures 2-4. In these figures correspondence was sought between three edge images of a VW car (Figure 2). We first selected a contour from the first image. Then we found all the contours in the second image that could possibly match the selected contour. For each of the candidates we computed its location in the third image. We repeated this process for a number of contours. Figure 3 shows the best candidates projected to the third image. Figure 4 shows some of the other candidates projected. None of these candidates matches an actual contour (although some of their points do). The results of this algorithm were used to create object models for recognition. An example of the use of these models can be found in [Ullman & Basri 1991].


3.2  Alternative  constraints 

In this section we briefly discuss several constraints that, combined with the epipolar lines, are useful in establishing point correspondence. The first constraint is traditionally referred to as orderness. Most objects are opaque. Contour segments (and points) on such objects retain their spatial order from different viewpoints. Therefore, a contour segment $B$ that lies between two segments, $A$ and $C$, in one image would in general match some contour segment $B'$, which lies between the two corresponding segments, $A'$ and $C'$ respectively. (Notice that right, left, up, and down can still change, as in the case of a 180° rotation around the line of sight.)

Other cues that may be helpful to resolve the correspondence are parallelism and symmetry. If a pair of contour segments are parallel or symmetrical in one image their corresponding segments in the second image are often parallel or symmetrical respectively. Resolving the correspondence for one segment would therefore indicate a solution for the other segment. It is worth mentioning, however, that perspective projection does not maintain parallelism, and that symmetrical components often appear skewed in the image under both projections. Incorporating these cues into a process of resolving the correspondence may therefore be fairly difficult.

Epipolar lines can be used to improve correspondence
achieved under aperture conditions. Under these conditions
only the component of the match along the direction
perpendicular to the contours is given [Marr & Ullman 1981].
Common techniques to correct the matching use iterative
computation to maximise the smoothness of the
flow [Hildreth 1984], or use sequences of images to find a
rigidly consistent solution [Ullman 1984]. The epipolar
line technique offers an exact solution to the aperture
problem for rigid motion that is computationally
simple and resolves the correspondence from as few as two
images.
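
The idea can be sketched as a 2x2 linear solve. The sketch assumes a locally straight contour, so that the measured normal displacement is exact, and an epipolar line given in the form a*x + b*y + c = 0 in the second image. The function name and parameter layout are hypothetical.

```python
def resolve_aperture(p, normal, normal_disp, epiline):
    """Return p' = p + d with d . normal = normal_disp and p' on the epipolar line.

    p           : (x, y) contour point in the first image
    normal      : (nx, ny) unit normal to the contour at p
    normal_disp : measured displacement along the normal (aperture measurement)
    epiline     : (a, b, c) with a*x + b*y + c = 0 in the second image
    """
    (px, py), (nx, ny), (a, b, c) = p, normal, epiline
    # Unknown displacement d = (dx, dy) satisfies two linear constraints:
    #   nx*dx + ny*dy = normal_disp          (aperture constraint)
    #   a*dx  + b*dy  = -(a*px + b*py + c)   (p + d lies on the epipolar line)
    det = nx * b - ny * a
    if abs(det) < 1e-12:
        raise ValueError("epipolar line is parallel to the contour tangent")
    rhs = -(a * px + b * py + c)
    dx = (normal_disp * b - ny * rhs) / det   # Cramer's rule
    dy = (nx * rhs - a * normal_disp) / det
    return (px + dx, py + dy)
```

The solve fails exactly when the epipolar line runs along the contour, which is the case where the epipolar constraint adds no information to the normal measurement.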

Figure 5 compares the matching obtained under aperture
conditions with the matching obtained using epipolar
lines for two car silhouettes. It should be noted that
in general the aperture problem is associated with short
range motion applications. In this case the computation
of epipolar lines tends to be unstable. One way to
overcome this problem is to recover the epipolar lines
for a sequence of images, such that the difference between
each pair of consecutive images is small, but the
overall transformation accumulated along the sequence is
large. Alternatively, if two “distant” images are provided
the images may first be roughly aligned before aperture
matching takes place.
Figure 2: Epipolar lines in three images of a VW car. Every image contains one set of epipolar lines against each of the other
two images.




Figure  3:  Application  of  the  three  images  algorithm  to  four  contour  pieces  selected  from  the  car  in  Figure  2(a).  (The  selected 
contours  include  the  roof  silhouette,  the  front  window,  the  rear  side  window,  and  the  bottom  silhouette.)  (a)  The  best 
prediction found by the algorithm for the four contour pieces. (b) This prediction overlapped with the actual (third) image.


4  Summary 

The  recovery  of  shape  from  a  motion  sequence  requires  in 
general  establishing  correspondence  between  the  points 
in  the  images.  This  task  is  particularly  difficult  when 
the  images  are  taken  from  viewpoints  that  are  rela¬ 
tively  distant  from  one  another,  conditions  referred  to  as 
“long range  motion”.  Establishing  correspondence  under 
these  conditions  is  important  for  building  both  object- 
centered  as  well  as  viewer-centered  representations  for 
object  recognition.  Such  representations  tend  to  be  more 
stable  as  the  images  from  which  they  are  constructed  are 
separated  by  relatively  large  transformations. 

Information  about  the  shape  of  objects  and  the  trans¬ 
formations  they  undergo  can  be  used  to  guide  the  match¬ 
ing  process.  In  this  paper  we  reviewed  the  constraints 
imposed  on  the  correspondence  by  rigid  transformations 
and  extended  them  to  include  images  of  objects  that  un¬ 
dergo  general  3-D  affine  transformations  as  well  as  rigid 
objects with smooth surfaces. In all these cases the images
are divided into epipolar lines whose correspondence
is determined by the transformation, but the correspondence
of points within the lines cannot be recovered. The


epipolar  lines  can  be  computed  from  a  small  set  of  anchor 
points. 

The correspondence is not determined uniquely even
when three or more images are considered. Additional
images can be used, however, in a heuristic algorithm
to determine point correspondence. Such an algorithm
is  the  trinocular  stereovision  algorithm  [Yachida  1986], 
which  is  designed  to  work  with  sparse  images  and  in  the 
absence  of  uniform  texture.  We  extended  this  algorithm 
to  handle  arbitrarily  curved  edge  images  and  applied  it  to 
images  of  natural  objects.  We  discussed  the  use  of  other 
constraints  such  as  orderness,  parallelism,  and  symme¬ 
try  in  solving  the  correspondence  problem.  Finally,  we 
showed  that  epipolar  lines  can  be  used  to  improve  match¬ 
ing  obtained  under  aperture  conditions.  The  techniques 
described  in  this  paper  were  implemented  and  used  to 
construct  viewer-centered  models  for  object  recognition. 
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of  false  candidates,  (d)  This  prediction  overlapped  with  the  actual  image. 
Abstract

In  this  paper  we  describe  a  new  technique  to  produce  a 
triangulated irregular network (TIN) from a digital
elevation  model  (DEM).  The  overall  goal  is  to 
produce  an  approximate  terrain  description  that 
preserves  the  major  topographic  features  using  a 
greatly  reduced  set  of  points  selected  from  the  original 
DEM.  The  TIN  generation  process  is  iterative;  at  each 
iteration  we  identify  areas  in  the  DEM  that  lie  outside 
of a user-supplied error tolerance in the TIN, and
choose  points  from  the  DEM  to  more  accurately  model 
these  areas.  Point  selection  involves  the  computation 
of  the  difference  between  the  actual  DEM  and  an 
approximate  DEM.  This  approximate  DEM  is 
calculated by interpolating elevation points from the TIN.


The  iterative  nature  of  the  algorithm  permits  users  to 
terminate  the  terrain  approximation  algorithm  based  on 
operational  criteria,  such  as  median  error,  maximal 
error,  or  number  of  points  used  to  construct  the  TIN. 
This  is  particularly  relevant  to  real-time  computer 
image  generation  as  various  scene  rendering  systems 
utilizing  polygonal  terrain  patches  have  well  defined 
limitations  in  order  to  maintain  real-time  performance. 
It  also  implies  that  the  TIN  generation  procedure  is 
automatically sensitive to smooth or rough terrain in
that it selects only enough points to satisfy the required
error  constraints  and  tends  to  place  points  in  those 
areas  having  the  greatest  topographic  complexity.* 
1. Introduction

The  most  common  format  for  storing  terrain  data  is  in 
a  regular  rectangular  array  of  points  known  as  a  digital 
elevation  model  (DEM).  DEM’s  can  be  generated 
using  a  number  of  methods,  including  direct  point 
measurement  and  by  interpolation  from  cartographic 
source  material.  In  the  case  of  point  measurement, 
data  can  be  compiled  using  stereo  imagery  where 
elevation  points  are  measured  on  a  regular  grid,  by 
collecting  along  lines  using  a  manual  analytic  stereo 
plotter,  or  using  automatic  stereo  correlation.  DEM’s 
can  also  be  created  by  digitizing  contour  maps  and 
interpolating  elevations  on  a  regular  grid.  The  DEM  is 
an  easily  generated  and  commonly  available  format, 
but  a  regular  array  of  points  is  not  an  efficient 
representation  in  terms  of  information  for  the  number 
of  points  used. 

An  alternative  terrain  representation  format  is  a 
triangulated  irregular  network  (TIN),  consisting  of  a 
set  of  points  freely  placed  in  3  dimensions  and 
connected  in  a  manner,  nearly  always  planar,  so  as  to 
approximate  a  surface  by  triangular  patches.  TIN’s  can 
be  used  to  model  surfaces  whose  elevations  or 
properties  are  not  easily  sampled,  such  as  aquifers  or 
seismic  data.  The  use  of  TIN’s  which  concerns  us, 
however,  is  the  representation  of  a  surface  which  is 
easily sampled at arbitrary locations. This is the case,
for  example,  when  a  DEM  is  available  which  is  of 
higher  resolution  than  the  desired  end  product. 

TIN’s are a more efficient model than DEM’s for
processing  information  in  a  variety  of  problems.  They 
provide  an  advantage  in  many  applications  simply  by 
reducing  the  sheer  volume  of  data  which  must  be 
considered.  A  TIN  is  capable  of  greater  efficiency 
because  it  can  adapt  its  resolution  to  the  complexity  of 
the  terrain.  Terrain  display  is  a  task  for  which  TIN’s 
are  well-suited,  especially  in  the  case  of  real  time 
applications.  When  determining  which  parts  of  a  scene 
are  visible,  a  TIN  presents  a  relatively  small  number  of 
non-overlapping  polygons  (all  triangles)  which 
represent  the  terrain,  simplifying  graphics  calculations. 
Shading  and  texture-mapping  these  triangular 
representations are also straightforward. TIN’s are
also  useful  in  terrain  analysis,  as  they  represent  terrain 
which,  in  many  cases,  has  already  been  partially 
analyzed.  Peaks,  valleys,  and  ridge  and  channel 
networks  are  all  easily  found  as  they  correspond  to 
vertices  and  edges  in  the  TIN.  Given  a  description  of 
the  topography  in  terms  of  these  features  augmented  by 
slope  and  aspect  information,  the  extraction  of 
drainage patterns and watersheds is simplified.

In  this  paper  we  describe  a  new  technique  to  produce  a 
triangulated  irregular  network  (TIN)  from  a  digital 
elevation  model  (DEM).  The  overall  goal  is  to 
produce  an  approximation  of  the  terrain  description 
that  preserves  the  major  topographic  features  using  a 
small  subset  of  points  selected  from  the  original  DEM. 
The  TIN  generation  process  is  iterative;  at  each 
iteration  we  identify  areas  in  the  DEM  that  lie  outside 
of  a  user-supplied  error  tolerance  in  the  TIN,  and 
attempt  to  choose  points  in  the  DEM  to  more 
accurately  model  these  areas.  Point  selection  involves 
the  computation  of  the  difference  between  the  actual 
DEM  and  an  approximate  DEM.  This  approximate 
DEM  is  calculated  by  interpolating  elevation  points 
from  the  TIN. 

In Section 2 we describe the overall structure of our
TIN  generation  program  based  upon  iterative 
refinement  and  our  methods  for  point  selection  and 
error  evaluation.  In  Section  3  two  examples  are 
presented  to  illustrate  the  TIN  generation  process  using 
real  DEM’s  in  complex  rugged  terrain.  In  the 
following  section  we  provide  some  background  in  this 
research  area. 

1.1.  Previous  Work 

Interest  in  the  generation  and  utilization  of  irregular 
terrain  representations  dates  back  nearly  twenty  years. 
The work of Peucker [Peucker, et. al. 76, Peucker, et.
al.  78]  introduced  the  TIN  terminology  and  outlined 
the  basic  TIN  construction  problem  in  terms  of 
sampling  constraints,  as  well  as  manual  and  automated 
techniques  for  triangulation.  They  also  described  the 
importance  of  topological  structure  in  selecting  points 
for  inclusion  in  the  TIN.  Over  the  years  a  variety  of 
techniques  for  manual  and  automatic  point  selection 
have been proposed and implemented.

Traditionally,  points  have  been  selected  from  stereo 
imagery  by  a  skilled  operator  [Carter  88].  Recently, 
work  has  been  done  on  automatic  TIN  generation  from 
other  terrain  models  already  rendered  in  digital  form. 
For  example,  Chen  and  Guevara  [Chen  and  Guevara 
87]  generate  a  TIN  from  raster  data,  and  Christensen 
[Christensen  87]  generates  a  TIN  from  contours  on  a 
digitized  topographic  map.  In  early  work, 
contemporaneous  with  Peucker,  Mark  [Mark  75] 
reported  that  a  TIN  of  approximately  14  times  fewer 
points could be constructed to achieve the same
accuracy  as  the  original  DEM.  This  result  was 
confirmed  by  Peucker,  and  appears  to  have  been 


generally  accepted. 

Triangulation  can  be  performed  manually,  but  Peucker 
[Peucker,  et.  al.  76]  noted  that  automatic  triangulation 
exhibits  comparable  performance.  Given  automatic 
triangulation,  TIN  generation  techniques  can  be 
broadly  divided  into  two  categories:  those  which  use 
manual  point  selection,  and  those  which  use  automatic 
point  selection.  We  survey  some  of  the  more  recent 
published  results. 

One  of  the  earliest  manual  point  selection  methods  is 
the one used in the ADAPT system. This system,
mentioned in Peucker [Peucker, et. al. 76], uses man-machine
interaction to aid the user in producing a TIN
which  accurately  represents  the  original  terrain.  The 
system  goes  through  an  iteration  process  in  which  it 
indicates  inadequate  parts  of  the  TIN,  which  the  human 
operator  then  improves. 

Christensen [Christensen 87] uses TIN’s to interpolate
between  digitized  contours.  The  input  contour  is 
represented  as  a  polygon,  with  the  next  contour,  if 
there  is  one,  indicating  an  area  or  areas  cut  out  of  the 
middle  of  the  present  contour.  A  medial  axis  of  this 
shape  is  computed,  and  then  a  Delaunay  triangulation 
is  performed  inside  the  shape,  using  the  vertices  of  the 
contour and medial axis as the points to be
triangulated. 

The work most closely related to ours is that by Chen
and  Guevara  [Chen  and  Guevara  87].  Their  VIP 
procedure  for  selecting  points  directly  from  the  DEM 
is  based  on  the  distance  a  point  is  from  the  4  lines 
connecting  its  neighbors.  This  would  seem  to  be  a 
good  solution  to  the  point  selection  problem;  all  points 
in  the  DEM  are  ordered  in  terms  of  importance,  which 
permits  selection  of  a  point  set  of  any  given  size.  The 
procedure  does  have  several  problems,  however,  all  of 
which  stem  from  the  local  nature  of  the  importance 
criterion.  First,  the  peak  of  a  small,  sharp  hill  will  be 
considered  more  significant  than  the  peak  of  one  which 
is  large,  yet  slopes  gently.  In  the  paper,  this  problem 
appears  to  have  been  circumvented  by  choosing 
enough  points  so  that  ridges  and  valleys  appear  clearly. 
Second,  the  VIP  procedure  chooses  nearly  all  points  on 
ridges  and  valleys.  This  is  desirable  in  the  sense  that 
these  are  important  features  to  capture.  However,  this 
is  clearly  unnecessary  in  those  places  where  the  ridge 
or  valley  follows  a  straight  line,  since  we  can  represent 
straight  lines  with  two  points.  Both  of  these  problems 
arise  because  only  the  8  neighboring  points  of  any 
point  are  considered.  In  fact,  for  good  point  selection, 
we  need  to  consider  topographic  features  on  a  more 
global  basis.  Yet,  the  more  points  we  consider,  the 
more  computationally  intensive  the  procedure 
becomes. This suggests that a less direct method for
point  selection  might  have  better  performance. 
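
To illustrate why a purely local criterion favors small, sharp features, here is a rough sketch of a VIP-style significance score. The text above specifies only "the distance a point is from the 4 lines connecting its neighbors"; taking the mean of the four perpendicular distances, the `spacing` parameter, and the function name are assumptions of this sketch, and border posts are not handled.

```python
import math


def vip_significance(dem, row, col, spacing=1.0):
    """Mean distance of the centre post from the 4 lines joining opposite neighbours."""
    z0 = dem[row][col]
    total = 0.0
    # Four directions through the 8-neighbourhood: E-W, N-S and the two diagonals.
    for dr, dc in ((0, 1), (1, 0), (1, 1), (1, -1)):
        z1 = dem[row - dr][col - dc]
        z2 = dem[row + dr][col + dc]
        half = spacing * math.hypot(dr, dc)   # horizontal distance to each neighbour
        offset = z0 - (z1 + z2) / 2.0         # vertical gap to the connecting line
        # Perpendicular distance from the centre post to the 3-D line through z1 and z2.
        total += abs(offset) * (2 * half) / math.hypot(2 * half, z2 - z1)
    return total / 4.0
```

With this score, a one-post bump of height 5 on flat ground outranks the summit of a broad dome 100 units high whose neighbors are only 1 or 2 units lower, which is the first problem noted above.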

Our  algorithm,  like  the  one  used  in  the  ADAPT 
system,  is  iterative,  choosing  points  in  stages  until  the 
overall quality is satisfactory. It differs from the
ADAPT  system  in  that  it  does  not  require  a  human 
operator  to  select  points:  like  the  VIP  algorithm,  it 
chooses  points  automatically.  But,  unlike  the  VIP 
algorithm, the significance of each point is not
considered independently of surrounding points. Since
we have structured the algorithm to provide a previous
approximation,  we  find  regions  in  which  there  are 
currently  errors  and  choose  a  few  points  to  correct 
each. 

2.  Iterative  TIN  Generation 

There  are  two  main  issues  for  TIN  generation  given  an 
existing  DEM:  point  selection  and  triangulation.  The 
point  selection  problem  can  be  stated  as:  "Given  some 
maximum  number  of  points,  which  should  be  chosen 
so  as  to  best  represent  the  terrain?"  For  triangulation 
the  issue  is:  "How  can  we  best  connect  a  given  set  of 
points  to  represent  the  terrain?"  Clearly  answers  to 
these  two  questions  are  interdependent.  This  section 
describes  our  iterative  point  selection  algorithm  and 
modifications  to  standard  planar  triangulation  in  some 
detail. 

Each  iteration  of  the  algorithm  adds  a  variable  number 
of  corrective  points  (selected  from  the  DEM)  in 
regions  where  an  error  term  has  a  local  maximum. 
Clearly,  points  in  these  areas  will  be  necessary  if  the 
TIN  is  to  be  improved.  The  algorithm  terminates  when 
there  no  longer  exist  any  error  regions  whose  size  is 
greater  than  a  minimum  area  threshold,  and  whose 
error  values  are  above  a  given  error  threshold  e. 
Varying  e  allows  loose  control  over  the  quality  of  the 
resulting TIN’s. The algorithm goes through the
following  sequence  of  steps: 

1.  Initialization. 

2.  Generate  the  difference  DEM  by  computing  the 
absolute  value  of  the  difference  between  the 
actual  DEM  and  the  interpolated  DEM 
approximation  at  each  point  in  the  DEM. 

3.  Select  contour  intervals  and  compute  error 
contours  from  the  difference  DEM. 

4.  Select  representative  points  based  upon  error 
contours. 

5.  Filter  points  based  upon  adjacency  criterion. 

6.  Add  points  to  the  set  of  selected  points  and 
reconstruct  the  TIN. 

7.  Interpolate  an  approximate  DEM  from  the  TIN. 

8.  Evaluate  the  quality  of  the  approximate  DEM 
with  respect  to  error  regions  and  point  selection 
budget. If the error threshold is exceeded, and points are
still available, return to Step 2.
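
Steps 7 and 2 above can be sketched in a few lines. The sketch assumes the TIN is given as a list of triangles with (x, y, z) vertices in grid coordinates (x is the column, y the row) and uses a brute-force search for the containing triangle; the function names and data layout are illustrative only.

```python
def interpolate_in_triangle(p, tri):
    """Linearly interpolate z at p = (x, y) inside tri; return None if p is outside."""
    (x1, y1, z1), (x2, y2, z2), (x3, y3, z3) = tri
    x, y = p
    det = (y2 - y3) * (x1 - x3) + (x3 - x2) * (y1 - y3)
    if det == 0:
        return None  # degenerate (zero-area) triangle
    # Barycentric weights of p with respect to the three vertices.
    w1 = ((y2 - y3) * (x - x3) + (x3 - x2) * (y - y3)) / det
    w2 = ((y3 - y1) * (x - x3) + (x1 - x3) * (y - y3)) / det
    w3 = 1.0 - w1 - w2
    if min(w1, w2, w3) < -1e-9:
        return None
    return w1 * z1 + w2 * z2 + w3 * z3


def difference_dem(dem, triangles):
    """Absolute difference between dem[row][col] and the TIN sampled on the same grid."""
    diff = []
    for row, line in enumerate(dem):
        out = []
        for col, z in enumerate(line):
            approx = None
            for tri in triangles:
                approx = interpolate_in_triangle((col, row), tri)
                if approx is not None:
                    break
            # Posts not covered by any triangle are given zero error in this sketch.
            out.append(abs(z - approx) if approx is not None else 0.0)
        diff.append(out)
    return diff
```

A production version would rasterize each triangle rather than test every post against every triangle.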


2.1.  Initialization 

We start by selecting the corner points of the DEM as
the initial set of points for the TIN; this will ensure that
the boundaries of the TIN and DEM will be identical,
enabling us to evaluate the error at each point in the
TIN. The 4 corner points are triangulated, as described
in Section 2.3, to produce an initial TIN. A DEM is
generated from the TIN by linear interpolation of each
triangle in the TIN, and the absolute value of the
difference between the interpolated TIN and the DEM
at each point becomes the initial difference image. We
then compute the histogram of the difference image,
and use it to find the maximum error δ, defined as the
highest point on the histogram which has more than
minarea pixels. We do not use the absolute maximum,
as we are willing to tolerate small areas of relatively
high error as long as the majority of the surface is well
represented.
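
One possible reading of this definition, sketched below, takes the maximum error to be the largest error level whose histogram bin holds more than minarea pixels. Unit-wide bins (errors rounded to integers) and the function name are assumptions of this sketch.

```python
from collections import Counter


def robust_max_error(diff, minarea):
    """Largest integer error level whose histogram bin has more than minarea pixels."""
    hist = Counter(int(round(v)) for row in diff for v in row)
    levels = [level for level, count in hist.items() if count > minarea]
    return max(levels) if levels else 0
```

An isolated spike in the difference image therefore does not set the error range; only error levels that cover enough pixels do.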

2.2.  Point  selection  and  filtering 

For  preliminary  point  selection  we  first  seek  points 
representative  of  the  error  regions  found  in  the 
difference  image,  for  these  identify  regions  in  the  DEM 
which  we  are  not  modeling  adequately.  We  determine 
the  areas  in  which  we  need  points  by  contouring  the 
difference image. The algorithm tries to generate 4
contours equally spaced in the error range, i.e. at δ/4,
δ/2, 3δ/4 and δ. In this way, we can correct local error
maxima as well as global error maxima. As δ gets
small,  the  position  and  number  of  contours  are  adjusted 
to  obtain  a  reasonable  contouring  which  does  not 
include  error  regions  with  errors  below  e,  and  which 
does  not  include  error  regions  of  negligible  size.  By 
doing  this,  we  ensure  that  our  error  regions  have  both  a 
sufficiently  large  error  and  coverage  to  warrant  further 
consideration.  We  meet  the  coverage  constraint  by 
limiting  the  minimum  distance  between  contours  to  2, 
and  we  meet  the  error  constraint  by  limiting  the  lowest 
contour  to  e-2.  If  we  were  to  limit  the  lowest  contour 
to  e,  we  would  inhibit  the  algorithm’s  ability  to  capture 
error  regions  which  have  more  structure  just  below  e. 
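
The level selection can be sketched as follows. The text does not spell out how the contours are adjusted when the error range becomes small; simply dropping levels that fall below e - 2 or that lie closer than 2 to the previous kept level is an assumption of this sketch, as are the function and parameter names.

```python
def contour_levels(delta, e, min_gap=2.0, floor_offset=2.0):
    """Candidate error-contour levels at delta/4, delta/2, 3*delta/4 and delta."""
    floor = e - floor_offset
    levels = []
    for k in (1, 2, 3, 4):
        level = k * delta / 4.0
        if level < floor:
            continue  # below the lowest contour allowed by the error constraint
        if levels and level - levels[-1] < min_gap:
            continue  # closer than the minimum spacing between contours
        levels.append(level)
    return levels
```

For a large error range all four levels survive; as the range shrinks toward e, only the upper levels remain.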

The  next  step  is  to  filter  the  output  of  the  contouring 
program  to  obtain  error  region  shapes.  These  polygons 
must  be  sufficiently  large  to  allow  a  meaningful  medial 
axis  to  be  obtained,  which  we  motivate  shortly.  The 
size  is  controlled  by  a  minimum  area  limitation, 
minarea,  for  the  resulting  contours.  Then  the  polygons 
which  pass  the  filter  will  be  those  which  cover  at  least 
minarea  pixels  and  which  do  not  contain  any  contour 
of  at  least  minarea  pixels. 

We  then  select  points  to  represent  these  error  polygons. 
One  could  simply  take  the  centroid  of  each  polygon, 
but  this  has  two  disadvantages:  the  centroid  will  not 
necessarily  be  in  the  error  polygon,  and  merely  using 
the  centroid  throws  away  information  we  have  about 
the  error  region  shape.  Instead,  a  medial  axis  is  found 
for  each  contour.  Before  computing  the  medial  axis, 
the  shape  of  the  contour  is  smoothed  by  dilating  and 
then  eroding  the  interior  of  the  contour  by  a  factor  of  3. 
This serves to remove bumps from the contour that
would  otherwise  cause  the  medial  axis  routine  to 
produce  a  spur  leading  out  to  the  bump.  The  medial 
axis  transform  is  then  applied  to  the  filtered  contour. 
Finally, the number of vertices in the medial axis is
reduced by a smoothing routine [Aviad 88] which
approximates the medial axis using a point at every
intersection, every corner more than 135 degrees sharp,
and every endpoint. The medial axis approximation
has fewer vertices, allowing us to constrain the number
of candidate points for the TIN, while still maintaining
the basic structure of the error polygon.

The  next  step  involves  the  selection  of  a  set  of  these 
candidate  points  for  addition  to  the  TIN.  These  points 
are  all  representative  of  errors  in  the  current 
approximation.  Some  of  these  points,  however,  may 
be  very  close  to  one  another  (consider  the  medial  axis 
of  an  almost-square  rectangle).  We  begin  by  using  all 
the  points  from  the  previous  approximation  in  the 
current  approximation.  This  serves  to  drive  the  process 
toward  convergence,  although  this  strategy  means  that 
errors  caused  by  a  poor  point  choice  are  difficult  to  fix. 
Next  we  add  points  from  the  candidate  set,  one  at  a 
time,  only  adding  a  point  if  it  is  not  too  close  to  a  point 
already  in  the  current  set  of  points.  We  judge  points  to 
be too close if they are within mindist pixels in both
horizontal (x and y) directions and within minheight
elevation units in the vertical (z) direction (experimentally,
e/2 seems to be a good value for minheight). If this
filtering  process  rejects  all  the  candidate  points,  then 
all  the  points  are  added  since  we  have  no  mechanism 
for  choosing  a  point  from  a  group  of  points  in  close 
proximity. 
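
The adjacency filter described above can be sketched as follows, assuming points are (x, y, z) tuples and that "within" is inclusive; the function names are illustrative.

```python
def too_close(p, q, mindist, minheight):
    """True if p and q are within mindist in both x and y and within minheight in z."""
    return (abs(p[0] - q[0]) <= mindist
            and abs(p[1] - q[1]) <= mindist
            and abs(p[2] - q[2]) <= minheight)


def filter_candidates(current, candidates, mindist, minheight):
    """Add candidates one at a time, skipping any that crowd an already selected point."""
    selected = list(current)      # all points of the previous approximation are kept
    added = []
    for cand in candidates:
        if not any(too_close(cand, p, mindist, minheight) for p in selected):
            selected.append(cand)
            added.append(cand)
    if candidates and not added:  # every candidate was rejected: add them all
        added = list(candidates)
        selected.extend(added)
    return selected, added
```

Because candidates are tested against points accepted earlier in the same pass, the result depends on the order in which candidates are visited.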

2.3.  Triangulation 

The  new  set  of  points  is  then  triangulated  to  produce  a 
new  TIN;  the  TIN  is  interpolated  to  produce  a  DEM; 
and  a  new  difference  image  is  produced  for  the  next 
iteration. Although the Delaunay triangulation is the
most popular for generating a TIN from a set of points,
we found a modification of the greedy triangulation to
be  the  most  suitable  for  our  purposes.  Both  the 
Delaunay  and  greedy  triangulations  are  approximations 
to  the  minimum  weight  triangulation  of  the  plane 
[Preparata  and  Shamos  85].  Because  they  seek  to 
solve  a  planar  problem,  they  have  deficiencies  when 
used  in  TIN  generation,  a  3  dimensional  problem.  One 
such  deficiency  is  the  phenomenon  of  contour 
crossing. 

Contour  crossing  occurs  when  a  triangulation  uses  a 
short  edge  that  cuts  through  a  ridge  or  bridges  a  valley 
instead  of  a  longer  edge  that  roughly  follows  the  ridge 
line. In the example shown in Figure 1, suppose that
the  triangulator  chooses  the  solid  lines  which  form  a 
diamond  around  the  ridge.  The  triangulator  must  then 
select  between  edges  A  and  B.  Edge  B,  which  runs 
across  the  ridge  line,  is  obviously  a  better 
approximation of the terrain, but a triangulator which
seeks  to  minimize  total  edge  length  will  always  choose 


edge  A.  To  avoid  this,  we  modify  the  weight  assigned 
to  candidate  edges.  Instead  of  simply  using  the  length 
of  each  edge,  we  add  a  measure  of  how  well  that  edge 
approximates  the  terrain.  This  measure  is  obtained  by 
summing  the  square  of  the  vertical  difference  between 
the  candidate  edge  and  the  real  terrain  at  each  point 
along  the  edge.  We  can  then  use  a  greedy  triangulation 
algorithm  to  try  to  minimize  this  modified  weight.  The 
greedy  triangulation  proceeds  as  before,  sorting  the 
edges  by  their  altered  weights  and  adding  edges  to  the 
triangulation  in  increasing  order  of  weight  if  they  do 
not  intersect  any  edge  already  added.  This  altered 
weight is not a distance, because it does not, in general,
obey the triangle inequality. This defect makes the Delaunay
triangulation inappropriate in this case.
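
A minimal sketch of the modified greedy triangulation follows. It assumes points at integer grid positions (x, y), a DEM indexed as dem[y][x], sampling at roughly unit steps with nearest-post lookup, and a plain sum of length and squared error; the text does not specify how the two terms are combined, so that choice and all names here are assumptions. The crossing test assumes points in general position (no collinear overlaps).

```python
import math
from itertools import combinations


def edge_weight(p, q, dem):
    """Edge length plus the summed squared vertical error sampled along the edge."""
    (x1, y1), (x2, y2) = p, q
    length = math.hypot(x2 - x1, y2 - y1)
    steps = max(1, int(round(length)))
    z1, z2 = dem[y1][x1], dem[y2][x2]
    err = 0.0
    for i in range(steps + 1):
        t = i / steps
        x, y = x1 + t * (x2 - x1), y1 + t * (y2 - y1)
        terrain = dem[int(round(y))][int(round(x))]   # nearest grid post
        err += (z1 + t * (z2 - z1) - terrain) ** 2
    return length + err


def _orient(a, b, c):
    """Twice the signed area of triangle a, b, c."""
    return (b[0] - a[0]) * (c[1] - a[1]) - (b[1] - a[1]) * (c[0] - a[0])


def crosses(e, f):
    """True if segments e and f intersect at a point interior to both."""
    (a, b), (c, d) = e, f
    if {a, b} & {c, d}:
        return False                                  # sharing an endpoint is allowed
    return (_orient(a, b, c) * _orient(a, b, d) < 0
            and _orient(c, d, a) * _orient(c, d, b) < 0)


def greedy_triangulation(points, weight):
    """Add candidate edges in increasing weight, skipping any that cross a kept edge."""
    kept = []
    for e in sorted(combinations(points, 2), key=lambda e: weight(*e)):
        if not any(crosses(e, f) for f in kept):
            kept.append(e)
    return kept
```

Passing plain Euclidean length as `weight` reproduces the standard greedy triangulation, which makes it easy to compare the two on a small ridge example.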

2.4.  Termination 

The  iteration  ends  when  we  no  longer  have  any 
contours  from  which  to  generate  points,  i.e.,  there  are 
no  longer  any  contours  which  are  at  least  minarea 
pixels  in  area.  The  termination  is  area-based  since  we 
only  wish  to  correct  regions  which  have  sufficiently 
large  coverage  of  the  terrain.  Additionally,  we 
terminate if δ falls below e, i.e., the maximum error in
our  TIN  is  lower  than  the  error  threshold. 

3.  Experimental  Results 

In  this  section  we  present  experimental  results  for  our 
iterative  TIN  generation  system  for  two  test  areas.  The 
terrain areas shown in Figures 2 and 14 each cover
a 15 minute quadrangle in Yellowstone National
Park.  The  data  is  from  the  Defense  Mapping  Agency’s 
(DMA)  DTED  database  (level  I)  which  consists  of  one 
elevation  post  every  3  arc  seconds.  This  corresponds 
to  approximately  a  100  foot  (30m)  spacing.  The 
DEM’s  are  shown  by  simply  displaying  each  one  as  an 
intensity  image.  The  image’s  dynamic  range  has  been 
linearly  reduced  to  8  bits  for  ease  of  manipulation  and 
display.  Bright  regions  represent  areas  of  high 
elevation;  dark  regions  represent  areas  of  low 
elevation.  The  top  of  each  image  is  oriented  to  the 
northern  edge  of  the  DEM. 

For each of the example TIN’s, the original source data
is  reduced  from  a  rectangular  grid  of  90,000  points  in  a 
DEM  to  approximately  a  700  point  TIN.  The  actual 
reduction  in  data  storage  size  is  not  quite  as  dramatic, 
since  each  point  no  longer  encodes  a  single  value 
(elevation);  each  point  now  stores  x,  y,  and  z 
coordinates  as  well  as  connectivity  information.  Using 
a  simple  three-dimensional  point  encoding  scheme 
which  was  not  optimized  for  TIN  storage,  our  first  TIN 
used only 19K, compared to the original 101K used by
the DEM. Other experiments have indicated better
reduction factors: in areas of the same size with less
topographic variation we have generated TIN’s with
approximately 100 points. Obviously, in areas that are
quite  flat,  the  algorithm  will  not  generate  more  points 
than  necessary  to  maintain  the  specified  error  budget. 
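
The figures quoted above can be turned into per-point costs with a few lines of arithmetic. The only assumption is that "K" means 1024 bytes.

```python
dem_points, tin_points = 90_000, 700
dem_kbytes, tin_kbytes = 101, 19

point_reduction = dem_points / tin_points             # ratio of point counts
storage_reduction = dem_kbytes / tin_kbytes           # ratio of storage sizes
bytes_per_dem_point = dem_kbytes * 1024 / dem_points  # roughly one byte per post
bytes_per_tin_point = tin_kbytes * 1024 / tin_points  # x, y, z plus connectivity
```

The point count drops by a factor of about 129 while storage drops by a factor of about 5.3, because each TIN point costs roughly 28 bytes against about 1 byte per DEM post.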
Figure 1: Contour Crossing


3.1.  Yellowstone  TED11144_7 

For  this  example,  the  desired  minimum  horizontal  (x 
and  y)  distance  between  points  (mindist)  was  set  to  50. 
The  minimum  vertical  (z)  distance  between  points 
(minheight) was set to 5 elevation units. The minimum
allowable  size  for  an  error  polygon  (minarea)  was  set 
to  50  pixels,  and  the  desired  maximum  error  at  a  point 
(e) was set to 10 elevation units.

The TIN is initialized to the corner points of the DEM.
We  then  interpolate  and  find  the  absolute  difference 
between  this  and  the  original  image.  The  result  is 
shown  in  Figure  3.  Here  the  lighter  areas  are  those  of 
higher  error.  We  contour  this  difference  image,  find 
maximal  contours,  perform  a  medial  axis  transform, 
select  points  on  the  approximated  medial  axis  which 
represent  the  error  polygons,  and  triangulate  again, 
producing  Figure  4. 

Even  with  the  first  correction,  some  of  the  structure 
begins  to  appear.  The  first  iteration  captured  the  rough 
relative  heights  of  each  quadrant  as  well  as  the  fact  that 
there  is  a  valley  running  down  the  middle  of  the 
terrain.  A  difference  image  shown  in  Figure  5  is 
generated  from  this  first  iteration,  in  which  the  areas 
missed  in  this  iteration  are  readily  identified.  For 
example, by examining the lower right corner of the
difference  image,  we  can  see  that  the  ridge  was  picked 
up  well,  but  we  notice  that  the  valley  enclosed  by  the 
ridge  needs  more  points,  which  we  obtain  on  the  next 
iteration  as  shown  in  Figure  6.  With  the  third  and 
succeeding  corrections,  the  picture  becomes 
increasingly  sophisticated.  Iteration  3,  shown  in  Figure 
7,  further  develops  the  central  river  valley,  and  begins 
to  draw  out  another  river  valley  in  the  southwest 
corner. This valley is refined on the next iteration
(Figure  8).  The  algorithm  then  slows  down  gradually 
because  the  points  chosen  to  represent  error  regions  are 
too  close  to  points  already  in  the  TIN.  This  reaches  an 
extreme  during  iteration  7,  where  only  one  point  is 
added.  In  the  next  iteration,  no  candidate  points  satisfy 
the  adjacency  criterion,  and  so  all  155  of  them  are 
added.  The  algorithm  continues  in  a  similar  manner, 
eventually  representing  all  maximal  contours  on 
iteration  12  as  shown  in  Figure  12. 


From Figure 13, the final difference image, it can be seen that all of the large structures have been captured.
Indeed,  looking  at  it,  one  is  hard  pressed  to  find  any 
structure  at  all.  What  remains  are  primarily  pieces  of 
river valleys that were too small to be considered
important, i.e., error regions below the minarea size
cutoff. 

In  terms  of  absolute  error,  the  worst  defects  are  to  be 
found along the edges. Since we start with the corners
in the image, boundaries are initially represented by a
smooth slope connecting the corners. This generally
poor  approximation  persists  until  a  point  is  added  at  the 
boundary,  but  this  does  not  generally  happen,  because 
the  medial  axis  tends  to  be  in  the  center  of  the  error 
regions.  Thus  the  area  adjacent  to  the  boundary  is 
represented  by  a  few  long,  thin  triangles  which  poorly 
approximate  the  terrain. 

3.2. Yellowstone TED11043_6

A  second  test  area  having  sharp  ridges  and  deep 
valleys  provides  an  especially  challenging  test  for  the 
triangulator.  It  is  difficult  to  keep  the  triangulator  from 
making  contour  crossing  errors,  which  would  chop 
holes  in  ridges  or  fill  in  valleys.  The  modifications  to 
the  greedy  triangulation  algorithm  discussed  in  Section 
2.3  reduce  the  magnitude  of  the  problem,  but 
occasionally  the  triangulator  has  no  choice  but  to  cross 
contour  lines.  This  example  shows  how  well  the 
iterative  algorithm  works  in  such  an  area. 

We use the same values for the control parameters as
in our previous example. The algorithm begins with
the four corner points, and quickly picks up major
terrain features. In Figure 15, the algorithm begins to
resolve some of the finer features, and here the
triangulator is forced to make a number of contour
crossing errors. Additional points are added to correct
for this in Figure 16, and the approximation becomes
more recognizable as the original DEM terrain image.
As the run continues, finer and finer details from the
original are represented in the approximation. By
Figure 18 the approximation is very close to the
original, but several more iterations are required to
meet the error bound and refine more subtle terrain
variations.


From  Figure  13,  the  final  difference  image,  it  can  be 
Figure 2: Original DEM, 300 x 300 Points


Figure  3:  Initial  Difference  DEM 


This area, TED11043_6, is about as complex as
TED11144_7 in terms of the number of points the
algorithm needs to approximate it within the error
budget. It is apparent, however, that it differs widely
from TED11144_7 in terms of overall appearance.
Whereas TED11144_7 had rounded and smooth ridges,
this example has sharp peaks and deep valleys. Each
required additional points for different reasons
corresponding to the differences in terrain. TED11144_7
needed points because the TIN approximation
produced a faceted appearance, so additional facets
were needed to represent curves. In this experiment,
facets were ideally suited to the jagged nature of the
terrain, but extra points were needed as hints to the
triangulator because of the steepness of the terrain.

3.3.  Numerical  Results 

In this section we consider the performance of the
algorithm numerically for each of the test cases.
Recall that the images were scaled to an 8-bit range, so
the statistics presented here are given in terms of
pixels. Tables 1 and 2 present performance statistics


Figure 6: Iteration 2 DEM, 159 vertices
Figure 7: Iteration 3 DEM, 286 vertices
Figure 8: Iteration 4 DEM, 354 vertices
Figure 9: Iteration 6 DEM, 381 vertices


for TED11144_7 and TED11043_6, respectively. In each
table, the number of points used in each iteration is
summarized and the RMS error is given. The internal
RMS error is an attempt to remove the edge effect
from the error measurement: this column gives the RMS
error excluding a 5% (15 pixel) border around the
edges.

Maximum error (δ) gives the highest numbered point
on the error histogram which has more than a given
number (minarea, which is in this case 50) of pixels.
Maximal contours gives the number of contours above
ε that are more than minarea pixels in area. The
algorithm will terminate either if the maximum error
drops below ε or if there are no maximal contours of
the requisite size left. In both cases, the algorithm
terminated because there were no more maximal
contours. This happens with increasing frequency as
the value of ε relative to the dynamic range of the
terrain decreases. The edge effect is quite visible in the
last few iterations of the algorithm, as evidenced by the
increase in the difference between the RMS error and
the internal RMS error, the latter of which does not
include a 15-pixel border around the edges.

Figure 10: Iteration 8 DEM, 537 vertices
Figure 11: Iteration 10 DEM, 643 vertices
Figure 12: Iteration 12 DEM, 686 vertices
Figure 13: Final Difference DEM

Triangulation

Another way to visualize the results of the TIN
network is to display the triangulations of the selected
points. Figures 22 and 23 are the networks generated
for TED11144_7 during the second and twelfth iterations.
They illustrate the clustering of points in areas where
the underlying terrain exhibits high variability in
elevation, as is expected. The edge effect is visible,
mainly along the lower and right edges. It also shows,
to some degree, the success that the triangulator has in
generating edges which follow the terrain. Such terrain
following causes the narrow triangles which can be
seen along the rivers.
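The error statistics described in Section 3.3 can be sketched in a few lines. This is only an illustration under stated assumptions: the DEM and its TIN-interpolated approximation are taken to be equal-sized lists of rows, errors are rounded to whole grey levels before the histogram is built, and the function names `rms_error` and `maximum_error` are hypothetical rather than taken from the paper.

```python
import math


def rms_error(dem, approx, border=0):
    """RMS of (dem - approx), skipping `border` pixels on every side.

    border=0 gives the global RMS error; border=15 on a 300 x 300 image
    corresponds to the internal RMS error described in the text.
    """
    rows, cols = len(dem), len(dem[0])
    total, count = 0.0, 0
    for r in range(border, rows - border):
        for c in range(border, cols - border):
            diff = dem[r][c] - approx[r][c]
            total += diff * diff
            count += 1
    return math.sqrt(total / count)


def maximum_error(dem, approx, minarea):
    """Largest absolute error level held by more than `minarea` pixels."""
    histogram = {}
    for dem_row, approx_row in zip(dem, approx):
        for d, a in zip(dem_row, approx_row):
            level = abs(int(round(d - a)))  # assumption: integer error bins
            histogram[level] = histogram.get(level, 0) + 1
    qualifying = [level for level, n in histogram.items() if n > minarea]
    return max(qualifying) if qualifying else 0
```

Comparing `rms_error(dem, approx)` with `rms_error(dem, approx, border=15)` isolates the contribution of the edge effect, which is the comparison the tables make between the global and internal RMS columns.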
Figure 14: Original DEM, 300 x 300 Points
Figure 15: Iteration 2 DEM, 172 vertices
Figure 16: Iteration 3 DEM, 309 vertices
Figure 17: Iteration 4 DEM, 388 vertices


4.  Future  Work 

Iterative  TIN  construction  is  an  area  that  is  rich  in 
possibilities  for  future  work.  Interesting  research  issues 
include  how  to  integrate  specific  topographic  features 
as  constraints;  the  tiling  of  individual  TIN's  to  provide 
a  larger  area  of  coverage;  and  the  use  of  cartographic 
feature  data  to  improve  the  correlation  between  terrain 
and  object  data. 

In our technique, we select points solely based upon
our contour error model. It is clear that analysis of the
topography can yield points such as peaks, valleys, and
ridges that can be used to augment the current selection
process. How to select a subset of these points,
without greatly increasing the number of points in the
TIN, and while continuing to minimize the overall
error, raises an interesting set of selection and
optimization problems.

The TIN's generated in this paper cannot be connected
simply by placing them side-by-side because of the
edge effects. Simply, there are no guarantees that
adjacent terrain patches will not have severe
discontinuities at the borders. Instead, some way of
meshing TIN's generated independently, such as
re-triangulating along the edges, is needed. The
generation of hierarchical TIN's using an iterative
method is another interesting area for research. Such
representations allow the resolution of the TIN to vary
by nesting an additional higher resolution TIN within a
TIN at the previous level of the hierarchy. Both of
these issues arise from the need to support terrain
models for large areas of coverage and with variable
levels-of-detail.

Figure 18: Iteration 5 DEM, 418 vertices
Figure 19: Iteration 7 DEM, 597 vertices
Figure 20: Iteration 9 DEM, 689 vertices
Figure 21: Iteration 10 DEM, 713 vertices

Finally, the use of digital spatial databases containing
road networks, drainage networks, and specific
man-made objects in conjunction with DEM's or TIN's
raises the issue of correlation between the spatial
location of features and the underlying terrain. A
constraint that could be applied during point selection
would be to maintain coherency between the location
of spatial database objects and the TIN. For example,
one would expect that streams would continue to flow
Performance Results for TED11144_7

Rows are listed in iteration order; the Iteration and TIN Points values are not available.

Global RMS Error   Internal RMS Error   Maximum Error   Maximal Contours
42.9111            42.9195              113             8
31.8709            31.5403              84              28
14.9189            15.0460              44              59
8.11141            7.65872              29              74
7.04317            6.59736              28              64
6.65025            6.36294              27              59
6.54591            6.26490              27              57
6.54332            6.26490              27              56
5.37215            5.00352              22              41
5.25415            4.86895              22              35
4.76598            4.41236              20              17
4.68777            4.31051              20              14
4.47456            4.11826              19              0

Table 1: Numerical Accuracy TED11144_7


Performance Results for TED11043_6

Rows are listed in iteration order; the Iteration and TIN Points values are not available.

Global RMS Error   Internal RMS Error   Maximum Error   Maximal Contours
66.0866            67.2875              132             15
23.1250            21.0115              75              30
12.2026            11.8442              43              57
8.14550            7.65331              33              73
7.17465            6.60838              28              69
6.62902            6.15420              27              61
6.50469            5.99604              27              54
5.40780            5.14117              21              32
5.39088            5.11934              20              30
4.99839            4.72627              18              8
4.87780            4.57424              18              0

Table 2: Numerical Accuracy TED11043_6


Figure 22: Iteration 2 Triangulation

important  for  computer  image  generation,  but  also 
emerge  in  simulation  and  autonomous  navigation  using 
digital  maps. 

5.  Conclusions 

We  described  a  new  technique  to  produce  a 
triangulated  irregular  network  (TIN)  from  a  digital 
elevation  model  (DEM).  Our  overall  goal  is  to 
produce  an  approximate  terrain  description  that 
preserves  the  major  topographic  features  using  a 
greatly  reduced  set  of  points  selected  from  the  original 
DEM.  The  TIN  generation  process  is  iterative;  at  each 
iteration  we  identify  areas  in  the  DEM  that  lie  outside 
of  a  user-supplied  error  tolerance  in  the  TIN,  and 
attempt  to  choose  points  in  the  DEM  to  more 
accurately  model  these  areas.  Point  selection  involves 
the  computation  of  the  difference  between  the  actual 
DEM  and  an  approximate  DEM.  This  approximate 
DEM  is  calculated  by  interpolating  elevation  points 
from  the  TIN. 

The  iterative  nature  of  the  algorithm  permits  users  to 
terminate  the  terrain  approximation  algorithm  based  on 
operational  criteria,  such  as  median  error,  maximal 
error,  or  number  of  points  used  to  construct  the  TIN. 
This  is  particularly  relevant  to  real-time  computer 
image  generation  as  various  scene  rendering  systems 
utilizing  polygonal  terrain  patches  have  well  defined 
limitations in order to maintain real-time performance.
It  also  implies  that  the  TIN  generation  procedure  is 
automatically  sensitive  to  smooth  or  rough  terrain  in 
that  it  selects  only  enough  points  to  satisfy  the  required 
error  constraints  and  tends  to  place  points  in  those 
areas  having  the  greatest  topographic  complexity. 


Figure  23:  Final  Triangulation 
Abstract

A computational vision approach is presented for the estimation of
2-D translation, rotation, and scale from two partially overlapping
images. The approach results in a fast and novel method that produces
excellent results even when large rotation and scale changes have
occurred between the two frames and the images are devoid of
significant features. In our method an illuminant direction estimation
method is first used to obtain an initial estimate of camera rotation.
A small number of feature points are then located based on a Gabor
wavelet model for detecting local curvature discontinuities. An
initial estimate of scale and translation is obtained by pairwise
matching of the feature points detected from both frames. Finally,
hierarchical feature matching is performed to obtain an accurate
estimate of translation, rotation and scale. Experiments with
synthetic and real images show that this algorithm yields accurate
results when the scales of the two images differ by up to 10%, the
overlap between the two frames is as small as 35%, and the camera
rotation between the two frames is significant. Applications of the
method to texture and stereo image registration, satellite image
mosaicking, and moving object detection are presented.

1  Introduction 

Automatic image registration is an important problem in computer
vision and image processing. Traditional solutions [1, 5, 10, 12, 13,
18] to this problem are unreliable when the rotation of the camera and
the scale change between the two frames are significant. Registration
becomes even more difficult if the images are devoid of significant
features and/or the overlap between the two frames is small. In this
paper we present a computational vision approach for the estimation of
2-D translation, rotation, and scale from two partially overlapped
images. Figure 1 shows the block diagram of our camera motion
estimation algorithm. We notice that the illuminant direction is
inherent to the environment and is invariant to camera motion. The
difference in illuminant direction between two successive frames is
therefore equal to the amount of camera rotation. By estimating the
illuminant direction in each frame, we estimate the rotation between
the two frames and simplify the matching process. To estimate the
illuminant azimuth, we use a local voting estimator reported in [20].
Since the common area between the two frames can be much smaller than
the image field, and additionally there is scaling between the two
frames, methods based on correlation matching become unreliable. In
this work we use a feature based matching technique. First we extract
a small number of feature points based on a Gabor wavelet model for
local curvature analysis. In doing this we compute a local energy
measure defined as the interaction of the image convolved with basic
wavelet functions at different scales, and then detect the locations
of local maxima of this energy map as the feature points. The effect
of local inhibition of nearby feature points is also considered. Since
no prior knowledge about the translation is available, an initial
estimate of scale and translation is obtained by pairwise matching
between the neighbors of feature points detected from both frames.
Subsequently, hierarchical correlation matching is performed to obtain
an accurate camera motion estimate. Using this algorithm we have
obtained impressive results for several applications including stereo
and texture image registration, satellite image mosaicking, and moving
object detection.

The paper is organized as follows. Section 2 discusses the basic steps
used in our matching algorithm: camera rotation estimation, feature
extraction, the matching criterion, scale and translation estimation,
and the hierarchical implementation. Section 3 presents the matching
algorithm. Section 4 presents experimental results on texture image
registration, stereo matching, satellite image mosaicking, and moving
object detection. The work is summarized in Section 5.

2  Basic  Steps 

Before  a  discussion  of  our  algorithm,  some  definitions 
and  basic  steps  used  should  be  addressed. 
Figure 1: Block diagram of image registration algorithm


Initial Estimation of Camera Rotation
As shown in Figure 1, the initial estimate of camera rotation is
computed as the difference of the illuminant azimuth angles estimated
from the two frames. To estimate the illuminant direction from the
image we use the local voting illuminant azimuth estimator discussed
in [20]. To be more specific, for each pixel we first compute a local
estimate $x$ by solving

\[
x = (B^{T} B)^{-1} B^{T}\, dI
\]

where

\[
dI = \begin{pmatrix} \delta I_1 \\ \delta I_2 \\ \vdots \\ \delta I_N \end{pmatrix},
\qquad
B = \begin{pmatrix}
\delta x_1 & \delta y_1 \\
\delta x_2 & \delta y_2 \\
\vdots & \vdots \\
\delta x_N & \delta y_N
\end{pmatrix}
\]

and $N$ is the number of measured directions for s.

The estimate of illuminant azimuth is then computed as described in
[20].

After the illuminant azimuths $\tau_i$, $i = 1, 2$, are computed, the
initial estimate of camera rotation is computed as

\[
\theta_0 = \tau_1 - \tau_2 \tag{1}
\]

Feature Detection
Feature detection is an important step in our algorithm. In this work,
we use a method which is based on a biologically motivated model for
identifying local curvature discontinuities, and makes use of a Gabor
wavelet decomposition of the image and local scale interactions
between features [15]. The basic wavelet function used in our
decomposition is of the form

=  (2) 

\[
x' = x \cos\theta + y \sin\theta, \qquad
y' = -x \sin\theta + y \cos\theta
\]

where $\theta$ is the preferred spatial orientation and $\lambda$ is
the aspect ratio of the Gaussian. For convenience, we will drop the
subscripts in further discussion. In all our experiments, $\lambda$ is
set to 1, and $\theta$ is discretized into four orientations.

The corresponding wavelet transformation is obtained by convolving the
image data $f$ with a bank of filters whose responses are simple
dilations and translations of the basic wavelet $g$ in (2), and are
denoted by

\[
W_i(x, y, \theta) = f * g(\alpha^{i} x, \alpha^{i} y, \theta),
\qquad i \in \{0, 1, 2, \ldots\} \tag{3}
\]

Here $\alpha$ denotes the scale parameter. Usually the parameter
values used are those corresponding to half-octave
($\alpha = \sqrt{2}$) or octave ($\alpha = 2$) spacing. Physically,
this transformation detects features in the image such as line and
step edges. Biologically, this models the processing by simple cells
in the visual cortex of mammals. These features by themselves are not
good for applications such as obtaining correspondence. The next stage
in our feature detection module involves interactions between these
simple features (at different spatial frequencies, within each
orientation). This step can be identified with the responses of
hypercomplex cells in the visual cortex. Hypercomplex cells exhibit
end-inhibition. They are sensitive to oriented lines and step edges of
short lengths, and their response decreases if the lengths are
increased. Using scale interactions to model these cells was first
suggested by Hubel and Wiesel [11], who were the first to discover
these cells in the visual cortex. Subsequent anatomical studies have
also supported this model [2]. We model these interactions as follows:

\[
I_{m,n}(x, y) = \max_{\theta}\, g\bigl( \lVert W_m(x, y, \theta) - \gamma\, W_n(x, y, \theta) \rVert \bigr) \tag{4}
\]

where $g$ is a non-linear transformation, such as thresholding or a
sigmoid non-linearity, $\gamma$ is a normalizing factor, and $n > m$.
In order to identify features at different scales, one has to consider
different scale interactions. The final step is to actually localize
these features, and this is done by looking at the local maxima of
these feature responses.


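Image Transformation
The image transform is implemented by bilinear interpolation from the
original image. A 2-D affine transform with parameters
$(s, \theta, \Delta X, \Delta Y)$ is defined as

\[
\begin{pmatrix} X' \\ Y' \end{pmatrix}
= s \begin{pmatrix} \cos\theta & \sin\theta \\ -\sin\theta & \cos\theta \end{pmatrix}
\begin{pmatrix} X \\ Y \end{pmatrix}
+ \begin{pmatrix} \Delta X \\ \Delta Y \end{pmatrix} \tag{5}
\]

For a grid point $(X', Y')$ on the transformed frame, its location on
the original frame is given by

\[
\begin{pmatrix} X \\ Y \end{pmatrix}
= \frac{1}{s} \begin{pmatrix} \cos\theta & -\sin\theta \\ \sin\theta & \cos\theta \end{pmatrix}
\begin{pmatrix} X' - \Delta X \\ Y' - \Delta Y \end{pmatrix} \tag{6}
\]

The four nearest grid points of $(X, Y)$ are
$(\underline{X}, \underline{Y})$, $(\underline{X}, \overline{Y})$,
$(\overline{X}, \underline{Y})$, and $(\overline{X}, \overline{Y})$, with

\[
\underline{X} = \mathrm{int}(X), \quad
\underline{Y} = \mathrm{int}(Y), \quad
\overline{X} = \underline{X} + 1, \quad
\overline{Y} = \underline{Y} + 1
\]

where int( ) is the floor function defined as the largest integer not
greater than the variable. The image value at $(X', Y')$ is computed
by a bilinear interpolation* as

\[
f'(X', Y') = f(X, Y)
= \bar\epsilon_x \bar\epsilon_y f(\underline{X}, \underline{Y})
+ \bar\epsilon_x \epsilon_y f(\underline{X}, \overline{Y})
+ \epsilon_x \bar\epsilon_y f(\overline{X}, \underline{Y})
+ \epsilon_x \epsilon_y f(\overline{X}, \overline{Y}) \tag{7}
\]

where

\[
\epsilon_x = X - \underline{X}, \quad
\bar\epsilon_x = \overline{X} - X, \quad
\epsilon_y = Y - \underline{Y}, \quad
\bar\epsilon_y = \overline{Y} - Y
\]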

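Matching Criterion
The goodness of fit between pixels $f_1(m, n)$ in frame $t_1$ and
$f_2(u, v)$ in frame $t_2$ is measured by their mutual correlation
coefficient

\[
\psi_{f_1 f_2}(m, n; u, v) = \frac{1}{\sigma_1 \sigma_2 (2 w_m + 1)^2}
\sum_{i,j=-w_m}^{w_m} \bigl(f_1(m+i, n+j) - \mu_1\bigr)\bigl(f_2(u+i, v+j) - \mu_2\bigr) \tag{8}
\]

where

\[
\mu_1 = \frac{1}{(2 w_m + 1)^2} \sum_{i,j=-w_m}^{w_m} f_1(m+i, n+j), \qquad
\mu_2 = \frac{1}{(2 w_m + 1)^2} \sum_{i,j=-w_m}^{w_m} f_2(u+i, v+j)
\]

\[
\sigma_1 = \sqrt{\frac{1}{(2 w_m + 1)^2} \sum_{i,j=-w_m}^{w_m} \bigl(f_1(m+i, n+j) - \mu_1\bigr)^2}, \qquad
\sigma_2 = \sqrt{\frac{1}{(2 w_m + 1)^2} \sum_{i,j=-w_m}^{w_m} \bigl(f_2(u+i, v+j) - \mu_2\bigr)^2}
\]

and $(2 w_m + 1)^2$ is the area of the matching window.

* Here we assume that (X, Y) is an inner point. For matching purposes,
boundary and out of the image frame pixels can simply be ignored.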
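A window correlation of the form in (8) might be computed as in the sketch below. It assumes images stored as lists of rows indexed `f[m][n]`, windows that lie entirely inside both images, and a return value of 0.0 for a flat window (where the coefficient is undefined); that last choice and the function name `correlation` are assumptions made for illustration.

```python
import math


def correlation(f1, f2, m, n, u, v, w):
    """Correlation coefficient of the (2w+1) x (2w+1) windows centred at
    f1[m][n] and f2[u][v]."""
    offsets = [(i, j) for i in range(-w, w + 1) for j in range(-w, w + 1)]
    a = [f1[m + i][n + j] for i, j in offsets]
    b = [f2[u + i][v + j] for i, j in offsets]
    area = len(offsets)  # (2w + 1)^2, the area of the matching window
    mu1, mu2 = sum(a) / area, sum(b) / area
    sigma1 = math.sqrt(sum((x - mu1) ** 2 for x in a) / area)
    sigma2 = math.sqrt(sum((x - mu2) ** 2 for x in b) / area)
    if sigma1 == 0 or sigma2 == 0:
        return 0.0  # assumption: a flat window is treated as "no match"
    covariance = sum((x - mu1) * (y - mu2) for x, y in zip(a, b)) / area
    return covariance / (sigma1 * sigma2)
```

Because the means are subtracted and the result is divided by both standard deviations, the score is unchanged by a gain or offset in intensity between the two frames, which matters for image pairs with different overall brightness.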

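Initial Matching
Initial matching is determined by the best pairwise fit between
feature points detected from different frames. Assume that the feature
points detected from frame $t_1$ are
$\{f_1(X_{1i}, Y_{1i}),\ i = 1, \ldots, N_1\}$ and the feature points
detected from frame $t_2$ are
$\{f_2(X_{2i}, Y_{2i}),\ i = 1, \ldots, N_2\}$. Then the feature point
$f_1(X_{1i}, Y_{1i})$ is matched to $f_2(\hat X_{1i}, \hat Y_{1i})$ of
frame $t_2$ if

\[
\psi_{f_1 f_2}(X_{1i}, Y_{1i}; \hat X_{1i}, \hat Y_{1i}) =
\max_{\substack{1 \le j \le N_2 \\ |u - X_{2j}| \le w_s \\ |v - Y_{2j}| \le w_s}}
\psi_{f_1 f_2}(X_{1i}, Y_{1i}; u, v)
\]

where $w_s$ is the search window parameter, with $(2 w_s + 1)^2$ being
the size of the search area. Similarly, the feature point
$f_2(X_{2i}, Y_{2i})$ is matched to $f_1(\hat X_{2i}, \hat Y_{2i})$ of
frame $t_1$ if

\[
\psi_{f_1 f_2}(\hat X_{2i}, \hat Y_{2i}; X_{2i}, Y_{2i}) =
\max_{\substack{1 \le j \le N_1 \\ |m - X_{1j}| \le w_s \\ |n - Y_{1j}| \le w_s}}
\psi_{f_1 f_2}(m, n; X_{2i}, Y_{2i})
\]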

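Scale Estimation
Since the Euclidean distance between the feature points depends only
on the scale between the two frames, and is invariant to rotation and
translation, the scale factor can be estimated prior to the estimation
of the other parameters. Assume that the matched feature point pairs
are $\{(X'_i, Y'_i) \leftrightarrow (X_i, Y_i),\ i = 1, \ldots, N'\}$,
with $N'$ being the number of matched feature pairs. Then

\[
(d_1\ d_2\ \cdots\ d_{N'}) = s'\,(d'_1\ d'_2\ \cdots\ d'_{N'}) \tag{9}
\]

where

\[
d'_i = \sqrt{(X'_i - \bar X')^2 + (Y'_i - \bar Y')^2}, \qquad
d_i = \sqrt{(X_i - \bar X)^2 + (Y_i - \bar Y)^2}
\]

\[
\bar X' = \frac{1}{N'} \sum_{i=1}^{N'} X'_i, \quad
\bar Y' = \frac{1}{N'} \sum_{i=1}^{N'} Y'_i, \quad
\bar X = \frac{1}{N'} \sum_{i=1}^{N'} X_i, \quad
\bar Y = \frac{1}{N'} \sum_{i=1}^{N'} Y_i
\]

The scale factor from frame $t_1$ to $t_2$ is computed as the
least-squares solution of (9),

\[
s' = \frac{d \cdot d'}{d' \cdot d'} \tag{10}
\]

with $d = (d_1, \ldots, d_{N'})$ and $d' = (d'_1, \ldots, d'_{N'})$.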
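The scale estimate can be illustrated with a short sketch. It assumes the matched points are given as two equal-length lists of `(x, y)` tuples in corresponding order, and it uses the least-squares ratio of centroid distances as the reading of (9) and (10); the function names are hypothetical, and no guard is included for the degenerate case where all frame-1 points coincide.

```python
import math


def centroid_distances(points):
    """Distance of every point from the centroid of the point set."""
    count = len(points)
    cx = sum(x for x, _ in points) / count
    cy = sum(y for _, y in points) / count
    return [math.hypot(x - cx, y - cy) for x, y in points]


def estimate_scale(points_t1, points_t2):
    """Least-squares scale s' such that d (frame 2) ~ s' * d' (frame 1)."""
    d1 = centroid_distances(points_t1)  # d'_i, from the frame t1 points
    d2 = centroid_distances(points_t2)  # d_i, from the matched frame t2 points
    numerator = sum(a * b for a, b in zip(d2, d1))
    denominator = sum(a * a for a in d1)
    return numerator / denominator
```

Since centroid distances are unaffected by rotation and translation, the scale can be recovered without knowing either, which is why this step comes first.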


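Rotation and Translation Estimation
With the scale factor determined, the rotation and translation between
the transformed frame $t_1$ and frame $t_2$ can be computed as
follows. Assuming that the matched feature point pairs are
$\{(X'_i, Y'_i) \leftrightarrow (X_i, Y_i),\ i = 1, \ldots, N'\}$, the
relation between the matched feature pairs is

\[
\begin{pmatrix} X_i \\ Y_i \end{pmatrix}
= s' \begin{pmatrix} \cos\theta' & \sin\theta' \\ -\sin\theta' & \cos\theta' \end{pmatrix}
\begin{pmatrix} X'_i \\ Y'_i \end{pmatrix}
+ \begin{pmatrix} \Delta X' \\ \Delta Y' \end{pmatrix} \tag{11}
\]

Note that $\theta'$, being the residual of the initial rotation
estimate, is very small. By approximating $\cos\theta'$ and
$\sin\theta'$ up to linear terms, (11) can be rewritten as

\[
\begin{pmatrix} X_i \\ Y_i \end{pmatrix}
= s' \begin{pmatrix} 1 & \theta' \\ -\theta' & 1 \end{pmatrix}
\begin{pmatrix} X'_i \\ Y'_i \end{pmatrix}
+ \begin{pmatrix} \Delta X' \\ \Delta Y' \end{pmatrix} \tag{12}
\]

or

\[
A = B C \tag{13}
\]

where

\[
A = \begin{pmatrix}
X_1 - s' X'_1 \\ Y_1 - s' Y'_1 \\ \vdots \\ X_{N'} - s' X'_{N'} \\ Y_{N'} - s' Y'_{N'}
\end{pmatrix}, \qquad
B = \begin{pmatrix}
s' Y'_1 & 1 & 0 \\
-s' X'_1 & 0 & 1 \\
\vdots & \vdots & \vdots \\
s' Y'_{N'} & 1 & 0 \\
-s' X'_{N'} & 0 & 1
\end{pmatrix}, \qquad
C = \begin{pmatrix} \theta' \\ \Delta X' \\ \Delta Y' \end{pmatrix}
\]

The vector $C$ can then be computed as

\[
C = (B^{T} B)^{-1} B^{T} A. \tag{14}
\]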
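The least-squares step in (14) amounts to solving a 3 x 3 system of normal equations. The sketch below is one way to do that in plain Python, under the assumption that the unknowns are ordered as (rotation, X translation, Y translation) and that the rotation is small enough for the linearization in (12) to hold. The helper names `solve_linear` and `estimate_rotation_translation` are hypothetical.

```python
def solve_linear(matrix, rhs):
    """Solve matrix * x = rhs by Gaussian elimination with partial pivoting."""
    size = len(rhs)
    aug = [list(matrix[i]) + [rhs[i]] for i in range(size)]
    for col in range(size):
        pivot = max(range(col, size), key=lambda r: abs(aug[r][col]))
        aug[col], aug[pivot] = aug[pivot], aug[col]
        for r in range(col + 1, size):
            factor = aug[r][col] / aug[col][col]
            for c in range(col, size + 1):
                aug[r][c] -= factor * aug[col][c]
    x = [0.0] * size
    for r in range(size - 1, -1, -1):
        tail = sum(aug[r][c] * x[c] for c in range(r + 1, size))
        x[r] = (aug[r][size] - tail) / aug[r][r]
    return x


def estimate_rotation_translation(points_t1, points_t2, s):
    """Return (theta, dx, dy) from matched points, given the scale s."""
    B, A = [], []
    for (xp, yp), (x, y) in zip(points_t1, points_t2):
        B.append([s * yp, 1.0, 0.0])   # row for the X equation
        A.append(x - s * xp)
        B.append([-s * xp, 0.0, 1.0])  # row for the Y equation
        A.append(y - s * yp)
    # Normal equations: (B^T B) C = B^T A.
    BtB = [[sum(row[i] * row[j] for row in B) for j in range(3)] for i in range(3)]
    BtA = [sum(row[i] * a for row, a in zip(B, A)) for i in range(3)]
    return tuple(solve_linear(BtB, BtA))
```

At least two matched pairs at distinct locations are needed for the system to be solvable; with more pairs the solution averages out individual matching errors.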

Matching Refinement
After initial matching is accomplished, the matching is refined on
every resolution layer of the matching pyramid. To do this, frame
$t_1$ and its feature points are first transformed using the estimated
parameters. The transformed feature points are then truncated to the
nearest grid location. Let the transformed feature point be
$f'_1(X'_{1i}, Y'_{1i})$; it is matched to
$f_2(\hat X_{1i}, \hat Y_{1i})$ on frame $t_2$ if

\[
\psi_{f'_1 f_2}(X'_{1i}, Y'_{1i}; \hat X_{1i}, \hat Y_{1i}) =
\max_{\substack{|m| \le w_s \\ |n| \le w_s}}
\psi_{f'_1 f_2}(X'_{1i}, Y'_{1i}; X'_{1i} + m, Y'_{1i} + n) \tag{15}
\]

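Estimation Refinement
In our method, at each level of the matching hierarchy, the feature
points from frame $t_1$, $\{f_1(X_{1i}, Y_{1i}),\ i = 1, \ldots, N_1\}$,
are first transformed using (5) to get
$\{f'_1(X'_{1i}, Y'_{1i}),\ i = 1, \ldots, N_1\}$. The transformed
feature points are then matched to frame $t_2$ and the correction to
the initial estimate, $(s', \theta', \Delta X', \Delta Y')$, is
computed using (10) and (14). After that, the total transformation is
obtained by combining the two transformations:

\[
\begin{aligned}
\begin{pmatrix} X'' \\ Y'' \end{pmatrix}
&= s' \begin{pmatrix} \cos\theta' & \sin\theta' \\ -\sin\theta' & \cos\theta' \end{pmatrix}
\begin{pmatrix} X' \\ Y' \end{pmatrix}
+ \begin{pmatrix} \Delta X' \\ \Delta Y' \end{pmatrix} \\
&= s s' \begin{pmatrix} \cos\theta' & \sin\theta' \\ -\sin\theta' & \cos\theta' \end{pmatrix}
\begin{pmatrix} \cos\theta & \sin\theta \\ -\sin\theta & \cos\theta \end{pmatrix}
\begin{pmatrix} X \\ Y \end{pmatrix}
+ s' \begin{pmatrix} \cos\theta' & \sin\theta' \\ -\sin\theta' & \cos\theta' \end{pmatrix}
\begin{pmatrix} \Delta X \\ \Delta Y \end{pmatrix}
+ \begin{pmatrix} \Delta X' \\ \Delta Y' \end{pmatrix} \\
&= s s' \begin{pmatrix} \cos(\theta + \theta') & \sin(\theta + \theta') \\ -\sin(\theta + \theta') & \cos(\theta + \theta') \end{pmatrix}
\begin{pmatrix} X \\ Y \end{pmatrix}
+ \begin{pmatrix}
s'(\cos\theta' \Delta X + \sin\theta' \Delta Y) + \Delta X' \\
s'(-\sin\theta' \Delta X + \cos\theta' \Delta Y) + \Delta Y'
\end{pmatrix}
\end{aligned} \tag{16}
\]

So the estimates are updated using

\[
\begin{pmatrix} s s' \\ \theta + \theta' \end{pmatrix} \Rightarrow \begin{pmatrix} s \\ \theta \end{pmatrix},
\qquad
\begin{pmatrix}
s'(\cos\theta' \Delta X + \sin\theta' \Delta Y) + \Delta X' \\
s'(-\sin\theta' \Delta X + \cos\theta' \Delta Y) + \Delta Y'
\end{pmatrix}
\Rightarrow \begin{pmatrix} \Delta X \\ \Delta Y \end{pmatrix} \tag{17}
\]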
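The update rule can be checked numerically: composing the current estimate with a correction should give the same result as applying the two transforms one after the other. The sketch below assumes parameters are stored as `(s, theta, dx, dy)` tuples with `theta` in radians; the function names `compose` and `apply_transform` are hypothetical.

```python
import math


def apply_transform(params, point):
    """Apply the transform of the form (5) to a single (x, y) point."""
    s, theta, dx, dy = params
    x, y = point
    cos_t, sin_t = math.cos(theta), math.sin(theta)
    return (s * (cos_t * x + sin_t * y) + dx,
            s * (-sin_t * x + cos_t * y) + dy)


def compose(params, correction):
    """Fold a correction (s', theta', dx', dy') into the current estimate."""
    s, theta, dx, dy = params
    s_c, theta_c, dx_c, dy_c = correction
    cos_c, sin_c = math.cos(theta_c), math.sin(theta_c)
    return (
        s * s_c,                                  # scales multiply
        theta + theta_c,                          # rotations add
        s_c * (cos_c * dx + sin_c * dy) + dx_c,   # old translation is rotated
        s_c * (-sin_c * dx + cos_c * dy) + dy_c,  # and scaled by the correction
    )
```

Keeping a single accumulated parameter set means the original frame is always resampled once from the source image, rather than being interpolated repeatedly at every level.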

3  Matching  Algorithm 

Combining the operations discussed in Section 2, we obtain our
matching algorithm, summarized as

Step 1: • Estimate the illuminant azimuth $\tau_i$ from frame $t_i$,
          $i = 1, 2$;
        • Set $s = 1$, $\theta = \tau_1 - \tau_2$, $\Delta X = 0$,
          $\Delta Y = 0$.

Step 2: • Reduce the image size to that of the lowest resolution
          layer;
        • Estimate feature points from each frame.

Step 3: • Apply an affine transformation with parameters
          $(s, \theta, \Delta X, \Delta Y)$ to the lowest resolution
          version of frame $t_1$ and its feature points;
        • Do initial matching to obtain estimates
          $(s', \theta', \Delta X', \Delta Y')$;
        • Update $(s, \theta, \Delta X, \Delta Y)$ by (17).

Step 4: • Reduce the image resolution to that of the current layer of
          the matching hierarchy;
        • Magnify the coordinates of the feature points to correspond
          to the resolution of the current layer;
        • Apply an affine transform with parameters
          $(s, \theta, \Delta X, \Delta Y)$ to frame $t_1$ and its
          feature points;
        • Do matching refinement to obtain
          $(s', \theta', \Delta X', \Delta Y')$;
        • Update the estimates using (17).

Step 5: • If the current level is at the highest resolution then stop;
        • Increase the image resolution and adjust the translation
          estimate by
          $(2 \Delta X,\ 2 \Delta Y) \Rightarrow (\Delta X,\ \Delta Y)$;
        • Go to Step 4.

4  Experiments 

Texture Image Registration
Registration of texture images is a difficult problem in computer
vision. We have tested our algorithm on images of grass, leather,
pigskin, sand, wood, and wool. Two typical results are presented here.
In these experiments, the input pair of images are 512 x 512, obtained
by digitizing a photo [3, 19] and its rotated version. The angles of
rotation are about 30° as measured by a goniometer. So the true
transform between the texture image pairs is expected to have a
rotation of about 30°, a scale close to 1, and a small amount of
translation. In the implementation of our registration algorithm, we
let the image size for the lowest resolution layer be 128 x 128 and
the m and n of (4) be 2 and 5 respectively. Only feature points
which  are  maximum  for  a  radius  of  r  =  a"  are  selected. 
The matching window parameter and the search space parameter are set
to $w_m = 8$ and $w_s = 3$ respectively. In
our  implementation,  matches  with  correlation  coefficient 
<  I  are  considered  insignificant  and  are  not  used  in 
motion  estimation,  unless  the  matches  are  among  the 
top  two  candidates  with  the  highest  correlation. 


Figure 2 presents the registration of two pictures of a grassy field.
(a) and (b) are the input images; (c) is the mosaicking of the
transformed 0° picture and the 30° picture. The transformation is done
by the estimated motion parameters, which are s = 1.00128,
θ = −30.2141°, ΔX = 0.1374 pixel, and ΔY = 17.0161 pixels.

Figure 3 presents the registration of two pictures of pigskin. (a) and
(b) are the input images; (c) is the mosaicking of the transformed 0°
picture and the 30° picture; (d) is the difference between the
transformed 0° picture and the 30° picture. The transformation is done
by the estimated motion parameters, which are s = 1.00024,
θ = −30.6011°, ΔX = 2.4026 pixels, and ΔY = 15.3286 pixels.

(2.a) Grass-00
(2.b) Grass-30
(2.c) Mosaicking of Grass-00 and Grass-30
(2.d) Difference after registration
Figure 2: Registration of grass pictures; zero of difference is shifted to 128.

(3.a) Pigskin-00
(3.b) Pigskin-30
(3.c) Mosaicking of Pigskin-00 and Pigskin-30
Figure 3: Registration of pigskin pictures

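In (d) of Figure 2, there are noticeable error rings in the central area. These are errors generated by the digitizer. In spite of the digitization error, our algorithm produces correct matches. The continuity in the mosaicking image and the patterns in the difference image show that the estimates are reasonably correct. Further quantitative error analysis can be performed by checking the consistency between forward and backward motion estimation. For image pair $f_1(X_1, Y_1)$ and $f_2(X_2, Y_2)$, assume that the estimated parameters of motion from $f_1(X_1, Y_1)$ to $f_2(X_2, Y_2)$ are $s_1$, $\theta_1$, $\Delta X_1$, and $\Delta Y_1$, and the estimated parameters of motion from $f_2(X_2, Y_2)$ to $f_1(X_1, Y_1)$ are $s_2$, $\theta_2$, $\Delta X_2$, and $\Delta Y_2$; then we have

\[
\begin{pmatrix} X_2 \\ Y_2 \end{pmatrix}
= s_1 \begin{pmatrix} \cos\theta_1 & \sin\theta_1 \\ -\sin\theta_1 & \cos\theta_1 \end{pmatrix}
\begin{pmatrix} X_1 \\ Y_1 \end{pmatrix}
+ \begin{pmatrix} \Delta X_1 \\ \Delta Y_1 \end{pmatrix}
\qquad (18)
\]

\[
\begin{pmatrix} X_1 \\ Y_1 \end{pmatrix}
= s_2 \begin{pmatrix} \cos\theta_2 & \sin\theta_2 \\ -\sin\theta_2 & \cos\theta_2 \end{pmatrix}
\begin{pmatrix} X_2 \\ Y_2 \end{pmatrix}
+ \begin{pmatrix} \Delta X_2 \\ \Delta Y_2 \end{pmatrix}
\qquad (19)
\]

Combining (18) and (19) leads to

\[
\begin{pmatrix} X_1 \\ Y_1 \end{pmatrix}
= s_1 s_2 \begin{pmatrix} \cos(\theta_1+\theta_2) & \sin(\theta_1+\theta_2) \\ -\sin(\theta_1+\theta_2) & \cos(\theta_1+\theta_2) \end{pmatrix}
\begin{pmatrix} X_1 \\ Y_1 \end{pmatrix}
+ s_2 \begin{pmatrix} \cos\theta_2 & \sin\theta_2 \\ -\sin\theta_2 & \cos\theta_2 \end{pmatrix}
\begin{pmatrix} \Delta X_1 \\ \Delta Y_1 \end{pmatrix}
+ \begin{pmatrix} \Delta X_2 \\ \Delta Y_2 \end{pmatrix}
\]

\[
\begin{pmatrix} X_2 \\ Y_2 \end{pmatrix}
= s_1 s_2 \begin{pmatrix} \cos(\theta_1+\theta_2) & \sin(\theta_1+\theta_2) \\ -\sin(\theta_1+\theta_2) & \cos(\theta_1+\theta_2) \end{pmatrix}
\begin{pmatrix} X_2 \\ Y_2 \end{pmatrix}
+ s_1 \begin{pmatrix} \cos\theta_1 & \sin\theta_1 \\ -\sin\theta_1 & \cos\theta_1 \end{pmatrix}
\begin{pmatrix} \Delta X_2 \\ \Delta Y_2 \end{pmatrix}
+ \begin{pmatrix} \Delta X_1 \\ \Delta Y_1 \end{pmatrix}
\]

The differences between the forward and backward estimates of $(s, \theta, \Delta X, \Delta Y)$ can be defined as

\[ \epsilon_s = |s_1 s_2 - 1| \qquad (20) \]
\[ \epsilon_\theta = |\theta_1 + \theta_2| \qquad (21) \]
\[ \epsilon_{\Delta X_f} = s_2(\cos\theta_2\,\Delta X_1 + \sin\theta_2\,\Delta Y_1) + \Delta X_2 \qquad (22) \]
\[ \epsilon_{\Delta Y_f} = s_2(-\sin\theta_2\,\Delta X_1 + \cos\theta_2\,\Delta Y_1) + \Delta Y_2 \qquad (23) \]
\[ \epsilon_{\Delta X_b} = s_1(\cos\theta_1\,\Delta X_2 + \sin\theta_1\,\Delta Y_2) + \Delta X_1 \qquad (24) \]
\[ \epsilon_{\Delta Y_b} = s_1(-\sin\theta_1\,\Delta X_2 + \cos\theta_1\,\Delta Y_2) + \Delta Y_1 \qquad (25) \]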
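As an illustration of the consistency check defined by (20) to (25), the following is a minimal Python sketch. The function names `consistency_errors` and `invert_motion` are hypothetical and are not part of the original work; angles are assumed to be given in degrees, and the rotation convention is the one used in the combined motion equations. `invert_motion` builds the exact backward motion for a given forward motion, so the errors it produces should vanish up to floating-point round-off.

```python
import math

def consistency_errors(fwd, bwd):
    """Evaluate the six forward/backward differences of equations (20)-(25).

    fwd = (s1, theta1_deg, dx1, dy1): estimated motion from image 1 to image 2.
    bwd = (s2, theta2_deg, dx2, dy2): estimated motion from image 2 to image 1.
    """
    s1, th1, dx1, dy1 = fwd
    s2, th2, dx2, dy2 = bwd
    t1, t2 = math.radians(th1), math.radians(th2)
    eps_s = abs(s1 * s2 - 1.0)                                           # (20)
    eps_theta = abs(th1 + th2)                                           # (21), in degrees
    eps_dxf = s2 * (math.cos(t2) * dx1 + math.sin(t2) * dy1) + dx2       # (22)
    eps_dyf = s2 * (-math.sin(t2) * dx1 + math.cos(t2) * dy1) + dy2      # (23)
    eps_dxb = s1 * (math.cos(t1) * dx2 + math.sin(t1) * dy2) + dx1       # (24)
    eps_dyb = s1 * (-math.sin(t1) * dx2 + math.cos(t1) * dy2) + dy1      # (25)
    return eps_s, eps_theta, eps_dxf, eps_dyf, eps_dxb, eps_dyb

def invert_motion(s, theta_deg, dx, dy):
    """Exact inverse of x2 = s R(theta) x1 + d, with R = [[cos, sin], [-sin, cos]]."""
    s_inv = 1.0 / s
    t_inv = math.radians(-theta_deg)
    c, sn = math.cos(t_inv), math.sin(t_inv)
    # d_inv = -(1/s) R(-theta) d
    dx_inv = -s_inv * (c * dx + sn * dy)
    dy_inv = -s_inv * (-sn * dx + c * dy)
    return s_inv, -theta_deg, dx_inv, dy_inv

# Forward parameters taken from the pigskin example; the backward
# parameters here are synthesized, not measured.
forward = (1.00024, -30.6011, 2.4026, 15.3286)
backward = invert_motion(*forward)
errors = consistency_errors(forward, backward)
```

With independently estimated backward parameters in place of the synthesized ones, the same function returns the quantities whose magnitudes are compared against the experimental bounds.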

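In all our experiments on texture image registration, the differences between the forward and backward estimates are bounded by

\[ \epsilon_s < 2 \times 10^{-\ldots} \]
\[ \epsilon_\theta < 0.07^\circ \]
\[ \epsilon_{\Delta X},\ \epsilon_{\Delta Y} < 0.5 \text{ pixel} \]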

Matching of Aerial Images  Automatic image registration is an important issue in remote sensing applications. Figures 4 and 5 show the results of using the registration algorithm for registering satellite images. Figure 4 shows the registration of two San Francisco images with significant differences in image orientations and intensities. SF3212 and SF3222, shown in (a) and (b), are the input images. The mosaicking of the transformed SF3212 and SF3222 is shown in (c). The estimated transform parameters are s = 0.983908, θ = 56.3428°, ΔX = -3.17038, and ΔY = -57.8287.

(4.c) Mosaicking of (4.a) and (4.b)

Figure 4: Registration of the San Francisco images.

(5.a) SD311

(5.b) SD325

(5.c) Mosaicking of (5.a) and (5.b)

Figure 5: Registration of the San Diego images.

(6.a) frame-1

Figure 5 shows the registration of two San Diego images, in which (a) and (b) are the input images. The mosaicking of the transformed SD311 and SD325 is shown in (c). The estimated transform parameters are s = 0.999911, θ = 0.222798°, ΔX = 51.2238, and ΔY = 78.6986. In Figure (5.c), inspection of the continuity of features in the mosaicking image shows that the registration is correct.

Matching of Stereo Images  Most of the known stereo algorithms assume that the input images have already been aligned so that the epipolar line is parallel to the scanning direction; also the images are roughly registered and the scaling between the image pair is adjusted. In many situations, this initial matching is performed by manually picking some feature points and aligning the images using a stereoscopic platform. Our registration algorithm is useful for obtaining the initial matching and the direction of the epipolar lines. Two experimental results on stereo image pair registration are presented here.

Figure 6 shows results on matching the first two frames of a robot arm image sequence*. (a) and (b) show the input images; (c) shows the difference between the transformed (a) and (b). The estimated transform parameters are s = 1.00131, θ = 4.19226°, ΔX = -0.125131, and ΔY = -5.25858.

Figure 7 shows an experiment on matching the first and last frames of a chemical plant image sequence (close view). In Figure 7, (a) and (b) are the two input frames of the chemical plant image sequence; (c) shows the difference between the motion compensated (a) and (b). The estimated motion parameters are s = 1.06916, θ = 0.175411°, ΔX = -0.784010, and ΔY = -28.5614. Note that the field of the last frame is smaller than the field of the first frame and the translation between the two frames is mainly in the vertical direction. These are consistent with the fact that the camera is approaching the chemical plant.

*The robot arm image sequence was provided by University of Massachusetts, Department of Computer and Information Sciences.

(6.c) Difference after registration

Figure 6: Registration of a robot arm image sequence (the first two frames). In (6.c) the first frame is first transformed using the estimated motion parameters.

(7.a) frame-1

(7.c) Difference after registration

Figure 7: Registration of a chemical plant image sequence (close view). In (7.c) the first frame is first transformed using the estimated motion parameters.

(8.a) first frame

(8.b) eighth frame

(8.c) Direct difference

Moving Object Detection  Our algorithm can also be used for change detection from satellite images. Here we present applications of our algorithm to moving object detection.

Figure 8 shows an experiment in change detection using the first and the eighth frames of a helicopter image sequence. The basic purpose is to detect the differences in successive images. To carry this out, the motion of the camera needs to be compensated first. Our registration algorithm is useful in this application, as demonstrated in the following. In Figure 8, (a) and (b) are the input images, (c) shows the direct difference between (a) and (b), and (d) shows the difference between (a) and (b) after the motion of the camera has been compensated. The improvement of (d) over (c) is obvious. The moving helicopter can be detected by thresholding the difference image shown in Figure (8.d).
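The thresholding step mentioned above can be sketched as follows. This is only an illustration under stated assumptions: the two frames are taken to be already motion compensated and stored as equally sized lists of rows of grey levels, the function name `change_mask` is hypothetical, and the threshold value is arbitrary since the text does not give one.

```python
def change_mask(reference, compensated, threshold):
    """Mark pixels whose absolute grey-level difference exceeds the threshold.

    Both inputs are 2-D lists of equal shape; `compensated` is assumed to
    have been registered to `reference` beforehand.
    """
    return [
        [abs(c - r) > threshold for r, c in zip(ref_row, cmp_row)]
        for ref_row, cmp_row in zip(reference, compensated)
    ]

# Toy 3x3 example: only the centre pixel changes appreciably.
frame_a = [[10, 10, 10], [10, 10, 10], [10, 10, 10]]
frame_b = [[11, 10, 9], [10, 90, 10], [10, 12, 10]]
mask = change_mask(frame_a, frame_b, threshold=20)
```

Small residual differences caused by registration error or sensor noise stay below the threshold, which is why compensating the camera motion first matters.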


(8.d)  Motion  compensated  difference 
Figure  8:  Change  detection  from  helicopter  images 


5  Conclusion 

A robust 2-D translation, rotation, and scaling registration algorithm has been presented. We have illustrated the performance of the algorithm on a variety of images. Satisfactory registration results have been obtained for registration of texture images, stereo matching, satellite image mosaicking, and moving object detection.
This paper considers the relative placement problem for two cameras viewing a point set in three dimensions. A non-iterative algorithm to solve this problem was given by Longuet-Higgins [3]. However, Longuet-Higgins's solution made certain assumptions about the camera that may not be justified in practice. In particular, it is assumed implicitly in his paper that the focal length of the two cameras is known, as is the principal point (the point where the focal axis of the camera intersects the image plane). Whereas it is often a safe assumption that the principal point of an image is at the center pixel, the focal length of the camera is not easily deduced, and will generally be unknown for images of unknown origin. In this paper, a non-iterative algorithm is given for the solution of the relative camera placement problem when the focal lengths of the cameras are unknown. It will also reexamine the algorithm of Longuet-Higgins and suggest a simplification.

The focal length of a camera determines the field of view of an image, or more particularly, the angle subtended per unit length in the focal plane. The effect of increased focal length is however indistinguishable from enlargement. For digital images digitized from film and sampled into discrete pixels, the concept of focal length is meaningless, unless the exact pixel size is known. For this reason, I prefer to use the term "magnification" in place of the term focal length.

1. THE 8-POINT ALGORITHM

First I will derive the 8-point algorithm of Longuet-Higgins in order to fix notation and to gain some insight into its properties. An alternative derivation was given in [3] or [4]. Since we are dealing with homogeneous coordinates, we are interested only in values determined up to scale. Consequently we introduce the notation $\approx$ to indicate equality up to multiplication by a scale factor. Image space coordinates will usually be given in homogeneous coordinates as $(u, v, w)^T$.

1.1  Algorithm  Derivation. 

We consider the case of two cameras, one of which is situated at the origin $(0, 0, 0)^T$ of object space coordinates, and one which is displaced from it. The two cameras may be represented by the transformation that they perform translating points from object space into image space coordinates. The two transformations are assumed to be

\[ (u, v, w)^T = (x, y, z)^T \qquad \text{(Eq.1)} \]

and

\[ (u', v', w')^T = R\,\bigl((x, y, z)^T - T\bigr) \qquad \text{(Eq.2)} \]

where $R$ is a rotation matrix, and the vectors $(u, v, w)^T$ and $(u', v', w')^T$ are the homogeneous coordinates of the image points. Writing $T = (t_x, t_y, t_z)^T$, and using homogeneous coordinates in both object and image space, the above relations may be written in matrix form as

\[ (u, v, w)^T = P_1\,(x, y, z, 1)^T = (I \mid 0)\,(x, y, z, 1)^T \qquad \text{(Eq.3)} \]

and

\[ (u', v', w')^T = P_2\,(x, y, z, 1)^T = (R \mid -RT)\,(x, y, z, 1)^T \qquad \text{(Eq.4)} \]

where $(R \mid -RT)$ and $(I \mid 0)$ are 3x4 matrices divided into a 3x3 block and a 3x1 column and $I$ is the identity matrix.

Now, I will define a transformation between the 2-dimensional projective plane of image coordinates in image 1 and the pencil of epipolar lines in the second image. In particular, as is well known, given a point $(u, v, w)^T$ in image 1, the corresponding point in image 2 must lie on a certain epipolar line, which is the image under $P_2$ of the set $S$ of all points $(x, y, z, 1)^T$ which map under $P_1$ to $(u, v, w)^T$. To determine this line one may identify two points in $S$, namely the camera origin $(0, 0, 0, 1)^T$ and the point at infinity, $(u, v, w, 0)^T$. The images of these two points under $P_2$ are $-RT$ and $R(u, v, w)^T$ respectively. Consequently, the line that passes through these two points is given in homogeneous coordinates by their cross product,

\[ (p, q, r)^T \approx RT \times R\,(u, v, w)^T \qquad \text{(Eq.5)} \]

Here $(p, q, r)^T$ represents the line $pu' + qv' + rw' = 0$. Representing by $S$ the matrix

\[ S = \begin{pmatrix} 0 & -t_z & t_y \\ t_z & 0 & -t_x \\ -t_y & t_x & 0 \end{pmatrix} \qquad \text{(Eq.6)} \]

equation (Eq. 5) may be written as

\[ (p, q, r)^T \approx RS\,(u, v, w)^T \qquad \text{(Eq.7)} \]

Since the point $(u', v', w')^T$ corresponding to $(u, v, w)^T$ must lie on the epipolar line, we have the important relation

\[ (u', v', w')\, Q\, (u, v, w)^T = 0 \qquad \text{(Eq.8)} \]

where $Q = RS$. This relationship is due to Longuet-Higgins ([3]).

As is well known, given 8 correspondences or more, the matrix Q may be computed by solving a (possibly overdetermined) set of linear equations. In order to compute the second camera transform, it is necessary to factor Q into the product RS of a rotation matrix and a skew-symmetric matrix. Longuet-Higgins [3] gives a rather involved, and apparently numerically somewhat unstable method of doing this. I will give an alternative method of factoring the Q matrix based on the Singular Value Decomposition ([5]). The following result may be verified.

Theorem 1: A 3x3 matrix Q can be factored as the product of a rotation matrix and a non-zero skew-symmetric matrix if and only if Q has two equal non-zero singular values and one singular value equal to 0. //

For  a  proof  see  [1]. 

This  theorem  allows  us  to  give  an  easy  method  of  factoring 
any  matrix  into  a  product  RS,  where  possible. 

Theorem 2: Suppose the matrix Q can be factored into a product RS where R is orthogonal and S is skew-symmetric. Let the singular value decomposition of Q be $UDV^T$ where $D = \mathrm{diag}(k, k, 0)$. Then up to a scale factor the factorization is one of the following:

\[ S \approx VZV^T; \qquad R \approx UEV^T \ \text{or}\ UE^TV^T; \qquad Q \approx RS, \]

where

\[ E = \begin{pmatrix} 0 & 1 & 0 \\ -1 & 0 & 0 \\ 0 & 0 & 1 \end{pmatrix}, \qquad Z = \begin{pmatrix} 0 & 1 & 0 \\ -1 & 0 & 0 \\ 0 & 0 & 0 \end{pmatrix} \qquad \text{(Eq.9)} \]

//

Proof: That the given factorization is valid is true by inspection. That these are the only solutions is implicit in the paper of Longuet-Higgins. //

It may be verified that $T$ (the translation vector) in Theorem 2 is equal to $V(0, 0, 1)^T$ since this ensures that $ST = 0$ as required by (Eq. 6). Furthermore $\|T\| = 1$, which is a convenient normalization suggested in [3]. As remarked by Longuet-Higgins, the correct solution to the camera placement problem may be chosen based on the requirement that the visible points be in front of both cameras ([3]). There are four possible rotation-translation pairs that must be considered based on the two possible choices of $R$ and two possible signs of $T$. Therefore, since $UEV^T \cdot V(0, 0, 1)^T = U(0, 0, 1)^T$, the requisite camera matrix $(R \mid -RT)$ is equal to $(UEV^T \mid U(0, 0, 1)^T)$ or one of the obvious alternatives.

1.2.  Numerical  Considerations. 

In any practical application, the matrix Q found will not factor exactly in the required manner because of inaccuracies of measurement. In this case, the requirement will be to find the matrix "closest" to Q that does factor into a product RS. Using the sum of squares of matrix entries as a norm, we wish to find the matrix $Q' = RS$ such that $\|Q - Q'\|$ is minimized. The following theorem shows that the factorization given in the previous theorem is numerically optimal.

Theorem: Let Q be any 3 by 3 matrix and $Q = UDV^T$ be its singular value decomposition in which $D = \mathrm{diag}(r, s, t)$ and $r \ge s \ge t$. Define the matrix $Q'$ by $Q' = UD'V^T$ where $D' = \mathrm{diag}(k, k, 0)$ and $k = (r+s)/2$. Then $Q'$ is the matrix closest to Q under the sum-of-squares norm which satisfies the condition $Q' = RS$, where R is a rotation and S is skew-symmetric. Furthermore, the factorization is given up to sign by $R = UEV^T$ or $UE^TV^T$ and $S = k\,VZV^T$. //

This theorem is plausible given the norm-preserving property of orthogonal transformations. However, its proof is not entirely obvious and falls outside of the scope of this paper.

1.3.  Algorithm  Outline. 

The algorithm for computing relative camera locations for calibrated cameras is as follows.

1. Find Q by solving a set of equations of the form (Eq.8).

2. Find the singular value decomposition $Q = UDV^T$, where $D = \mathrm{diag}(a, b, c)$ and $a \ge b \ge c$.

3. The transformation matrices for the two cameras are $C_1 = (I \mid 0)$ and $C_2$ equal to one of the four following matrices.

$(UEV^T \mid U(0, 0, 1)^T)$

$(UEV^T \mid U(0, 0, -1)^T)$

$(UE^TV^T \mid U(0, 0, 1)^T)$

$(UE^TV^T \mid U(0, 0, -1)^T)$

The choice between the four transformations for $C_2$ is determined by the requirement that the point locations (which may be computed once the cameras are known [3]) must lie in front of both cameras. Geometrically, the camera rotations represented by $UEV^T$ and $UE^TV^T$ differ from each other by a rotation through 180 degrees about the line joining the two cameras. Given this fact, it may be verified geometrically that a single pixel-to-pixel correspondence is enough to eliminate all but one of the four alternative camera placements.
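Steps 2 and 3 of the outline, together with the projection onto the closest factorable matrix from section 1.2, can be sketched with NumPy as below. This is an illustrative sketch rather than the original C implementation: the function names `closest_rs_matrix` and `candidate_cameras` are hypothetical, the estimation of Q from point matches (step 1) and the in-front-of-both-cameras test are omitted, and the sign normalisation of U and V is an added assumption that relies on Q being defined only up to scale.

```python
import numpy as np

E = np.array([[0.0, 1.0, 0.0], [-1.0, 0.0, 0.0], [0.0, 0.0, 1.0]])

def closest_rs_matrix(Q):
    """Replace the singular values (r, s, t) of Q by (k, k, 0), k = (r + s) / 2."""
    U, d, Vt = np.linalg.svd(Q)
    k = 0.5 * (d[0] + d[1])
    return U @ np.diag([k, k, 0.0]) @ Vt

def candidate_cameras(Q):
    """Return the four candidate 3x4 matrices (R | t) for the second camera."""
    U, _, Vt = np.linalg.svd(Q)
    # Assumption: force det(U) = det(V) = +1 so that U E V^T is a proper
    # rotation; flipping a sign only rescales Q by -1.
    if np.linalg.det(U) < 0:
        U = -U
    if np.linalg.det(Vt) < 0:
        Vt = -Vt
    u3 = U[:, 2:3]  # U (0, 0, 1)^T as a column
    cameras = []
    for R in (U @ E @ Vt, U @ E.T @ Vt):
        for sign in (1.0, -1.0):
            cameras.append(np.hstack([R, sign * u3]))
    return cameras

# Synthetic check: build Q = R S from a known rotation and unit translation.
a, b = 0.3, 0.5
Rz = np.array([[np.cos(a), -np.sin(a), 0.0], [np.sin(a), np.cos(a), 0.0], [0.0, 0.0, 1.0]])
Rx = np.array([[1.0, 0.0, 0.0], [0.0, np.cos(b), -np.sin(b)], [0.0, np.sin(b), np.cos(b)]])
R_true = Rz @ Rx
T_true = np.array([1.0, 2.0, 2.0]) / 3.0
S_true = np.array([[0.0, -T_true[2], T_true[1]],
                   [T_true[2], 0.0, -T_true[0]],
                   [-T_true[1], T_true[0], 0.0]])
Q_true = R_true @ S_true
cams = candidate_cameras(Q_true)
```

In this synthetic setting one of the four candidates should reproduce $(R \mid -RT)$ for the rotation and translation used to build Q; with real data, `closest_rs_matrix` would be applied first and the remaining ambiguity resolved by checking a reconstructed point against both cameras.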

2. UNCALIBRATED CAMERAS.

If the internal camera calibration is not known, then the problem of finding the camera parameters is considerably more difficult. In general one would like to allow arbitrary non-singular matrices $K$ describing internal camera calibration and consider camera matrices of the general form $(K_2R \mid -K_2RT)$, that is, general 3x4 matrices. Because $K$ is multiplied by a rotation, $R$, it may be assumed that $K$ is upper triangular. Further allowing for an arbitrary scale factor, there are 5 remaining independent entries in $K$ representing camera parameters. Other authors have allowed four internal camera parameters, namely principal point offsets in two directions and different scale factors in two directions. If however different scaling is allowed in two directions not necessarily aligned with the direction of the image-space axes, then one more parameter is needed, making up the 5.

It is too much to hope that from a set of image point correspondences one could retrieve the full set of internal camera parameters for a pair of cameras as well as the relative external positioning of the cameras. Indeed if $\{x_i\}$ are a set of points visible in a pair of cameras with transform matrices $M_1$ and $M_2$, and $G$ is an arbitrary non-singular 4x4 matrix, then replacing each $x_i$ by $G^{-1}x_i$ and each camera $M_j$ with $M_jG$ preserves the object-point to image-space correspondences. As may be seen, the internal parameters of one of the cameras, $M_1$ say, may be chosen arbitrarily. The situation is not helped by adding more cameras. This is in contrast to the case of calibrated cameras in which a finite number of solutions are possible ([1]). The question remains, therefore, how much can be deduced about the internal camera parameters from a set of image correspondences. It will be shown in this paper that assuming all other internal parameters to be known, the magnification factors of the two cameras may be computed.

2.1 Algorithm Derivation

Let $K_1$ and $K_2$ be two matrices representing the internal camera transformations of the two cameras and let $P_1 = (K_1 \mid 0)$ and $P_2 = (K_2R \mid -K_2RT)$ be the two camera transforms. The task is to obtain $R$, $T$, $K_1$ and $K_2$ given a set of image-point correspondences. For the present, the matrices $K_1$ and $K_2$ will be assumed arbitrary.

As before, it is possible to determine the epipolar line corresponding to a point $(u, v, w)^T$ in image 1. The two points that must lie on the epipolar line are the images under $P_2$ of the camera centre $(0, 0, 0, 1)^T$ of the first camera and the point at infinity $\begin{pmatrix} K_1^{-1}(u, v, w)^T \\ 0 \end{pmatrix}$. Transform $P_2$ takes these two points to the points $-K_2RT$ and $K_2RK_1^{-1}(u, v, w)^T$. The line through these points is given by the cross product

\[ K_2RT \times K_2RK_1^{-1}(u, v, w)^T. \qquad \text{(Eq. 10)} \]

If $K$ is a square matrix, we use the notation $K^*$ to represent the adjoint of $K$, that is the matrix of cofactors defined by $K^*_{ij} = (-1)^{i+j}\det(K^{(ij)})$ where $K^{(ij)}$ is the matrix derived from $K$ by removing the i-th row and j-th column. If $K$ is non-singular, then it is well known that $K^* = \det(K)\,(K^T)^{-1}$. In other words, $K^* \approx (K^T)^{-1}$. The adjoint matrix is related to cross products in the following way.

Remark: If $a$ and $b$ are 3-dimensional column vectors and $K$ is a 3 x 3 matrix, then $Ka \times Kb = K^*(a \times b)$.
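The Remark can be checked numerically with a few lines of plain Python. The helper names below (`cofactor_matrix`, `mat_vec`, `cross`) and the sample matrix and vectors are illustrative choices, not part of the original paper; integer entries are used so that the comparison is exact.

```python
def cross(a, b):
    """Cross product of two 3-vectors."""
    return [a[1] * b[2] - a[2] * b[1],
            a[2] * b[0] - a[0] * b[2],
            a[0] * b[1] - a[1] * b[0]]

def mat_vec(K, a):
    """Product of a 3x3 matrix (list of rows) and a 3-vector."""
    return [sum(K[i][j] * a[j] for j in range(3)) for i in range(3)]

def cofactor_matrix(K):
    """K*_ij = (-1)^(i+j) det(K^(ij)), with row i and column j removed."""
    def minor(i, j):
        rows = [r for r in range(3) if r != i]
        cols = [c for c in range(3) if c != j]
        return (K[rows[0]][cols[0]] * K[rows[1]][cols[1]]
                - K[rows[0]][cols[1]] * K[rows[1]][cols[0]])
    return [[(-1) ** (i + j) * minor(i, j) for j in range(3)] for i in range(3)]

K = [[2, 0, 1], [0, 3, -1], [1, 1, 4]]
a = [1, 2, 3]
b = [-1, 0, 2]
lhs = cross(mat_vec(K, a), mat_vec(K, b))      # Ka x Kb
rhs = mat_vec(cofactor_matrix(K), cross(a, b))  # K* (a x b)
```

For a magnification matrix of the form diag(1, 1, k), the cofactor matrix is simply diag(k, k, 1), which makes the identity easy to confirm by hand as well.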


Using this fact it is easy to evaluate the cross product

\[ K_2RT \times K_2RK_1^{-1}(u, v, w)^T \approx K_2^*RK_1^{*-1}\bigl(K_1T \times (u, v, w)^T\bigr) \qquad \text{(Eq.11)} \]

If $(t_x', t_y', t_z')^T = K_1T$ then writing

\[ S = \begin{pmatrix} 0 & -t_z' & t_y' \\ t_z' & 0 & -t_x' \\ -t_y' & t_x' & 0 \end{pmatrix} \qquad \text{(Eq.12)} \]

as before, we have a formula for the epipolar line corresponding to the point $(u, v, w)^T$ in image 1,

\[ (p, q, r)^T \approx K_2^*RK_1^TS\,(u, v, w)^T \qquad \text{(Eq.13)} \]

Furthermore, setting $Q = K_2^*RK_1^TS$ we have the formula

\[ (u', v', w')\,Q\,(u, v, w)^T = 0. \qquad \text{(Eq.14)} \]

After solving to find the matrix Q as before, we are faced with the more difficult task of finding the factorization $Q \approx K_2^*RK_1^TS$.

An alternative factorization for Q that may be derived from (Eq. 10) is

\[ Q \approx (K_2^{-1})^TRS'K_1^{-1} \qquad \text{(Eq. 15)} \]

where $S'$ is as given in (Eq. 6).

2.2. Factorization of Q.

Our goal is to find the factorization $Q \approx K_2^*RK_1^TS$. As before, we use the Singular Value Decomposition, $Q = UDW^T$. Since Q is singular, the diagonal matrix D equals $\mathrm{diag}(r, s, 0)$ where r and s are two positive constants.
Since $QW(0, 0, 1)^T = 0$, it follows that $SW(0, 0, 1)^T = 0$ since $K_2^*RK_1^T$ is non-singular, and so

\[ S \approx W \begin{pmatrix} 0 & 1 & 0 \\ -1 & 0 & 0 \\ 0 & 0 & 0 \end{pmatrix} W^T \]

as in section 1.2. The general solution to the problem of factoring Q into a product $R'S$, where $R'$ is non-singular and $S$ is skew-symmetric, is therefore given by

\[ Q \approx \left( U \begin{pmatrix} r & 0 & \alpha \\ 0 & s & \beta \\ 0 & 0 & \gamma \end{pmatrix} E^TW^T \right) \left( WZW^T \right) \qquad \text{(Eq.16)} \]

where $\alpha$, $\beta$ and $\gamma$ are arbitrary constants. The two bracketed expressions are $R'$ and $S$ respectively and the factorization is unique (except for the variables $\alpha$, $\beta$ and $\gamma$) up to scale. Unlike in the case in section 1.2, we do not need to consider the alternate solution in which $E^T$ is replaced by $E$, since that is taken care of by the undetermined values $\alpha$, $\beta$ and $\gamma$. Since both $E$ and $W$ are orthogonal matrices, we write $V = WE$, and $V$ is also orthogonal. Further writing

\[ X_{\alpha,\beta,\gamma} = \begin{pmatrix} r & 0 & \alpha \\ 0 & s & \beta \\ 0 & 0 & \gamma \end{pmatrix} \]

we obtain the expression

\[ R' = U X_{\alpha,\beta,\gamma} V^T \qquad \text{(Eq.17)} \]

Now, we turn our attention to the matrix $R'$. For some values of $\alpha$, $\beta$ and $\gamma$, we must have an identity $R' \approx K_2^*RK_1^{*-1}$, where $R$ is a rotation matrix. From this it follows that $R \approx K_2^{*-1}R'K_1^*$. The particular property of a rotation matrix that we will use is that it is equal to its adjoint (inverse transpose). This means that $K_2^{*-1}R'K_1^* \approx K_2^{-1}R'^*K_1$ or

\[ K_2K_2^TR' \approx R'^*K_1K_1^T. \qquad \text{(Eq.18)} \]

Since $R' = UXV^T$, it follows that $R'^* \approx UX^*V^T$ and $X^*$ is the matrix

\[ X^* = X^*_{\alpha,\beta,\gamma} = \begin{pmatrix} s\gamma & 0 & 0 \\ 0 & r\gamma & 0 \\ -s\alpha & -r\beta & rs \end{pmatrix} \qquad \text{(Eq.19)} \]

and

\[ (K_2K_2^T)\,U X_{\alpha,\beta,\gamma} V^T \approx U X^*_{\alpha,\beta,\gamma} V^T\,(K_1K_1^T) \qquad \text{(Eq.20)} \]

At this point, it is necessary to specialize to the case where $K_1$ and $K_2$ are of the simple form $K_1 = \mathrm{diag}(1, 1, k_1)$ and $K_2 = \mathrm{diag}(1, 1, k_2)$. In this case, $k_1$ and $k_2$ are the inverses of the magnification factors. If the entries of $UXV^T$ are $(f_{ij})$ and those of $UX^*V^T$ are $(g_{ij})$, then multiplying by $(K_2K_2^T)$ and $(K_1K_1^T)$ respectively gives an equation

\[ \begin{pmatrix} f_{11} & f_{12} & f_{13} \\ f_{21} & f_{22} & f_{23} \\ k_2^2f_{31} & k_2^2f_{32} & k_2^2f_{33} \end{pmatrix} = x \begin{pmatrix} g_{11} & g_{12} & k_1^2g_{13} \\ g_{21} & g_{22} & k_1^2g_{23} \\ g_{31} & g_{32} & k_1^2g_{33} \end{pmatrix} \qquad \text{(Eq.21)} \]

where the $f_{ij}$ and $g_{ij}$ are linear expressions in $\alpha$, $\beta$ and $\gamma$, and $x$ is an unknown scale factor.

The top left hand block of this equation set comprises a set of equations of the form

\[ \begin{pmatrix} f_{11} & f_{12} \\ f_{21} & f_{22} \end{pmatrix} = x \begin{pmatrix} g_{11} & g_{12} \\ g_{21} & g_{22} \end{pmatrix} \qquad \text{(Eq.22)} \]

If the scale factor $x$ were known, then this system could be solved as a set of linear equations. Unfortunately, $x$ is not known, and it is necessary to find the value of $x$ before solving the set of linear equations for $\alpha$, $\beta$ and $\gamma$.

Since the entries of the matrices on both sides of (Eq. 21) are linear expressions in $\alpha$, $\beta$ and $\gamma$, it is possible to rewrite (Eq. 22) in the form

\[ M_1 \begin{pmatrix} \alpha \\ \beta \\ \gamma \\ 1 \end{pmatrix} = x\,M_x \begin{pmatrix} \alpha \\ \beta \\ \gamma \\ 1 \end{pmatrix} \qquad \text{(Eq.23)} \]

where $M_1$ and $M_x$ are 4x4 matrices, each row of $M_1$ or $M_x$ corresponding to one of the four entries in the matrices in (Eq. 22). Such a set of equations has a solution only if $\det(M_1 - xM_x) = 0$.

This leads to a polynomial equation of degree 4 in $x$: $p(x) = \det(M_1 - xM_x) = 0$. This equation may be solved to give a value of $x$. It turns out, however, that it is sufficient to solve a quadratic equation. In particular, it may be shown that $\det(M_1)$ and $\det(M_x)$ are both zero. Since these represent the constant and fourth-order terms of $\det(M_1 - xM_x)$ it follows that $\det(M_1 - xM_x)$ is a cubic polynomial with one root equal to 0. To solve the equation, therefore, it is necessary only to solve a quadratic equation.

At this stage we know that $\det(M_1 - xM_x)$ is a polynomial of the form $p(x) = a_3x^3 + a_2x^2 + a_1x$ and so may be easily solved. The root $x = 0$ of this polynomial may safely be ignored, since according to (Eq. 22) it would imply that $f_{ij} = 0$ for $i, j \le 2$, and hence that $R'$ is singular, which by assumption it is not.

Once the value of $x$ is known, the values of $\alpha$, $\beta$ and $\gamma$ may be determined by solving the set of equations given in (Eq. 22). Finally, the values of $k_1$ and $k_2$ may be read off from equations (Eq. 21). In particular,

\[ k_2^2 = x\,g_{31}/f_{31} = x\,g_{32}/f_{32} \]
\[ k_1^2 = f_{13}/(x\,g_{13}) = f_{23}/(x\,g_{23}) \qquad \text{(Eq.24)} \]
\[ k_2^2f_{33} = x\,k_1^2\,g_{33}. \]

This redundancy gives a check on the correctness of the computed values of $k_1$ and $k_2$ which may be used to distinguish between the two roots of $p(x)$. In practice, however, several observations have been made which could be used to simplify the case checking.

1. The two roots of the polynomial $p(x) = \det(M_1 - xM_x)$ are negatives of each other.

2. For some Q matrices, the determinant $p(x)$ has purely imaginary roots. This would imply the impossibility of a real solution and so could not occur if the Q matrix is derived from the problem of image-point matches.

3. The estimated values of $k_1^2$ and $k_2^2$ corresponding to the two opposite roots of $p(x)$ are negatives of each other. In one of these cases, therefore, $k_1^2$ is negative, and may be ignored.

4. The two estimates for $k_2^2$ given by the first line of (Eq. 24) are exactly equal. The same holds for the two values of $k_1^2$.

At this point, it is possible to continue and compute the rotation matrix from (Eq. 17). However, it turns out to be more convenient, now that the values of the magnification are known, to revert to the case of a calibrated camera. More particularly, we observe that according to (Eq. 15), Q may be written as $Q \approx (K_2^{-1})^TQ'K_1^{-1}$ where $Q' = RS$, and $R$ is a rotation matrix. The original method of section 1.3 may now be used to solve for the camera matrices derived from $Q'$. In this way, we find camera models $C_1 = (I \mid 0)$ and $C_2 = (R \mid -RT)$ for the two cameras corresponding to $Q'$. Taking account of the magnification matrices $K_1$ and $K_2$, the final estimates of the camera matrices are $(K_1 \mid 0)$ and $(K_2R \mid -K_2RT)$.

In practice it has been observed that greater numerical accuracy is obtained by repeating the computation of $k_1$ and $k_2$ after replacing Q by $Q'$. The values of $k_1$ and $k_2$ computed from $Q'$ are very close to 1 and may be used to revise the computed magnifications very slightly. However, such a revision compensates only for numerical round-off error in the algorithm and is not strictly necessary.

3.  PRACTICAL  RESULTS. 

This algorithm has been encoded in C and tested on a variety of examples. In the first test, a set of 25 matched points was computed synthetically, corresponding to an oblique placement of two cameras with equal magnification values of 1003. The principal point offset was assumed known. The solution to the relative camera placement problem was computed. The two cameras were computed to have magnifications of 1003.52 and 1003.71, very close to the original. Camera placements and point positions were computed and were found to match the input pixel position data within limits of accuracy. Similarly, the positions in 3-space of the object points matched the known positions to within one part in 10^….

The algorithm was also tested on a set of matched points derived from a stereo-matching program, STEREOSYS ([2]). A set of 124 matched points was found by an unconstrained hierarchical search. The two images used were 1024 by 1024 aerial overhead images of the Malibu region with about 40% overlap. The algorithm described here was applied to the set of 124 matched points and relative camera placements and object-point positions were computed. The computed model was then evaluated against the original data. Consequently, the computed camera models were applied to the computed 3-D object points to give new pixel locations which were then compared with the original reference pixel data. The RMS pixel error was found to be 0.11 pixels. In other words, the derived model matches the actual data with a standard deviation of 0.11 pixels. This shows the accuracy not only of the described camera modelling data, but also the accuracy of the point-matching algorithms.
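The evaluation described above compares reprojected pixel locations with the reference pixel data. A minimal sketch of such an RMS computation is given below; the function name `rms_pixel_error` is hypothetical, and the choice of the Euclidean distance per point as the residual is an assumption, since the text does not state exactly how the RMS error was formed.

```python
import math

def rms_pixel_error(reference, reprojected):
    """Root-mean-square Euclidean distance between paired pixel positions.

    Each argument is a sequence of (u, v) pixel coordinates; the two
    sequences are assumed to be in corresponding order.
    """
    squared = [(u - up) ** 2 + (v - vp) ** 2
               for (u, v), (up, vp) in zip(reference, reprojected)]
    return math.sqrt(sum(squared) / len(squared))

# Toy data: one point is off by a 3-4-5 triangle, the other matches exactly.
reference_pixels = [(0.0, 0.0), (1.0, 1.0)]
reprojected_pixels = [(3.0, 4.0), (1.0, 1.0)]
error = rms_pixel_error(reference_pixels, reprojected_pixels)
```

Applied to matched points and their reprojections, a value of this kind is what the 0.11 pixel figure summarises.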
Abstract

It  is  well  known  that  given  at  least  four  reference 
points/lines  on  a  planar  surface  and  their  correspon¬ 
dences  in  the  image,  the  relative  positions  of  other 
points/lines  on  the  plane  can  be  derived  regardless  of 
camera  position  or  intrinsic  calibration  parameters. 
This  result  is  used  to  perform  model  extension  for 
points  and  lines  on  planar  surfaces.  A  new  frame¬ 
work  for  data  fusion  in  the  projective  plane  is  presented 
to  merge  the  relative  positions  of  coplanar  points  and 
lines  derived  in  this  way. 

1  Planar  Model  Extension 

Acquiring  3D  models  of  the  environment  is  an  im¬ 
portant  current  research  problem  in  computer  vision. 
Modeling  the  world  in  all  its  complexity  is  a  daunting 
task,  due  to  open  issues  in  representing  curved  sur¬ 
faces  and  volumes,  and  the  variety  of  textures  found 
in  natural  scenes.  Because  of  these  difficulties,  many 
researchers  have  focused  on  man-made  domains  where 
planar surfaces and linear surface markings predominate.
The limitations of these domains are mitigated
by  the  fact  that  useful  applications  exist  where  the 
planarity assumption does hold, such as indoor mobile
robot  navigation.  Even  in  unrestricted  environments, 
the  world  is  sometimes  flat  enough  locally  to  approxi¬ 
mate  by  piecewise  planar  patches. 

The  major  benefit  of  adopting  the  world-planarity 
assumption  is  that  the  relevant  geometric  entities, 
namely  points,  lines  and  planes,  are  easily  represented 
as  linear  subspaces.  Furthermore,  a  rich  set  of  results 
from  the  field  of  projective  geometry  become  avail¬ 
able.  The  relevance  of  projective  geometry  to  the 
visual acquisition of planar surface models cannot be
overstressed:  Projective  geometry  provides  a  mathe¬ 
matical  foundation  for  characterizing  and  representing 
the  relationships  between  linear  subspaces  that  remain 
invariant  under  the  imaging  process. 


'This  work  was  funded  by  DARPA  and  TACOM  under 
contract  number  DAAE07-91-C-R035  and  by  the  National 
Science Foundation under grant number CDA-8922572.


This paper describes an approach to model extension using properties
of projective mappings between planes. Model extension is just one
application of a general framework being developed for geometric
inference in projective space. The term model extension reflects the
notion of combining a priori structure information from a partial
model with the observed structure of an image, to derive further
extensions to the model. In particular, a priori knowledge of the
relative positions of at least four coplanar points or lines is used
to derive the positions of other points and lines on the same plane in
a way that is invariant to relative camera location and intrinsic
camera parameters. One of the main contributions of this work is the
development of an appropriate methodology for fusing geometric
information in the projective plane.

Section 2 presents background material showing that a number of useful
plane to plane correspondences can be described by invertible linear
transformations called homographies. The condition of invertibility is
crucial to enabling geometric inference between planes, but it
requires that the planes be treated as projective, rather than affine
or Euclidean, a distinction that becomes important when considering
how to represent and propagate uncertainty in the data. In Section 3,
an approach to single plane model extension using homographies is
described that decouples determination of the relative 2D locations of
coplanar points and lines from the calculation of the 3D position of
the plane in which they lie, allowing reconstruction of a planar
object face without knowing the position of the camera or the
appropriate camera calibration parameters. In Section 4, geometric
information derived from multiple views is fused in the projective
plane using an antipodally symmetric probability distribution on the
unit sphere.

2  Projective  Transformations 

This section briefly summarizes properties of projective mappings
between planes and their representation as homographies. Some of the
relevant material can be found in [Mohr91, Faug88, Tsai82]. For a more
comprehensive discussion of projective transformations, the reader is
invited to consult a projective geometry text such as [Spri64].


2.1  Homographies 


A general projective transformation between planes can be written
algebraically as

$$ X' = \frac{aX + bY + c}{gX + hY + i}, \qquad Y' = \frac{dX + eY + f}{gX + hY + i} \tag{1} $$

where $(X, Y)$ and $(X', Y')$ are points represented in the 2D local
coordinate systems of each plane. Unfortunately, this is a nonlinear
transformation that is undefined when the denominator is zero,
corresponding to a point mapping to infinity.


In order to make a projective transformation bijective, a line of
points at infinity is explicitly added to each plane, to correspond to
the cases where the denominator in (1) goes to zero. A plane that has
been augmented in this way is a new geometric entity called the
projective plane. The projective plane has a different global topology
than the affine or Euclidean plane, and this has implications for the
representation of observed points and their uncertainty. This topic is
explored in Section 4.

Using homogeneous coordinates, infinite points can be manipulated the
same as finite ones, and the transformation of equation (1) becomes
linear:

$$ k \begin{bmatrix} X' \\ Y' \\ S' \end{bmatrix} = \begin{bmatrix} a & b & c \\ d & e & f \\ g & h & i \end{bmatrix} \begin{bmatrix} X \\ Y \\ S \end{bmatrix} \tag{2} $$

where $k$ is a nonzero scalar, and $S$ and $S'$ are 1 for finite
points in the plane and 0 for infinite points. Since homogeneous
coordinates are equivalent up to scalar multiples, the transformation
matrix can be multiplied by any nonzero constant and still represent
the same mapping, and therefore has only 8 independent parameters. A
nonsingular projective mapping that is linear in homogeneous
coordinates is called a homography. Matrices representing homographies
form a group under matrix multiplication and matrix inverse.
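The linear form can be illustrated with a minimal Python sketch. The function names `apply_homography` and `to_affine` and the example matrix are hypothetical, chosen only for illustration. The sketch shows a finite point mapping to a finite point, a finite point mapping to a point at infinity (third coordinate zero), and a rescaled matrix giving the same mapping.

```python
def apply_homography(H, point):
    """Map a homogeneous point (X, Y, S) through the 3x3 matrix H."""
    return tuple(sum(H[r][c] * point[c] for c in range(3)) for r in range(3))


def to_affine(point, eps=1e-12):
    """Return (X/S, Y/S) for a finite point, or None for a point at infinity."""
    x, y, s = point
    if abs(s) < eps:
        return None
    return (x / s, y / s)


# Hypothetical example matrix with rows (a, b, c), (d, e, f), (g, h, i).
H = [[2.0, 0.0, 1.0],
     [0.0, 1.0, 0.0],
     [1.0, 0.0, 1.0]]

# Finite point (1, 2): the denominator g*X + h*Y + i equals 2.
finite = apply_homography(H, (1.0, 2.0, 1.0))

# Finite point (-1, 3): the denominator is 0, so the image is a point at infinity.
vanishing = apply_homography(H, (-1.0, 3.0, 1.0))

# Multiplying H by a nonzero constant leaves the mapping unchanged.
H_scaled = [[5.0 * v for v in row] for row in H]
scaled = apply_homography(H_scaled, (1.0, 2.0, 1.0))
```

Here `to_affine(vanishing)` returns `None` rather than dividing by zero, which is the case equation (1) cannot express.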

Because  they  are  linear,  invertible  and  closed  un¬ 
der  composition,  homographies  greatly  simplify  the 
analysis  of  projective  mappings.  Figure  1  shows  a 
familiar  computer  vision  scenario  where  pictures  are 
taken  of  a  single  planar  surface  by  two  cameras  from 
different  viewpoints. 

Under the pinhole camera model, each camera effects a homography (in
fact a perspectivity) from the object plane to a pinhole image plane,
with all point correspondences lying on lines passing through the
focal point.^1 The perspectivities are labeled $H_1$ and $H_2$ in
Figure 1. In actuality, the pinhole camera model does not adequately
characterize images produced by real cameras. If the deviation of a
particular camera from the pinhole model is governed by linear camera
calibration parameters, the resulting image is an affine
transformation of the pure pinhole projection [Horn86]. These
transformations are labeled $C_1$ and $C_2$ in the figure. Affine
transformations are a subgroup of the homography group, and the
composition of a pinhole perspectivity followed by an affine
deformation yields yet another homography. It is therefore easy to
derive the transformation between any two planes in the diagram; for
instance the transformation mapping points in image 1 into
corresponding points in image 2 is $C_2 H_2 H_1^{-1} C_1^{-1}$.

^1 In this paper the term point correspondence denotes two points
related by any plane to plane homography, not just image to image
mappings.

Figure 1: An object plane viewed by two cameras whose deviations from
the pinhole camera model can be described by linear camera parameters.
Corresponding points in any two planes in this diagram are related by
a homography. Refer to the text.

Early research in the field studied the homography $H_2 H_1^{-1}$
relating pinhole images of coplanar points, and showed that a
decomposition of the homography matrix allows recovery of the relative
positions of the two cameras with respect to each other and to the
object plane [Faug88, Tsai82]. Recent work has focused on the object
plane to image plane mapping $C_1 H_1$, and more importantly its
inverse $H_1^{-1} C_1^{-1}$, which backprojects image plane points to
their appropriate object plane positions regardless of camera location
or linear distortion parameters. This backprojection lies at the core
of recent work by Mohr [Mohr91, Mohr90], as well as the approach to
model extension in Section 3.
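A small Python sketch of these compositions follows. All four matrices are made-up values standing in for the perspectivities $H_1$, $H_2$ and the affine calibration maps $C_1$, $C_2$, and the helper names are hypothetical. It composes the image 1 to image 2 mapping and the backprojection from image 1 to the object plane.

```python
def matmul(A, B):
    """Product of two 3x3 matrices stored as nested lists."""
    return [[sum(A[r][k] * B[k][c] for k in range(3)) for c in range(3)]
            for r in range(3)]


def inverse(A):
    """Inverse of a nonsingular 3x3 matrix via the adjugate."""
    (a, b, c), (d, e, f), (g, h, i) = A
    det = a * (e * i - f * h) - b * (d * i - f * g) + c * (d * h - e * g)
    if abs(det) < 1e-12:
        raise ValueError("matrix is singular, so it is not a homography")
    adj = [[e * i - f * h, c * h - b * i, b * f - c * e],
           [f * g - d * i, a * i - c * g, c * d - a * f],
           [d * h - e * g, b * g - a * h, a * e - b * d]]
    return [[v / det for v in row] for row in adj]


def apply(H, p):
    """Apply H to a homogeneous point and rescale the third coordinate to 1."""
    x, y, s = (sum(H[r][c] * p[c] for c in range(3)) for r in range(3))
    return (x / s, y / s, 1.0)  # assumes the result is a finite point (s != 0)


# Hypothetical perspectivities (H1, H2) and affine calibration maps (C1, C2).
H1 = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.1, 0.0, 1.0]]
H2 = [[1.0, 0.0, 2.0], [0.0, 1.0, 0.0], [0.0, 0.2, 1.0]]
C1 = [[2.0, 0.0, 1.0], [0.0, 2.0, 1.0], [0.0, 0.0, 1.0]]
C2 = [[1.0, 0.1, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]

object_to_image1 = matmul(C1, H1)
object_to_image2 = matmul(C2, H2)
# C2 H2 H1^-1 C1^-1, written as (C2 H2)(C1 H1)^-1.
image1_to_image2 = matmul(object_to_image2, inverse(object_to_image1))
backproject1 = inverse(object_to_image1)

P = (1.0, 2.0, 1.0)                 # a point on the object plane
x1 = apply(object_to_image1, P)     # its image in camera 1
x2 = apply(object_to_image2, P)     # its image in camera 2
```

Applying `image1_to_image2` to `x1` reproduces `x2`, and applying `backproject1` to `x1` recovers `P`, in both cases without separating camera location from calibration.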


Unfortunately, the above analysis cannot be carried out for cameras
dominated by nonlinear lens distortion. Images of colinear points may
then no longer be colinear, and the plane to plane mappings are no
longer homographies. In the remainder of this paper nonlinear camera
parameters are neglected; when such lens distortions are nonnegligible
a preprocessing step must be performed to remove their effects
[Gros90].

2.2  Estimation  of  a  homography 

The fundamental theorem of projective geometry states that a plane to
plane homography is completely determined by the correspondences of 4
points, no three of which are colinear, or by 4 corresponding lines,
no three of which meet at a point. In practice it is better to use as
many point and line correspondences as possible to reduce errors in
the estimated transformation caused by noise in the observed image
data.

Faugeras and Lustman present a least squares approach to estimating a
homography from at least 4 points [Faug88]. Each point to point
correspondence adds two constraints on the eight independent
parameters to be estimated. Using the notation of equation (2) these
constraints are

$$ aX + bY + c - X'(gX + hY + i) = 0 $$

$$ dX + eY + f - Y'(gX + hY + i) = 0. $$

Since possible solutions for the set of parameters are equivalent up
to scalar multiples, a further constraint like $i = 1$ is imposed to
provide a unique solution. One problem with the above constraints is
that they are only valid when all the points are finite ($S$ and $S'$
in equation (2) must not be zero). When one or both of the points in a
correspondence are infinite, a modified pair of constraints can be
used [Coll91]. Making sure that the constraint equations properly
handle infinite points is important, since a standard coordinatization
of the projective plane with respect to four basis points involves
computing a homography mapping finite points to infinite points
[Spri64]. Points at infinity also arise in practical applications (see
Section 4).
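The two constraints per correspondence can be stacked into a linear system and solved in the least squares sense. The sketch below is one possible realization, not the implementation of [Faug88]. It assumes all points are finite, fixes $i = 1$, and solves the normal equations with a small Gaussian elimination routine; the names `solve_linear` and `estimate_homography` are hypothetical.

```python
def solve_linear(A, b):
    """Solve A x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    M = [list(A[r]) + [b[r]] for r in range(n)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(M[r][col]))
        if abs(M[pivot][col]) < 1e-12:
            raise ValueError("degenerate configuration (singular system)")
        M[col], M[pivot] = M[pivot], M[col]
        for r in range(col + 1, n):
            factor = M[r][col] / M[col][col]
            for c in range(col, n + 1):
                M[r][c] -= factor * M[col][c]
    x = [0.0] * n
    for r in range(n - 1, -1, -1):
        tail = sum(M[r][c] * x[c] for c in range(r + 1, n))
        x[r] = (M[r][n] - tail) / M[r][r]
    return x


def estimate_homography(src, dst):
    """Least squares homography from >= 4 finite point pairs, fixing i = 1.

    src holds (X, Y) points and dst the corresponding (X', Y') points.
    """
    rows, rhs = [], []
    for (X, Y), (Xp, Yp) in zip(src, dst):
        # aX + bY + c - X'(gX + hY) = X'
        rows.append([X, Y, 1.0, 0.0, 0.0, 0.0, -Xp * X, -Xp * Y])
        rhs.append(Xp)
        # dX + eY + f - Y'(gX + hY) = Y'
        rows.append([0.0, 0.0, 0.0, X, Y, 1.0, -Yp * X, -Yp * Y])
        rhs.append(Yp)
    # Normal equations (A^T A) p = A^T b for the parameters p = (a, ..., h).
    AtA = [[sum(r[j] * r[k] for r in rows) for k in range(8)] for j in range(8)]
    Atb = [sum(r[j] * v for r, v in zip(rows, rhs)) for j in range(8)]
    a, b, c, d, e, f, g, h = solve_linear(AtA, Atb)
    return [[a, b, c], [d, e, f], [g, h, 1.0]]
```

Fixing $i = 1$ excludes homographies whose true $i$ is zero, and correspondences involving points at infinity would need the modified constraints mentioned above, which this sketch does not attempt.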

Homographies preserve colinear structure, and the image of a line
under a homographic mapping is the locus of its transformed points.
The homogeneous coordinates representing a line are formed as the
vector cross-product $p \times q$ of the homogeneous coordinates $p$
and $q$ of any two distinct points on the line. Under a general
homography $A$, the line passing through points $p$ and $q$ now passes
through $Ap$ and $Aq$, and its homogeneous coordinates therefore
become $Ap \times Aq$. It is straightforward, though slightly tedious,
to show that for any nonsingular 3x3 matrix $A$,

$$ Ap \times Aq = [A^{-1}]^T (p \times q), $$

where the equality holds up to the nonzero scale factor $\det A$,
which is immaterial in homogeneous coordinates. This means that the
homography matrix that maps line coordinates to line coordinates under
a given projective mapping is the transpose of the inverse of the
matrix mapping points to points, and vice versa.
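The identity can be checked numerically with a short Python sketch. It uses the cofactor matrix of $A$, which equals $\det A$ times the inverse transpose, so the comparison is exact in integer arithmetic. The example matrix and points are made up, and the helper names are hypothetical.

```python
def cross(u, v):
    """Vector cross product of two 3-vectors."""
    return (u[1] * v[2] - u[2] * v[1],
            u[2] * v[0] - u[0] * v[2],
            u[0] * v[1] - u[1] * v[0])


def matvec(A, v):
    """Product of a 3x3 matrix and a 3-vector."""
    return tuple(sum(A[r][c] * v[c] for c in range(3)) for r in range(3))


def cofactor_matrix(A):
    """Cofactor matrix of a 3x3 matrix, equal to det(A) times the inverse transpose."""
    return [list(cross(A[(r + 1) % 3], A[(r + 2) % 3])) for r in range(3)]


A = [[2, 0, 1], [0, 1, 0], [1, 0, 1]]   # hypothetical point homography
p, q = (0, 0, 1), (1, 1, 1)             # two points on the line y = x

line = cross(p, q)                                # line coordinates before mapping
via_points = cross(matvec(A, p), matvec(A, q))    # map the points, then join them
via_line_map = matvec(cofactor_matrix(A), line)   # map the line coordinates directly
```

Both routes give the same line coordinates, and the mapped line is incident with the mapped points (their dot product is zero).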


Using  the  above  relationship  a  line  homography 
can  always  be  converted  to  the  corresponding  point 
homography.  Since  line  segment  extraction  is  a  more 
global  process  than  point  extraction,  it  seems  reason¬ 
able  to  expect  a  point  homography  derived  by  trans¬ 
forming  an  estimated  line  homography  to  be  more  ac¬ 
curate  than  one  estimated  directly  from  points. 

3  Planar  Surface  Model  Extension 

In this section an approach to planar model extension is described
that assumes at least four linear features (points or lines) on the
object plane are already known. The transformation mapping known
object features to their corresponding images is a homography
containing information about the camera location and imaging
parameters. By inverting this transformation the positions of new
points and lines on the object face are determined with respect to the
known reference features, without having to solve first for camera
location or intrinsic calibration parameters.

This approach is similar to one used by Mohr [Mohr90]. Mohr locates
points on an object surface using pairs of cross ratios between an
object point and four known object locations. Since the cross-ratio is
invariant under homographies, the values in a cross-ratio pair can be
computed directly from the image. It is possible to rewrite the
mapping effected by Mohr's cross-ratio algorithm as a homography
matrix. When exactly four point correspondences are used, the
homography estimated using the least squares approach of Section 2.2
reduces to that used by Mohr. When more than four point or line
correspondences are known, direct least squares estimation of the
homography matrix is probably more accurate. Furthermore, the new data
fusion approach presented in Section 4 allows positions estimated from
several images to be merged to derive ever more accurate point and
line positions from noisy observations.

An image sequence from the pose estimation literature was chosen to
illustrate model extension [Kuma90]. Figure 2 shows a typical image
from a sequence of 20 images taken by mounting the camera on a PUMA
robot arm and rotating the arm 4 degrees between consecutive views.
Ground truth data for the labeled points was measured in feet to two
decimal places (0.01 ft ≈ 1/8 in). This sequence has been analyzed by
Kumar using a robust pose estimation algorithm, using ground truth
positions of the 12 points marked with crosses to estimate camera
pose, then estimating depths for the 20 points marked with circles.

Figure 2: A typical image from Kumar's PUMA sequence.

The reference points used by Kumar occur in clusters of 4 points on 3
different planar surfaces. In the following experiment, each of these
3 surfaces was treated as a separate object plane containing 4
reference points (crosses) and 4 test points (circles). Circled points
1-4 and 17-20 were not used in this experiment. Since the 3D ground
truth coordinate system was aligned with the natural axes of the room,
a local 2D coordinate system for each object plane was formed by
dropping the coordinate that didn't change for the points in that
plane. In this way each measured 3D object point was converted to a 2D
point in a local object plane coordinate system.
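The coordinate-dropping step can be sketched as below. The function name `to_plane_coordinates`, the tolerance, and the sample coordinates are hypothetical and are not values from the PUMA sequence. The sketch assumes the plane is aligned with one of the coordinate axes, as in the experiment.

```python
def to_plane_coordinates(points_3d, tol=1e-9):
    """Drop the coordinate that is constant across an axis-aligned plane.

    Returns the points as 2D tuples in the local plane coordinate system.
    """
    for axis in range(3):
        values = [p[axis] for p in points_3d]
        if max(values) - min(values) <= tol:
            return [tuple(v for k, v in enumerate(p) if k != axis)
                    for p in points_3d]
    raise ValueError("points do not lie in an axis-aligned plane")


# Hypothetical points on a wall at x = 3.0 (units of feet).
wall = [(3.0, 0.0, 0.0), (3.0, 1.0, 0.0), (3.0, 0.0, 2.0), (3.0, 1.0, 2.0)]
local = to_plane_coordinates(wall)
```

With measured data the constant coordinate will carry measurement noise, so `tol` would need to be set near the measurement precision.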

For each of the 3 object planes, the 4 reference points on that plane
were matched by hand with their corresponding image points, and the
linear system of equations in Section 2.2 was used to find the
homography mapping image plane points onto the object plane points.
The estimated homography was used to backproject the images of circled
test points in the same plane back into the object plane, where their
estimated positions were compared with the known ground truth
locations. This process was repeated for each of the 20 images in the
sequence.

Distance errors in points 9 through 12 are consistently low, around
1/3 of an inch, and never more than 3/4 of an inch.^2 This is because
the chosen reference points are well spread out in that object plane,
and completely surround the estimated test points. On the other hand,
errors in points 13 and 15 are quite high: on average around 2 feet,
and in some cases as much as 13 feet! This occurs because the
reference points on that object plane do not adequately span the space
of points to be estimated. In fact three of the reference points are
very nearly colinear, a poor configuration for homography estimation.

^2 All reported distance measurements lie entirely within the object
plane; the notion of distance from the camera never arises in this
framework.

To explore the relative accuracy of the estimated point positions, all
distances between estimated points in a single object plane were
compared to the ground truth distances, and percentage errors were
computed. Figure 3 shows the results. The light grey curve for each
plane shows the average percentage error taken over all point to point
distances in that object plane for each image. Plane 2 (containing
points 9-12) shows the most accurate results; the average percentage
error over all images is 1.1%, around the level of noise in the ground
truth measurements. Plane 1 (points 5-8) shows slightly worse errors;
the average percentage error over all images is 3.5%. Finally, plane 3
(points 13-16) shows markedly bad errors: the average error over all
images is 51%, and for some images the average error is as high as
200%.
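One way to compute such a figure for a single image is sketched below. The text does not give the exact formula, so the sketch assumes the absolute relative error of each pairwise distance, averaged over all point pairs in the plane. The function name and the sample data are hypothetical.

```python
from itertools import combinations
from math import dist


def mean_percent_distance_error(estimated, truth):
    """Average percentage error over all interpoint distances in one plane.

    estimated and truth are equal-length lists of 2D points in the local
    object plane coordinate system, listed in the same order.
    """
    errors = []
    for j, k in combinations(range(len(truth)), 2):
        true_d = dist(truth[j], truth[k])
        est_d = dist(estimated[j], estimated[k])
        errors.append(100.0 * abs(est_d - true_d) / true_d)
    return sum(errors) / len(errors)


# Hypothetical data: estimates that are uniformly 10% too spread out.
truth = [(0.0, 0.0), (1.0, 0.0), (1.0, 1.0), (0.0, 1.0)]
estimated = [(1.1 * x, 1.1 * y) for x, y in truth]
error = mean_percent_distance_error(estimated, truth)
```

Because only interpoint distances are compared, this measure is unaffected by a rigid shift of all estimates, which is why it is described as a relative accuracy measure.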

Even in planes where points are estimated with high accuracy, the
level of accuracy varies unpredictably from image to image. This is
typical for a system that makes no use of previous estimates. In
contrast, the smoother dark curve overlaid on each graph shows average
percentage point to point distance errors when position estimates for
previous images are combined with current position estimates using the
data fusion technique described in the next section. A detailed
analysis of the results appears in [Coll91].

4  Merging  Geometric  Information 

A  method  for  merging  geometric  information  derived 
from  multiple  views  is  proposed  in  this  section,  based 
on  fusing  data  points  in  the  projective  plane.  From 
each  image  in  the  above  sequence,  a  homography  is 
estimated  that  backprojects  image  points  onto  the  ob¬ 
ject  plane,  thereby  providing  an  estimate  for  the  lo¬ 
cation  of  each  object  point.  Over  multiple  images, 
multiple  location  estimates  are  obtained.  Each  point 
location  estimate  in  homogeneous  coordinates  repre¬ 
sents  a  point  in  the  projective  plane;  multiple  location 
estimates  for  each  object  point  form  a  sample  of  points 
in  the  projective  plane,  clustered  around  the  point  in 
the  projective  plane  representing  the  homogeneous  co¬ 
ordinates  of  the  true  object  point  location.  This  sec¬ 
tion  describes  a  method  for  estimating  the  true  point 
position  from  its  sample  cluster. 

4.1 Probabilities in the Projective Plane

There are many ways to visualize the projective plane. In Section 2
the projective plane is described as the Euclidean plane augmented
with a line of points at infinity. This is not the best way to
visualize the projective plane, however, since the Euclidean plane is
topologically open, while the projective plane is topologically
closed. To see why adding points at infinity closes the plane,
consider a hypothetical traveler in the plane following a ray starting
at the origin, going through point $(x, y)$, and continuing out
infinitely far. After finally "arriving" at infinity, the traveler is
located at point $(x, y, 0)$ in homogeneous coordinates. But in
homogeneous coordinates this is the same point as $(-x, -y, 0)$, so
the intrepid explorer can keep traveling "past infinity" in the same
direction, eventually passing through point $(-x, -y)$ and finally
returning to the origin.

Figure 3: Graphs showing for each object plane the average percentage
distance error between estimated interpoint distances and ground truth
distances.

Because of this wraparound effect, if the topology of the projective
plane is ignored and it is treated as a Euclidean plane, a single
cluster of points centered around a point at infinity will appear as
two clusters infinitely far apart. Any estimation technique based on
"averaging" these points using a Gaussian distribution in the plane
will produce bad results in this case, because the unimodal Gaussian
distribution is a terrible approximation to the underlying bimodal
distribution.

Proper handling of points at infinity is not just of theoretical
interest. Such points do arise in practice. For instance, parallel
lines on an object plane often project in the image as lines that
converge to a vanishing point. When backprojected from the image plane
to the object plane, the vanishing point becomes a point at infinity,
thus a cluster of vanishing point estimates becomes a cluster of
points at infinity [Coll90a]. Furthermore, the homogeneous coordinates
of lines in the image plane may have third coordinates that are zero
or nearly so. Treated as points in the projective plane, these line
coordinates are essentially points at infinity.

Since the projective plane is topologically closed, it is better to
think of it as a closed 2D space like the surface of a sphere. More
formally, define an equivalence relation $\sim$ on
$\mathbb{R}^3 - \{(0,0,0)\}$ such that
$(x_1, x_2, x_3) \sim (y_1, y_2, y_3)$ iff there exists some nonzero
$k$ such that $x_i = k y_i$, $i = 1, 2, 3$. The projective plane is
defined as the quotient space^3

$$ (\mathbb{R}^3 - \{(0,0,0)\}) / \sim . $$

Viewing $\mathbb{R}^3$ geometrically as Euclidean 3-space, each member
of the quotient space is an equivalence class of points along an
infinite line through the origin (excluding the origin itself, which
would otherwise need to be a member of all the equivalence classes).
Consider now the surface of the unit sphere
$S^2 = \{(x_1, x_2, x_3) \mid x_1^2 + x_2^2 + x_3^2 = 1\}$, and form
the quotient space $S^2 / \sim$. Each equivalence class now contains
one pair of diametrically opposite points. Equating these equivalence
classes with those of the projective plane in the obvious way shows
that the surface of the unit sphere with antipodal points equated is
isomorphic to the projective plane.
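The sphere representation can be sketched in a few lines of Python. The function names `to_sphere` and `same_projective_point` and the tolerance are hypothetical. Homogeneous coordinates are scaled to unit length, and two coordinate triples are taken to represent the same projective point when their unit vectors agree up to sign.

```python
from math import sqrt


def to_sphere(h):
    """Scale homogeneous coordinates to unit length.

    The result is one of the two antipodal representatives on the sphere.
    """
    norm = sqrt(sum(v * v for v in h))
    if norm == 0.0:
        raise ValueError("(0, 0, 0) is not a point of the projective plane")
    return tuple(v / norm for v in h)


def same_projective_point(u, v, tol=1e-9):
    """True when u and v are nonzero scalar multiples of each other."""
    a, b = to_sphere(u), to_sphere(v)
    dot = sum(x * y for x, y in zip(a, b))
    # Unit vectors on the same axis have dot product +1 or -1.
    return abs(abs(dot) - 1.0) < tol
```

For example, `same_projective_point((1, 2, 0), (-3, -6, 0))` holds for the two "ends" of the same point at infinity, whereas `(1, 2, 1)` and `(1, 2, 0)` are distinct points.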

The most important benefit to come from this isomorphism is that it
allows probability distributions on the sphere to be reinterpreted as
distributions in the projective plane. Since diametrically opposite
points on the sphere must be treated as equivalent in order to
represent the projective plane, an appropriate distribution must
possess the property of antipodal symmetry, i.e. the probability value
at any point on the sphere must be the same as the probability at the
diametrically opposite point.

A useful characterization of distributions on the sphere is presented
in Beran [Bera79]. Beran considers exponential distributions on the
sphere, that is, distributions of the form $\exp\{P\}$ where $P$ is a
polynomial evaluated over the surface of the sphere. This assumption
is not as restrictive as it seems, since any strictly positive
function $F$ on the sphere can be represented as $\exp\{\ln\{F\}\}$.
One good reason for considering exponential distributions is their
ease of use in maximum likelihood estimation [Mend87].

^3 Note this is also the space of homogeneous coordinates.

Assuming a distribution of the form $\exp\{P\}$, Beran notes that the
polynomial $P$ can be decomposed using spherical harmonics, analogous
to the way polynomials in Euclidean space are decomposed using Fourier
analysis. If the distribution is required to have antipodal symmetry,
all odd order harmonics are identically zero. This leaves an
expression $\exp\{Y_0 + Y_2 + Y_4 + \ldots\}$. The zeroth harmonic is
a constant, so the $\exp\{Y_0\}$ term can be factored out and absorbed
into the distribution's normalization constant. Therefore, the low
order approximation to any antipodally symmetric exponential
distribution on the sphere is of the form $\exp\{Y_2\}$. A
distribution having this form has already been studied in the
statistical literature, where it is called Bingham's distribution
[Bing74].

Bingham's distribution can be described as a trivariate Gaussian
vector with zero mean and arbitrary covariance matrix, conditioned on
the length of the vector being unity. Bingham's distribution thus
represents the portion of a trivariate Gaussian distribution that
intersects the surface of the unit sphere, with varying ellipsoidal
shapes of the underlying Gaussian contours producing a variety of
distributional forms on the sphere (see Figure 4). Bingham's
distribution has been used previously in a computer vision setting to
represent uncertainties in line and plane orientations estimated from
vanishing point analysis and stereo line correspondences [Coll90b,
Coll90a].


Figure 4: Bingham's Distribution - representative contours for varying
shape parameter magnitudes.

4.2 Data Fusion in the Projective Plane

A point or line location estimate in homogeneous coordinates
represents a point in the projective plane; multiple estimates form a
sample of data points in the projective plane. To fuse data points in
the projective plane each point is assumed to be a noisy observation
of the true point location. The analysis of the last section shows
that Bingham's distribution is a first order approximation to any
noise process in the projective plane; therefore it is assumed that
observed points are corrupted by a Bingham noise process, centered
about the true point location.

Normalizing the homogeneous coordinates of each sample point yields an
antipodal pair of points (or equivalently, an axis) on the unit
sphere. Assuming a manageable level of noise, the normalized sample
points form a cluster of points on the sphere distributed according to
a bipolar Bingham distribution (Figures 4b and 4c). Assuming no bias
in the observations, the polar axis of this distribution should
coincide with the normalized homogeneous coordinates of the true point
in the projective plane. An estimate of the true point position can
therefore be obtained as an estimate of the polar axis of the Bingham
distribution that best fits the normalized sample points. The most
common method for estimating a distribution's parameters from a sample
of observations is maximum likelihood estimation.

Relevant statistics for Bingham's distribution can be summarized as
follows; a more detailed presentation can be found in [Coll90b,
Bing74]. Given a set of $n$ unit vectors $\phi_i = (x_i, y_i, z_i)$
assumed to be distributed according to Bingham's distribution, a
sufficient statistic for the orientation and shape parameters of the
distribution is the sample second moment or scatter matrix

$$ M = \frac{1}{n} \begin{bmatrix} \sum x_i^2 & \sum x_i y_i & \sum x_i z_i \\ \sum x_i y_i & \sum y_i^2 & \sum y_i z_i \\ \sum x_i z_i & \sum y_i z_i & \sum z_i^2 \end{bmatrix} $$

Since the scatter matrix is a symmetric real matrix, it can be
decomposed into $M = A \Lambda A^T$, where $A = [a_1, a_2, a_3]$ is an
orthogonal matrix of eigenvectors, and
$\Lambda = \mathrm{diag}(\lambda_1, \lambda_2, \lambda_3)$ is a
diagonal matrix of corresponding eigenvalues with
$\lambda_1 < \lambda_2 < \lambda_3$ summing up to 1.

It can be shown that the maximum likelihood estimate of the pole of a
bipolar Bingham distribution is the eigenvector associated with the
largest eigenvalue, or vector $a_3$ using the above convention.
Equations for computing confidence regions on the sphere can be found
in [Coll90b, Bing74].
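The fusion step of Section 4.2 can be sketched in Python as follows. This is an illustration under stated assumptions, not the implementation used for the experiments. The scatter matrix is built from normalized samples, and its dominant eigenvector is found by power iteration, a stand-in for a full eigendecomposition that assumes the largest eigenvalue is well separated and the start vector is not orthogonal to the pole. All names and sample values are hypothetical.

```python
from math import sqrt


def normalize(v):
    """Scale a nonzero 3-vector to unit length."""
    norm = sqrt(sum(c * c for c in v))
    return tuple(c / norm for c in v)


def scatter_matrix(samples):
    """Sample second moment matrix (1/n) * sum of phi phi^T over unit vectors phi."""
    units = [normalize(s) for s in samples]
    n = len(units)
    return [[sum(u[r] * u[c] for u in units) / n for c in range(3)]
            for r in range(3)]


def polar_axis(samples, iterations=200):
    """Estimate the pole as the dominant eigenvector of the scatter matrix.

    Power iteration is used here for brevity. The returned unit vector is
    defined only up to sign, since it represents an axis.
    """
    M = scatter_matrix(samples)
    v = normalize((1.0, 1.0, 1.0))  # arbitrary start vector (an assumption)
    for _ in range(iterations):
        v = normalize(tuple(sum(M[r][c] * v[c] for c in range(3))
                            for r in range(3)))
    return v


# Hypothetical noisy estimates of a point near infinity in direction (1, 2).
# Opposite signs are deliberate: they denote the same projective points.
samples = [(1.0, 2.0, 0.01), (-1.0, -2.0, 0.01),
           (1.02, 1.98, -0.01), (-0.98, -2.01, 0.0)]
axis = polar_axis(samples)
```

Because the scatter matrix is built from products $\phi_i \phi_i^T$, flipping the sign of any sample leaves it unchanged. This is why the estimate respects antipodal symmetry, whereas a plain vector average of these samples would nearly cancel.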


5  Comments  and  Future  Work 

An approach to planar model extension has been presented that uses a
priori knowledge of the relative positions of at least four coplanar
points or lines to derive the positions of other points and lines on
the same plane in a way that is invariant to relative camera location
and intrinsic camera parameters. One of the main contributions of this
work is the development of an appropriate methodology for fusing
geometric information in the projective plane.


922 


[FaugSS]  O.D.  Faugeras  and  F.  Lustman,  “Motion  and 
Structure  from  Motion  in  a  Piecewise  Planar 
Environment,”  International  Journal  of  Pattern 
Recognition  and  Artificial  Intelligence,  Vol.  2, 
1988,  pp.  485-508. 

[Gros90]  W.I.  Grosky  and  L.A.  Tamburino,  “A  Unified 
Approach  to  the  Linear  Camera  Calibration 
Problem,”  IEEE  Transactions  on  Pattern  Anal¬ 
ysis  and  Machine  Intelligence,  Vol.  12,  1990, 
pp.  663-671. 

[Horn86]  6.K.P.  Horn,  Robot  Vision,  MIT  Press,  Cam¬ 
bridge,  MA.,  1986. 

[Kuma90]  R.  Kumar  and  A.R.  Hanson,  “Pose  Refinement: 

Application  to  Model  Extension  and  Sensitiv¬ 
ity  to  Camera  Parameters,”  Proceedings  Darpa 
I.  U.  Workshop,  Pittsburgh,  PA.,  September 

1990,  pp.  660-669. 

[Mend87]  J.M.  Mendel,  Lessons  in  Digital  Estimation 
Theory,  Prentice-Hall  Signal  Processing  Series, 
Prentice-Hall,  Inc.,  NJ.  1987. 

[Mohr90]  R.  Mohr  and  E.  Arbogast,  “It  Can  be 
Done  Without  Camera  Calibration,”  Pattern 
Recog.  Letters,  V.  12,  1990,  pp.  39-43. 

[Mohr91]  R.  Mohr  and  L.  Morin,  “Relative  Positioning 
from  Geometric  Invariants,”  Computer  Vision 
and  Pattern  Recognition,  Maui,  Hawaii,  June 

1991,  pp.  139-144. 

[Spti64]  C.E.  Springer,  Geometry  and  Analysis  of  Pro¬ 
jective  Spaces,  W.H.  Freeman  and  Company, 
San  Francisco,  1964. 

[Tsai82]  R.Y.  Tsm,  T.S.  Huang,  and  W.  Zhu,  “Esti¬ 
mating  Three-Dimensional  Motion  Parameters 
of  a  Rigid  Planar  Patch,  II:  Singular  Value  De¬ 
composition,”  IEEE  Transactions  on  Acoustics, 
Speech,  and  Signal  Processing,  Vol.  30,  No.  4, 
1982,  pp.  525-534. 

References 

[Bera79]  R.  Beran,  “Exponential  Models  for  Directional 
Data,”  The  Annals  of  Statistics,  Vol.  7,  No.  6, 

1979,  pp.  1162-1178. 

[Bing74]  C.  Bingham,  “An  Aotipodally  Symmetric  Dis¬ 
tribution  on  the  Sphere,”  7?ie  Annals  of  Statis¬ 
tics,  Vol.  2,  1974,  pp.  1201-1225. 

[Coll9l]  R.T.  ColUns,  “Model  Acquisition  and  Extension 
in  the  Projective  Plane,”  COINS  Technical  Re¬ 
port,  Computer  and  Information  Sciences,  Uni¬ 
versity  of  Massachusetts. 

[Coll90a]  R.T.  Collins  and  R.S.  Weiss,  “Vanishing  Point 
Calculation  as  a  Statistical  Inference  on  the  Unit 
Sphere,”  Proc.  Third  International  Conference 
on  Computer  Vision,  Osaka,  Japan,  December 
1990,  pp.  400-403. 

[Coll90b]  R.T.  Collins  and  R.S.  Weiss,  “Deriving  Line 
and  Surface  Orientation  by  Statistical  Meth¬ 
ods,”  Proc.  Darpa  l.V.  Workshop,  Pittsburgh, 

PA.,  Sept.  1990,  pp.  433-438. 


The approach to data fusion described in Section 4
uses maximum likelihood estimation to fit a set of
distribution parameters to a point sample. This method
implicitly assumes that all points in the sample are
independent and identically distributed. While the
independence assumption may be a necessary evil,
points in the sample will probably not be identically
distributed, since some extracted image features are
more accurate than others. Future work will address
ways of combining point estimates of different, but
estimable, accuracy.

As  with  other  projective  geometric  algorithms  for 
estimating  structure  based  on  homographic  transfor¬ 
mations,  finding  correct  point  and  line  correspon¬ 
dences  is  a  crucial  problem.  Faugeras  and  Lustman 
present  a  hypothesize  and  test  paradigm  [Faug88].  Ini¬ 
tial  guesses  are  made  of  sets  of  geometric  primitives 
that  may  be  coplanar,  and  an  interframe  transforma¬ 
tion  matrix  is  estimated  from  them.  To  the  extent  that 
other  features  in  the  image  are  transformed  in  a  con¬ 
sistent  way,  they  are  added  to  the  hypothesis  set  and  a 
revised  transformation  is  computed;  if  no  matches  con¬ 
sistent  with  the  transformation  estimate  are  found,  the 
current  hypothesis  is  dropped.  This  strategy  could  be 
adapted  to  the  present  task. 

The model extension work described here requires
a priori knowledge of at least four reference points or
lines in each plane in the scene, and only points and
lines on those planes can be reconstructed. An important
item for future research is extensions for estimating
relative locations of points outside the planes of
reference. Mohr [Mohr91, Mohr90] shows that under
limited noise complete 3D reconstructions are possible
using reference points on two different object planes.
This seems a promising line of research.

Where to Look Next using a Bayes Net: An Overview

1 Introduction

Our goal is to formalize and implement mechanisms
for true task-oriented (not task-specific) vision. In our
model of a task-oriented vision system, a question is first
asked about the scene. The system determines what
scene information would be sufficient to answer the question.
The level of detail sufficient to answer the question
may vary among the different pieces of information
extracted from the scene. Specific vision modules are
sequentially brought to bear on selected areas of the
image, the selection and exact processing of each vision
module depending on the results of previously executed
vision modules. Each module produces a partial representation
of the (minimal) information needed to answer
the question, and these results are combined to produce
the answer. Table 1 summarizes the key differences between
task-oriented vision and the classical, passive approach
to computer vision.

A  “where  to  look  next”  capability  can  act  as  the  cog¬ 
nitive  executive  for  active  vision.  If  an  active  compu¬ 
tational  agent  is  subject  to  an  information  load  that 
can  overwhelm  its  resources,  the  executive  can  allow  it 
to  ignore  irrelevant  stimuli,  choose  its  tasks  wisely,  sur¬ 
vive,  and  achieve  its  goals.  Alternatively,  “where  to  look 
next”  can  simply  save  time  and  effort  in  doing  visual 
jobs  that  in  humans  require  attentional  shifts,  such  as 
radiograph  and  CAT  scan  interpretation,  photo  inter¬ 
pretation,  traffic  monitoring,  etc.  Last,  the  task-oriented 
approach  should  make  for  more  dependable  vision  per¬ 
formance  built  from  more  general  (less  domain-specific) 
vision  tools.  The  claim  is  that  current  vision  modules, 
even  relatively  simple  ones,  can  become  useful  and  ro¬ 
bust  when  they  are  carefully  applied  in  a  specific  context. 

Our approach has as background a large amount
of research into visual attention, classical work in eye
movements, and recent advances in active vision, including
camera movements and foveal-peripheral sensors.
Specifically, our tools are decision theory, utility
theory, and Bayesian probabilistic models [3, 4, 6,
7]. Two recent key developments are Bayes nets [9] and
influence diagrams [9, 11]. Applications using these new
techniques are beginning to appear. The first large experimental
system that applied Bayes nets to computer
vision is by Levitt [8]. The formulation of that system
using influence diagram techniques is discussed in [2]. A
sensor/control problem involving a real milling machine
is solved using influence diagram techniques in [1]. A
special kind of influence diagram, called a temporal belief
network, is discussed in [5], and is being studied for
an application in sensor-based mobile robot control.

This material is based on work supported by the National
Science Foundation under Grants numbered IRI-8920771 and
IRI-8903582. The Government has certain rights in this
material.

In  what  follows  we  present  the  basic  framework  of  a 
task-oriented  computer  vision  system,  called  TEA,  that 
uses  Bayes  nets  and  a  maximum  expected  utility  deci¬ 
sion  rule.  Knowledge  about  the  scene  and  about  the 
nature  of  the  specific  task  given  to  the  system  are  rep¬ 
resented  in  the  Bayes  net.  The  decision  of  what  vision 
modules  to  run  is  made  using  a  value/cost  utility  mea¬ 
sure,  where  value  is  based  on  mutual  information  mea¬ 
sured  between  nodes  in  the  Bayes  net  that  correspond  to 
actions  and  to  the  goal  of  the  task.  We  introduce  a  new 
method  for  incorporating  relational  knowledge  both  into 
the  Bayes  net  and  into  the  utility  measure.  The  decision 
of  what  areas  of  the  scene  to  run  a  vision  module  on  can 
be  made  using  this  relational  knowledge.  TEA  mod¬ 
els  camera  movements  and  distinguishes  between  vision 
modules  that  operate  either  on  foveal  or  on  peripheral 
image  data. 

Experimental results are presented from the TEA-0
system, our initial implementation of the general TEA
framework. We also outline TEA-1, which uses a richer
knowledge representation to support more complex visual
tasks. The TEA systems solve the “where to look
next” problem, enabling us to study a problem we call
“how to look”, and by combining our solutions to these
problems we expect to build a true task-oriented vision
system. This paper is intended to be mainly an overview
of our work. A fuller treatment appears in [10].

Passive vision                            Task-oriented vision
----------------------------------------  -----------------------------------
use all vision modules                    use only some vision modules
process entire image                      process areas of the image
maximal detail                            sufficient detail
extract representation first              ask question first
answer question from representation data  answer question from scene data
unlimited resources                       resource limitations

Table 1: Key differences between passive vision and task-oriented vision.

Figure 1: An example scene in the application domain.

2  Preliminaries 

The TEA system runs by iteratively selecting the
evidence-gathering action that maximizes an expected
utility criterion:

1.  List  all  the  executable  actions. 

2.  Select  the  action  with  highest  utility. 

3.  Execute  that  action. 

4.  Attach  the  resulting  evidence  to  the  Bayes  net  and 
propagate  its  influence. 

5.  Repeat,  until  the  task  is  solved. 
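
The five steps above can be sketched as a greedy control loop. This is only an illustrative sketch, not the TEA implementation: the function name `run_tea_loop` and all of its callback parameters (`utility`, `is_executable`, `attach_evidence`, `task_solved`) are hypothetical stand-ins for the Bayes net machinery described in the following sections.

```python
from typing import Callable, Dict, List


def run_tea_loop(
    actions: Dict[str, Callable[[], object]],
    utility: Callable[[str], float],
    is_executable: Callable[[str], bool],
    attach_evidence: Callable[[str, object], None],
    task_solved: Callable[[], bool],
    max_steps: int = 100,
) -> List[str]:
    """Greedy evidence gathering; returns the executed action names in order.

    All callbacks are assumptions made for this sketch:
      utility(name)          -> current expected utility of the action
      is_executable(name)    -> True if the action's preconditions hold
      attach_evidence(n, e)  -> add evidence to the net and propagate it
      task_solved()          -> True once the task belief is good enough
    """
    remaining = dict(actions)
    executed: List[str] = []
    for _ in range(max_steps):
        if task_solved():
            break
        # Step 1: list all the executable actions.
        candidates = [name for name in remaining if is_executable(name)]
        if not candidates:
            break
        # Step 2: select the action with highest utility.
        best = max(candidates, key=utility)
        # Step 3: execute that action (each action is run at most once here).
        evidence = remaining.pop(best)()
        # Step 4: attach the evidence and propagate its influence.
        attach_evidence(best, evidence)
        executed.append(best)
        # Step 5: repeat until the task is solved.
    return executed
```

Because `utility` is re-evaluated on every pass, the ordering of the remaining actions can change as evidence accumulates, which is the behavior the algorithm relies on.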

The  following  section  expands  on  aspects  of  the  above 
algorithm.  But  first  we  present  the  application  domain 
that  we  are  currently  using. 

The  approach  is  applicable  whenever  the  scenes  obey 
regularities  or  have  structure  that  can  be  captured  in  the 
semantic  data  structures  supporting  Bayesian  inference. 
One  example  is  biomedical  images  that  reflect  known  re¬ 
lationships  and  properties  of  anatomy.  Another  is  aerial 
views  of  certain  cultural  areas  such  as  industrial  sites  or 
airports.  There  is  nothing  inherently  2-D  in  the  method. 
With  the  TEA  system  we  currently  use  table  settings. 
Figure  1  shows  a  typical  table  top  scene. 

We assume a spatially-varying sensor, which makes
the “where to look next” question even more central. In
the TEA system the peripheral image is a low-resolution
image of the entire field of view from one camera angle,
and the fovea is a small high-resolution image (i.e. window)
that can be selectively moved within the field of
view.


Figure 2: An example of a Bayes net that describes a
place setting. Action nodes are drawn using dotted lines.


We assume the system cannot view the entire scene
at once. Often a camera movement must be made to an
area of the scene that has not been viewed before.
The target location of such a camera movement must
be determined via relations with other portions of the
scene for which image data is (or previously has been)
available. Following a camera movement the fovea is
centered in the field of view, but afterwards the system
can move the fovea within the field of view. The target
location for a fovea movement is always within the field
of view, so it can be determined either from peripheral
image data or by relations with other portions of the
scene.

Our goal is to support many different visual tasks efficiently.
Each task can be specified by asking a question
about the scene: Where is the butter? Is this breakfast,
lunch, dinner, or dessert? Is this an informal or fancy
meal? How far has the eating progressed? Is this table
messy? We are particularly interested in more qualitative
tasks.

3  Framework  for  Solving  Simple  Tasks 

3.1  Bayes  Nets 

A Bayes net is a way of representing the joint distribution
of a set of variables in a way that is especially useful
for knowledge representation (see [9] for details). For example,
Figure 2 shows a highly simplified Bayes net that
describes a place setting for a meal. Nodes in the net
represent variables. Here, nodes drawn with solid lines
denote parts of a place setting. The variable setting has
four possible values {breakfast, lunch, dinner, dessert}
that denote the respective types of meals. A plate can be
either paper or ceramic. Links in the net represent conditional
probabilities; for example, the link from setting
to napkin represents P(napkin | setting), which says
whether a napkin is expected at each of the possible
meals.

The Bayes net formalism also includes a form of inference.
Formally, belief in the values for node X is defined
as BEL(x) = P(x | e), where e is the combination of
all evidence present in the net. Elegant solutions have
been developed [4, 9] for incorporating a single piece of
evidence into the net and for propagating its effect to all
other nodes in the net.
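
To make BEL(x) = P(x | e) concrete, the sketch below applies Bayes' rule across a single link, from setting to napkin. It is not the general propagation scheme of [4, 9], only the one-link special case, and the prior and likelihood numbers are invented for illustration.

```python
from typing import Dict


def posterior(prior: Dict[str, float], likelihood: Dict[str, float]) -> Dict[str, float]:
    """Return P(x | e), proportional to P(e | x) * P(x), normalized over x."""
    unnormalized = {x: prior[x] * likelihood[x] for x in prior}
    total = sum(unnormalized.values())
    if total == 0.0:
        raise ValueError("evidence has zero probability under the prior")
    return {x: p / total for x, p in unnormalized.items()}


# Hypothetical numbers: a uniform prior over the meal type, and an assumed
# P(napkin = present | setting) for each meal.
prior_setting = {"breakfast": 0.25, "lunch": 0.25, "dinner": 0.25, "dessert": 0.25}
p_napkin_present = {"breakfast": 0.2, "lunch": 0.6, "dinner": 0.9, "dessert": 0.3}

# Belief in setting after observing that a napkin is present.
bel_setting = posterior(prior_setting, p_napkin_present)
```

With these assumed numbers, observing a napkin shifts belief toward dinner, because dinner has the largest assumed likelihood of including one.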

3.2  Adding  Actions  to  a  Bayes  Net 

A node X in our Bayes net has action nodes (drawn
with dotted lines) connected to it, each representing
a variable that is a visual action’s “evidence report”,
which contains a score for each possible value of the
Bayes net node X. An action’s evidence report affects
the BEL values of the parent node X. Before the action
is executed, the action node is a “chance” node
like most of the nodes in the net, and the node contains
the BEL of the evidence report. After an action
is (successfully) executed the action node is changed to
be an “instantiated” node (see [9]) and set to the value
of the evidence report. For example, in Figure 2 the
per-class-utensil action might generate the
evidence report (4.9, 1.4, 3.2), which contains the scores
for each of the objects {fork, knife, spoon}. Action
nodes are needed for computing the utility of actions
given the current evidence present in the net.

All  actions  in  the  system  are  constructed  from  one 
or  more  low-level  vision  modules.  Examples  of  some 
low-level  vision  modules  are:  color  histogram  matching 
and  location-finding,  (edge  magnitude  image)  template 
matching,  and  a  Hough  transform  for  circles. 

One  or  more  vision  modules  may  be  used  in  a  visual 
action.  The  Bayes  net  in  Figure  2,  used  in  one  of  our 
experiments  with  TEA-0,  contains  11  visual  actions.  As 
an  example,  following  is  a  more  detailed  summary  of  the 
actions  related  to  plates: 

•  per-plate.  Use  Hough  transform  for  plate-sized 
circles  to  detect  plate  in  peripheral  image.  Detec¬ 
tion  succeeds  or  fails.  The  location  is  saved.  Use 
color  histogram  to  classify  plate  as  paper  (blue)  or 
ceramic  (green),  using  a  window  centered  on  the 
plate  in  the  peripheral  image. 

•  fov-class-plate.  Plate  location  detected  previ¬ 
ously.  Move  fovea  to  plate.  Use  color  histogram 
to  classify  plate  as  paper  (blue)  or  ceramic  (green), 
using  fovea  image  data. 

Complex actions like per-plate above are used in TEA-0,
but are decomposed into simpler actions in TEA-1
(e.g. detect X, move peripheral window, classify X using
peripheral window data, move fovea, classify X using
foveal data, move camera). Some actions have preconditions
that must be satisfied before they can be executed.
TEA-0 uses only one kind of precondition: know the
location of object X.

3.3  Adding  Relations  to  a  Bayes  Net 

Any node in a Bayes net that is bound to an object found
in the scene will have the location of that object stored
at that node. Otherwise, each node X has an expected
area for the expected object associated with that node.
An expected area is determined by applying geometric
relations with each node Yi connected to node X. A
geometric relation between X and Yi uses the location
in node Yi if it is available; otherwise the expected area
at node Yi is used. The expected area at node X is
calculated strictly from relations; the location at node
X is not used in the calculation of the expected area.

In  TEA,  an  expected  area  is  represented  as  a  scalable 
bitmap  denoting  a  subset  of  the  planar  scene  area  whose 
size  depends  on  the  object  it  is  being  related  to.  The 
resolution  of  the  grid  is  currently  the  same  as  that  of 
the  peripheral  images.  TEA-0  allows  relations  between 
siblings,  but  in  our  latest  work  TEA-1  will  have  relations 
between  parent  and  child  nodes. 

Each node Yi produces an expected area for the object
at node X. All these expected areas must be combined
to obtain the final, single, expected area for the object
at node X. In general it will be useful to characterize
the relation depicted by the maps as “must-be”,
“must-not-be” and “could-be”. Combination of two “must-be”
maps would then be by intersection, and in general map
combination would proceed by the obvious set-theoretic
operations corresponding to the inclusive or exclusive
semantics of the relation. In TEA-0 the relations are
“could-be”, and the maps are unioned.
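
The map combination described above can be sketched with plain sets of grid cells. This is an assumption-laden illustration: the representation of an expected area as a set of (row, column) cells and the names `combine_maps` and `coverage` are hypothetical, and only the “could-be” (union) and “must-be” (intersection) cases are covered.

```python
from typing import Iterable, Set, Tuple

Cell = Tuple[int, int]  # (row, column) of one cell in the peripheral-image grid


def combine_maps(maps: Iterable[Set[Cell]], relation: str = "could-be") -> Set[Cell]:
    """Combine the expected-area maps contributed by the related nodes Yi."""
    map_list = [set(m) for m in maps]
    if not map_list:
        return set()
    if relation == "could-be":
        # Inclusive semantics: the object may be in any of the areas.
        return set().union(*map_list)
    if relation == "must-be":
        # Exclusive semantics: the object must satisfy every relation.
        return set.intersection(*map_list)
    raise ValueError(f"unsupported relation: {relation}")


def coverage(area: Set[Cell], rows: int, cols: int) -> float:
    """Fraction of the grid covered by an expected area."""
    return len(area) / (rows * cols)
```

The coverage fraction of the combined map is the kind of quantity that can scale an action's cost when the action is restricted to the expected area.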


3.4  Calculating  an  Action’s  Utility 

Let the action node α have the node A as its parent.
The utility U(α) of an action α is then of the form

\[ U(\alpha) = \frac{V(\alpha)}{C(\alpha)}. \]

C(α) is the cost of executing the action:

\[ C(\alpha) = r_A \, C_0(\alpha). \]

C_0(α) is the execution time of action α on the entire peripheral
or foveal image, and r_A is the percentage of the
image covered by the expected area of the object associated
with node A. For foveal images, r_A = 1.0 since
the entire fovea is always processed. Before any actions
have been executed, no objects have been located, and
so all r_A values are 1.0. Over time, as other objects in
the scene are located and as more and tighter relations
are established, the value of r_A will approach zero.
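
As a small worked sketch of the cost term, the function below scales the full-image execution time by the expected-area fraction, with foveal actions always charged the full time. The function name, parameter names, and example numbers are hypothetical.

```python
def action_cost(full_image_time: float, expected_area_fraction: float,
                foveal: bool = False) -> float:
    """Cost of an action: expected-area fraction times full-image execution time.

    full_image_time        -- execution time on the entire peripheral or foveal image
    expected_area_fraction -- fraction of the image covered by the expected area, in [0, 1]
    foveal                 -- foveal actions always process the whole fovea
    """
    if not 0.0 <= expected_area_fraction <= 1.0:
        raise ValueError("expected_area_fraction must lie in [0, 1]")
    fraction = 1.0 if foveal else expected_area_fraction
    return fraction * full_image_time


# A peripheral action taking 2.0 time units on the full image, restricted
# to an expected area covering a quarter of it, costs 0.5 units.
quarter_cost = action_cost(2.0, 0.25)
```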

V(α) is meant to be the value of the action: how useful
it is for achieving the task’s goal. All actions in a computer
vision system are information gathering actions.
Therefore, the value of an action is strictly a measure of
the information the action provides:

\[ V(\alpha) = I(target, \alpha), \]

where the task’s goal is represented by the node target.
I is Shannon’s measure of average mutual information
(see, e.g. [9]):

\[ I(X, Y) = \sum_{x} \sum_{y} BEL(x, y) \, \log \frac{BEL(x, y)}{BEL(x) \, BEL(y)}, \]

where

\[ BEL(x, y) = BEL(x \mid y) \, BEL(y). \]

The values of BEL(x) and BEL(y) are respectively
available at nodes X and Y in the Bayes net. The values
of BEL(x | y) can be calculated by temporarily instantiating
node Y to each of its values, propagating beliefs,
and taking the resulting BEL(x) as BEL(x | y) [9].
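
The mutual information measure can be sketched directly from BEL(y) and BEL(x | y). In this sketch the marginal BEL(x) is recomputed by summing the joint rather than read from the net, and the logarithm is taken base 2; both choices, like the function name, are assumptions made for illustration.

```python
import math
from typing import Dict


def mutual_information(bel_y: Dict[str, float],
                       bel_x_given_y: Dict[str, Dict[str, float]]) -> float:
    """Average mutual information I(X, Y) in bits.

    bel_y         -- {y: BEL(y)}
    bel_x_given_y -- {y: {x: BEL(x | y)}}, e.g. obtained by temporarily
                     instantiating Y to each value and propagating beliefs
    """
    # Marginal BEL(x) = sum over y of BEL(x | y) * BEL(y).
    bel_x: Dict[str, float] = {}
    for y, p_y in bel_y.items():
        for x, p_x_given_y in bel_x_given_y[y].items():
            bel_x[x] = bel_x.get(x, 0.0) + p_x_given_y * p_y

    info = 0.0
    for y, p_y in bel_y.items():
        for x, p_x_given_y in bel_x_given_y[y].items():
            joint = p_x_given_y * p_y  # BEL(x, y) = BEL(x | y) * BEL(y)
            if joint > 0.0:  # zero-probability terms contribute nothing
                info += joint * math.log2(joint / (bel_x[x] * p_y))
    return info
```

An action whose evidence report is independent of the target node scores zero, while one that determines a uniformly distributed binary target exactly scores one bit.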

It is important to “look ahead” at the future impact
of executing an action. Therefore we use the following
“lookahead” utility function. Recall that action node α
has the node A as its parent.

\[ U(\alpha) = \frac{V(\alpha) + V(\beta)}{C(\alpha) + C(\beta)} + \sum_{X \in Rel(A)} \Delta U(X) \tag{1} \]

where

\[ \beta = \operatorname*{argmax}_{\gamma \in LocPre(A)} \frac{V(\gamma)}{C(\gamma)}. \]

The first term in equation (1) accounts for the future
value of establishing the location of an object. Action
α might detect and locate an object, but not provide
any information (I = 0) toward the task node in the
Bayes net; however, it does locate the object and thereby
satisfy the preconditions of other actions that in turn will
provide information useful for accomplishing the task.
The interpretation of the first term in equation (1) is:
Let β be the “best” action with a precondition that is
satisfied by executing action α. The new utility of action
α is an average over both α and β, more specifically an
average of the value and cost of the two actions α and β.
LocPre(A) is the set of actions with the precondition of
knowing the location of the object associated with node
A.

The second term in equation (1) anticipates the impact
of the expected areas that action α will generate
(by establishing relations with other objects). Rel(A) is
the set of all nodes that are directly helped by location
information about the object associated with node A.
In other words, nodes A and X are siblings or a parent-child
pair and they have a relation map defined between
them. Each node in Rel(A) contributes a term ΔU(X)
to the utility:

\[ \Delta U(X) = \max_{\gamma \in Actions(X)} \left( U(\gamma) \cdot \left( \frac{1}{r} - 1 \right) \right). \]

r is the percent reduction in the expected area for node
X’s object assuming that the location of node A’s object
is known, and is computed by applying the relation
map, but it is applied using the expected rather than the
known size of node A’s object.
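
Equation (1) and the relation term can be sketched as two small functions. This is a hedged illustration: the function names and argument layout are hypothetical, r is treated as a fraction in (0, 1] (an assumption about how the reduction is expressed), and an action with an empty LocPre set is assumed to fall back to the plain value/cost ratio.

```python
from typing import List, Sequence, Tuple


def delta_u(action_utilities: Sequence[float], r: float) -> float:
    """Relation term for one node X: max over its actions of U * (1/r - 1)."""
    if not 0.0 < r <= 1.0:
        raise ValueError("r must lie in (0, 1]")
    return max(u * (1.0 / r - 1.0) for u in action_utilities)


def lookahead_utility(value: float, cost: float,
                      loc_pre: List[Tuple[float, float]],
                      relation_terms: Sequence[float]) -> float:
    """Lookahead utility of an action, following equation (1).

    value, cost    -- V and C of the action itself
    loc_pre        -- (V, C) pairs for the actions whose location
                      precondition this action would satisfy
    relation_terms -- one relation term per node in Rel(A)
    """
    if loc_pre:
        # The "best" enabled action maximizes V / C.
        best_value, best_cost = max(loc_pre, key=lambda vc: vc[0] / vc[1])
    else:
        best_value, best_cost = 0.0, 0.0
    return (value + best_value) / (cost + best_cost) + sum(relation_terms)
```

A detection action with zero value of its own can still score well if it unlocks an informative follow-up action or shrinks the expected areas of related objects.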


3.5  Experimental  Example 

The TEA-0 system is an initial implementation that follows
the technical framework outlined above. TEA-0
works in a simplified domain (a single place setting) and
solves the following task: decide which meal the place
setting is for: breakfast, lunch, dinner, or dessert. The
task is further simplified by assuming that the scene
could contain only a napkin, plate, cup, bowl, and a
single utensil. The entire scene can be viewed in one
image so the system does not use camera movements. A
relationship between the possible objects and the type of
meal was contrived and encoded as the Bayes net model
shown in Figure 2. The goal is to obtain high values
for BEL(setting). The scene was an overhead view of
a single place setting, like one of those in Figure 1. The
values of BEL(setting) before any actions have been
executed are:

BEL(setting)
0.100                   bfast
0.300                   lunch
0.560                   dinner
[illegible in source]   dessert

The initial list of executable actions is:

U(α)       action
0.376946   per-plate
0.156381   per-detect-napkin
0.020900   per-detect-utensil
0.011891   per-cup
0.000509   per-bowl

The system ends up executing the following sequence
of actions: per-plate, per-detect-napkin,
per-detect-utensil, per-class-utensil, per-cup,
fov-class-utensil, per-bowl, after which the task beliefs
indicate that the table is most likely set for a dinner meal:

BEL(setting)
0.004   bfast
0.066   lunch
0.867   dinner
0.063   dessert

The remaining actions have little effect on the task belief.

Figure 3: The organization of a large Bayes net used by
TEA.


4  Framework  for  Solving  More 
Complex  Tasks 

More  complex  questions  will  require  a  more  complex 
Bayes  net  structure  than  used  by  TEA-0.  The  orga¬ 
nization  we  are  pursuing  for  the  TEA-1  system  is  shown 
in  Figure  3.  It  consists  of  three  separate  tree  structures: 
a  PART-OF  tree,  IS-A  trees,  and  a  task  tree. 

The PART-OF tree (an example is shown in Figure 4)
models the physical structure of the scene. We
assume the scene and all the objects in the scene can
be modeled as a hierarchy of parts. All nodes in this
net have the same set of possible values: present and
notPresent. The conditional probability on each network
link indicates the likelihood that a subpart exists.
Each object’s location and expected area are stored
within the object’s node in the PART-OF network.

Figure 4: A PART-OF Bayes net.

Figure 5: An IS-A Bayes net.

An IS-A tree (for example, Figure 5) models an abstraction
hierarchy for each instance of an object in the
scene. The IS-A net is a special kind of network because
the leaf nodes are mutually exclusive (i.e. the object can
only be one thing). Belief nets with this special property
have been developed [4, 9].

One of our scientific goals is to make a tight formal and
practical coupling between “task specific knowledge” and
visual actions. Task specific knowledge is contained in
the task net (for example, Figure 6), and is thus distinguished
from other types of knowledge. One feature of
task knowledge is that subtask nodes could be shared by
several tasks. Questions such as “Is this a fancy meal?”
may be answered using a range of image clues. Some
simple tasks, such as “Where is the butter?”, do not require
a task tree since they only involve one particular
node in a tree.

We want to use the task tree to add task-specificity to
the utility function; basically the idea is to relativize the
utility calculation to the task by “projecting” the knowledge
in the PART-OF and IS-A trees onto the task tree
and computing utilities there. We are in the early stages
of developing the notation, semantics, and implementation
of the interacting trees. We wish to develop general
formulae for these utilities, and to study the benefits
gained by more complex utility functions, as opposed to
simpler utility functions and more complete planning.

Figure 6: A task Bayes net.

5  Conclusions  and  Future  Work 

We are pursuing two main streams of work. One stream
develops the TEA systems, a progression of systems that
support increasingly sophisticated task-oriented vision
by providing solutions to the “where to look next” problem.
The second stream of work uses and extends the
TEA framework to explore broader and more advanced
issues in task-oriented vision: foveal-peripheral vision
algorithms, qualitative visual tasks, limited-context vision
algorithms that gain in robustness or accuracy by
being applied in well-understood circumstances, incremental
visual actions whose results monotonically improve
as more time is spent on them, representations
of 3-D and dynamic spatial relations, head-shifting and
viewpoint planning, and processor scheduling.

5.1  “Where  to  Look  Next” 

TEA-0:  Using  relations.  An  initial  version  of  TEA-0 
has  been  implemented  and  it  should  give  the  basic  idea 
of  our  approach.  TEA-0  will  be  completed  by  imple¬ 
menting  the  full  version  of  relations,  and  by  enabling 
camera  movements. 

TEA-1:  Projecting  utilities  through  a  task  tree.  The 
main  feature  of  TEA-1  is  the  addition  of  multiple  in¬ 
teracting  trees,  which  permit  the  system  to  solve  more 
complex  tasks.  A  preliminary  design  for  projecting  util¬ 
ity  calculations  through  the  task  tree  is  complete  and 
we  have  begun  implementing  it.  A  deeper  issue  is  that 
relational  information  should  be  used  to  modify  proba¬ 
bilities  and  beliefs,  not  just  costs.  Relational  evidence 
must  then  be  formulated  in  a  probabilistic  framework, 
and  expected  areas  used  like  beliefs. 

TEA-2: Planning. TEA-0 and TEA-1 are “myopic”,
making decisions by only looking one step ahead. The
anticipatory utility function is an improvement, trying
to pack look-ahead into the utility of a single action.
Ultimately our problem involves full-scale planning, in
which sequences of actions are evaluated as to their expected
utility. We intend to develop a simple planning
system (for computer vision) using Bayes nets. We do
not propose “planning research” per se, but rather shall
likely use some STRIPS-like planning algorithm. The
idea is to substitute a search in action space rather than
to try to pack all the intelligence into a (quasi-static)
utility function.

5.2  “How  to  Look” 

Limited-Context Vision Algorithms. One claim of this
work is that vision algorithms can be more robust and
reliable if they are known to work in a limited context.
For example, TEA-0 can use simple color histograms for
object identification only because it has foveated a small
area of the image previously. Similarly, restricting input
to a small volume of space means geometric hashing can
work more reliably. We want to explore limited context
effects that arise naturally in task-oriented vision when
the vision problem is known to be simplified or better
specified than normal (by camera actions, foveal processing,
and generally by satisfaction of preconditions).

Incremental actions. We want to investigate vision
modules that can run for different periods of time, improving
their results the longer they run (e.g. some scale
space algorithms, multi-feature classifiers). Such vision
actions are generalizations of TEA-0’s peripheral-foveal
actions, which produce a peripheral result at one cost
and follow it up with a foveal action for a further cost.
An evidence/time function can quantify the incremental
benefit of such an action. New control strategies should
then emerge, such as running a set of incremental actions
cyclically to attain the maximum evidence per unit time
from the set.

5.3  Task-Oriented  Vision 

Our  idea  of  a  true  task-oriented  vision  system  will  be 
achieved  by  bringing  together  solutions  to  the  “where  to 
look  next”  and  the  “how  to  look”  problems. 

Multiple Tasks. We plan to solve multiple tasks in any
given domain using the same set of visual actions. This
exercise will test the generality of our knowledge representations
and visual actions and probably encourage
us to extend and modify both. Also we expect to encounter
interesting new problems for visual actions used
in answering qualitative questions such as “Is this desk
messy?”.

Multiple Domains. We believe that a task-oriented
vision system should be verified using more than one
domain. We shall seek out other domains. A possible
domain is model trains to be monitored on a more or
less complex system of tracks. Another is monitoring
or searching the laboratory space in 3-D and performing
head movements as well as camera movements, and
ultimately dynamic scenes. Medical images are another
possibility emphasizing reliability as opposed to active
vision. Expanding the domains will doubtless mean that
visual actions need to be re-engineered and improved to
apply more generally. Difficulties in encoding or coping
with new domains will motivate extensions and modifications
to our formalisms. New domains may necessitate
the use of more complex knowledge representations, in
particular non-tree Bayes nets.

Abstract

The goal of image understanding systems is typically the
identification of objects in visual imagery and the establishment
of the three-dimensional relationships among
the objects and the viewer. Because of the variety and
scope of knowledge pertinent to vision, however, the acquisition
of both object models and interpretation strategies
remains a major outstanding problem in model-based
image understanding. We contend that learning
techniques must be embedded in vision systems of the
future in order to reduce or eliminate the knowledge engineering
aspects of system construction, and present the
Schema Learning System (SLS) as a prototype system
for learning object recognition strategies.

1  Introduction 

The goal of image understanding systems is typically the
identification of objects in visual imagery and the establishment
of the three-dimensional relationships among
the objects and the viewer. It is a generally accepted
premise that, in many domains, the timely and appropriate
use of relevant knowledge can substantially reduce
the combinatorially explosive search encountered in establishing
‘instance-of’ relationships between image data
and object classes in the knowledge base.

Because  of  the  variety  and  scope  of  knowledge  per¬ 
tinent  to  vision,  the  acquisition  of  both  object  models 
and  interpretation  strategies  remains  a  major  outstand¬ 
ing  problem  in  model-based  image  understanding.  While 
many  vision  algorithms  at  the  low  and  intermediate  lev¬ 
els  are  available,  successful  use  of  knowledge  in  image 
understanding applied to outdoor scenes requires a careful
hand-crafting of the knowledge base ([5]). Typically
this  requires  specifying,  for  each  object  class,  both  a  de¬ 
scription  of  the  generic  object  and  one  or  more  recog¬ 
nition  (control)  strategies  for  instantiating  instances  of 
the  object  to  image  data. 

tract DAAE07-91-C-R035, RADC under contract F30602-91-

The success of many knowledge-based image understanding
systems can be traced to a “small world”
assumption, in which the objects in the domain
are few, the constraints on their descriptions are
tight, and a complete world model is at least a possibility.
Consequently, special-purpose systems are able to define,
structure, and apply relevant task knowledge effectively.
However,  as  the  scope  of  a  system  broadens  towards  a 
domain-independent,  general-purpose  system,  an  unfor¬ 
tunate  chain  of  events  occurs:  the  size  of  the  knowledge 
base  increases,  constraints  on  the  object  descriptions  be¬ 
come  looser  to  account  for  wider  variability,  the  system 
must  make  fewer  assumptions  about  the  types  of  image 
descriptions  necessary  for  matching,  and  the  complexity 
of  matching  increases  substantially. 

There  are  really  two  issues  being  discussed  here: 
the  structure  of  object  and  control  knowledge  in  vision 
systems,  and  the  acquisition  of  this  knowledge.  The  next 
section  briefly  describes  the  knowledge  component  of  the 
VISIONS  image  understanding  system  ([8,  9])  known  as 
the Schema System ([5]), from the point of view of knowledge
structuring and control. Subsequent sections discuss
the role of learning in the automatic acquisition of
portions  of  the  knowledge  base.  It  is  our  contention  that 
learning  techniques  must  be  embedded  in  vision  systems 
of  the  future  in  order  to  reduce  the  cost  of  knowledge 
base  construction. 

2  The  VISIONS  Schema  System 

The  success  of  systems  based  on  the  “small  world”  as¬ 
sumption  has  led  us  to  adopt  a  design  philosophy  that 
partitions  both  knowledge  and  computation  at  a  coarse¬ 
grained  semantic  level.  In  the  VISIONS  system,  the 
coarse-grained  knowledge  is  encapsulated  in  schemas, 
where  each  schema  is  specialized  to  a  single  object  class. 
This encapsulation permits schemas to be “experts” in
the  recognition  of  instances  of  an  object  class  and  per¬ 
mits  the  wide  range  of  control  strategies  necessary  for 
objects  to  be  represented  in  a  natural  way. 

Schema  instances  are  invoked  for  each  object  class 
hypothesized  to  be  in  the  image  data.  These  instances 
execute  independent  (potentially  concurrent)  processes 
called recognition strategies and communicate asynchronously
through a global blackboard. The control
component of each schema directs the application of
general-purpose procedures, called knowledge sources,
to gather the “right kind” of support for (or against)
its hypothesis. Competition and cooperation among the
schema instances result in the combination of multiple,
independent “object experts” into a large-scale system
which constructs internally consistent interpretations.

2.1  Schema  System  Components 

The  schema  system  consists  of  five  basic  components: 
the  schema  hierarchy,  the  blackboard,  the  knowledge 
sources,  the  interpretation  (control)  strategies,  and 
mechanisms  for  evidence  representation  and  combina¬ 
tion.  Each  of  these  is  discussed  very  briefly  in  the  fol¬ 
lowing  sections;  more  detail  may  be  found  in  [5]. 

2.1.1  The  Schema  Hierarchy 

The  schema  system  partitions  both  knowledge  and 
computation  in  terms  of  natural  object  classes  for  a  given 
domain. Schemas reside in class and part/subpart hierarchies;
each class of objects and object parts has a corresponding
schema which stores all object and control
knowledge  specific  to  that  class.  Knowledge  about  ex¬ 
pected  object  contexts  and  relationships  to  other  objects 
is  represented  in  the  system  by  extending  the  concept  of 
an  object  to  include  contextual  or  scene  configurations; 
as  objects,  these  entities  also  have  schemas.  A  subcon¬ 
text  or  “sub-scene”  is  like  an  object  part;  it  is  related  to 
its parent scene or context in predictable ways.

2.1.2  Knowledge  Sources 

Knowledge  sources  are  general-purpose  procedures 
that  generate  the  levels  of  abstract  image  descriptions 
required  for  image  understanding.  Knowledge  sources 
span the gamut of traditional techniques in image processing
(e.g. region, line, curve, and surface extraction,
feature measurement, etc.), through intermediate-level
processes such as initial object hypothesis generation
and grouping operations, to generally useful tools and
techniques such as graph matching. The compile-time
arguments and parameters supplied to general-purpose
knowledge  sources  as  part  of  the  recognition  strategy 
may  specialize  them  for  particular  purposes. 

2.1.3  Interpretation  Strategies 

Interpretation  strategies,  or  simply  strategies,  are 
control  programs  that  run  within  each  schema.  Strate¬ 
gies  procedurally  encode  knowledge  about  which  knowl¬ 
edge  sources  to  apply  and  in  what  order  to  apply  them. 
In  order  to  make  maximal  use  of  parallelism,  schemas 
may  have  multiple  concurrent  strategies.  These  strate¬ 
gies  may  correspond  to  different  methods  for  recognizing 
an  object  or  to  different  conditions  under  which  recogni¬ 
tion  must  take  place.  Schemas  can  also  contain  strategies 
for different subtasks, such as initial hypothesis generation
and hypothesis verification, as well as for managing
the internal bookkeeping details of the schema, such as
updating  the  global  blackboard  when  necessary  and  de¬ 
tecting  and  resolving  conflicts  related  to  the  hypothesis. 

Each  schema  instance  acquires  information  perti¬ 
nent  to  the  hypothesis  it  is  pursuing.  Some  of  this  infor¬ 
mation  is  generic,  to  the  extent  that  its  semantics  are  not 
object  dependent.  For  example,  the  degree  of  confidence 
in  a  hypothesis,  as  well  as  its  (2D)  image  location  and 
(3D)  world  location,  is  generic  information.  Every  object 
hypothesis  has  a  confidence  level  and  an  image  location, 
and  most  have  a  meaningful  3D  location.  The  generic 
information  about  an  object  hypothesis  is  recorded  in  a 
global  hypothesis. 

Most of the information acquired by a schema instance,
on the other hand, is object specific. Information
about  how  well  an  image  region  matches  an  expected 
color,  for  example,  is  non-generic  since  its  importance 
depends  on  the  object  model.  A  color  match  may  be  im¬ 
portant  for  finding  trees,  but  less  so  for  recognizing  au¬ 
tomobiles.  For  this  reason,  all  of  the  information  about 
which  KSs  support  a  particular  hypothesis  and  which  do 
not  is  considered  private  to  the  schema  instance,  and  is 
not  included  in  the  global  hypothesis. 

2.1.4  Blackboard  Communication 

The  schema  system  is  built  around  a  global  black¬ 
board.  The  global  hypotheses  written  to  the  blackboard 
represent  the  image  interpretation  as  it  evolves.  Schemas 
communicate  with  each  other  by  writing  to  and  reading 
from  the  blackboard,  dynamically  exchanging  informa¬ 
tion  about  their  respective  hypotheses.  Although  the 
blackboard  is  divided  into  sections  corresponding  to  the 
object classes (rather than processing levels, as in other
systems),  schemas  may  read  and  write  freely  over  the 
entire  blackboard.  The  division  into  sections  gives  some 
assurance  that  a  schema  will  not  have  to  search  through 
a  large  number  of  irrelevant  messages.  At  the  same  time, 
each schema instance maintains its own local blackboard
for  recording  private  information. 

The distinction between the global and local blackboards
was motivated by both computational and knowledge
engineering concerns. Computationally, most of
the  information  generated  by  an  interpretation  strategy 
concerns  which  KSs  have  been  run,  what  each  KS  re¬ 
turned,  etc.  While  this  information  is  crucially  impor¬ 
tant  within  the  schema  instance  for  making  dynamic  con¬ 
trol  decisions,  it  is  of  little  importance  to  other  schema 
instances.  If  the  strategies  associated  with  multiple  con¬ 
current  schema  instances  continually  dump  this  infor¬ 
mation  to  the  global  blackboard  and  then  read  it  back 
again,  the  blackboard  quickly  becomes  a  computational 
bottleneck.  The  local  blackboards  alleviate  this  problem 
by  reducing  the  message  traffic  on  the  global  blackboard. 

From a knowledge engineering viewpoint, the distinction
between the global and local blackboards promotes
modularity. By allowing only the strict “global
hypothesis” protocol to be exchanged between schemas,
the schema system encourages modularity. Each schema
can maintain local information in an idiosyncratic manner
on its local blackboard, allowing the schema designer
the  freedom  of  any  appropriate  knowledge  representation 
and  control  style.  At  the  same  time,  because  schemas 
communicate  with  each  other  only  through  global  hy¬ 
potheses,  the  designer  of  a  new  schema  is  assured  of  a 
smooth  join  to  the  remainder  of  the  system. 
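
The split between the global and local blackboards can be illustrated with a minimal Python sketch. It is not the VISIONS implementation: the class names `GlobalHypothesis`, `GlobalBlackboard`, and `SchemaInstance`, and the knowledge source name `color-match`, are hypothetical and chosen only for this example. It shows a global store divided into sections by object class, with each schema instance keeping its knowledge source results private.

```python
from collections import defaultdict
from dataclasses import dataclass
from typing import Optional, Tuple


@dataclass(frozen=True)
class GlobalHypothesis:
    """Generic, object-independent facts about one hypothesis."""
    object_class: str
    confidence: str
    image_location: Tuple[int, int]
    world_location: Optional[Tuple[float, float, float]] = None


class GlobalBlackboard:
    """Shared store, divided into one section per object class."""

    def __init__(self):
        self._sections = defaultdict(list)

    def post(self, hypothesis):
        self._sections[hypothesis.object_class].append(hypothesis)

    def read(self, object_class):
        # Return a copy so readers cannot mutate the section in place.
        return list(self._sections.get(object_class, []))


class SchemaInstance:
    """One schema instance with a private local blackboard."""

    def __init__(self, object_class, board):
        self.object_class = object_class
        self.board = board
        self.local = {}  # KS name -> result; never written to the global board

    def record_ks_result(self, ks_name, result):
        self.local[ks_name] = result

    def publish(self, confidence, image_location, world_location=None):
        hypothesis = GlobalHypothesis(
            self.object_class, confidence, image_location, world_location)
        self.board.post(hypothesis)
        return hypothesis


board = GlobalBlackboard()
tree = SchemaInstance("tree", board)
tree.record_ks_result("color-match", "good")      # stays private
tree.publish("belief", image_location=(120, 45))  # visible to all schemas
```

Other schemas see only the published global hypothesis; the record of which knowledge sources ran stays in `tree.local`, which is the traffic reduction the text describes.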

2.1.5  Evidence  Accumulation 

The  current  version  of  the  schema  system  takes 
a  particularly  simple  view  of  evidence  representation 
and  combination.  Confidence  values  lie  along  a  coarse, 
five-point ordinal scale: ‘no evidence’, ‘slim-evidence’,
‘partial-support’, ‘belief’, and ‘strong-belief’. When
combining  evidence,  a  heuristic  mechanism  is  used  that 
involves  the  specification  of  key  pieces  of  evidence  that 
are  required  to  post  an  object  hypothesis  with  a  given 
confidence  to  the  global  blackboard.  Subsets  of  sec¬ 
ondary  evidence  are  used  to  raise  or  lower  these  con¬ 
fidences.  Specifications  of  these  subsets,  and  the  effect 
their confidence has on the overall confidence, are part of
the knowledge engineering effort involved in constructing
a schema. Although this method of evidence representation
and accumulation leaves much to be desired from a
theoretical point of view, it worked surprisingly well in
interpretation experiments on images of New England
house and road scenes [5].
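
One way to picture this heuristic is the following Python sketch. The text does not give the actual rule format, so everything here is an assumption made for illustration: the function `combine`, the rule layout (key rules that set a base confidence, secondary rules that shift it by whole steps), and the knowledge source names `color`, `texture`, and `shape`. Only the five scale labels come from the text.

```python
# The five-point ordinal scale from the text, weakest to strongest.
SCALE = ["no evidence", "slim-evidence", "partial-support",
         "belief", "strong-belief"]


def rank(label):
    """Position of a label on the ordinal scale."""
    return SCALE.index(label)


def combine(evidence, key_rules, secondary_rules):
    """Return the confidence to post, or None if no key rule is met.

    evidence:        dict, KS name -> scale label returned by that KS
    key_rules:       list of (required, confidence); `required` maps a
                     KS name to the minimum label it must reach
    secondary_rules: list of (ks_names, minimum, shift); when every
                     named KS reaches `minimum`, the confidence moves
                     by `shift` steps (negative to lower it)
    """
    def meets(ks, minimum):
        return rank(evidence.get(ks, "no evidence")) >= rank(minimum)

    base = None
    for required, confidence in key_rules:
        if all(meets(ks, minimum) for ks, minimum in required.items()):
            base = rank(confidence)
            break
    if base is None:
        return None  # key evidence missing: nothing is posted

    for ks_names, minimum, shift in secondary_rules:
        if all(meets(ks, minimum) for ks in ks_names):
            base += shift
    # Clamp to the ends of the scale.
    return SCALE[max(0, min(len(SCALE) - 1, base))]


evidence = {"color": "belief", "texture": "partial-support",
            "shape": "slim-evidence"}
key_rules = [({"color": "belief"}, "partial-support")]
secondary_rules = [(("texture",), "partial-support", +1),
                   (("shape",), "belief", +1)]
posted = combine(evidence, key_rules, secondary_rules)
```

In this example the key rule sets the base confidence, the texture subset raises it one step, and the shape subset does not apply because its evidence is too weak.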

2.2  Knowledge  Engineering 

Schemas  are  assembled  by  specifying  (1)  the  appropriate 
set  of  knowledge  sources  to  be  used,  (2)  a  set  of  strate¬ 
gies  which  conditionally  sequence  their  application,  and 
(3)  a  function  to  translate  internal  evidence  into  a  confi¬ 
dence  in  the  global  hypothesis.  One  of  the  main  imped¬ 
iments  to  wide  scale  experimentation  with  the  schema 
system  has  been  the  time  and  energy  required  to  design 
a  schema.  Schema  construction  can  be  viewed  as  an  ex¬ 
ercise  in  experimental  engineering,  in  which  prototype 
schemas  are  developed  using  existing  system  resources. 
These  schemas  must  then  be  tested  on  a  representative 
set  of  objects/images,  failures  noted  and  analyzed,  and 
the  schemas  re-engineered  to  account  for  the  failures.  In 
many  cases,  the  descriptive  information  provided  by  the 
knowledge  sources  may  be  inadequate.  In  this  case,  new 
knowledge  sources  must  be  developed  and  tested  (often 
a  major  research  effort  in  its  own  right),  integrated  into 
the  system,  and  the  schemas  re-engineered  to  make  use 
of  the  new  information. 

The problem of knowledge base construction has
been  a  focus  of  research  for  several  years.  In  artificial 
intelligence,  researchers  have  focused  on  how  to  extract 
knowledge  from  experts;  vision  researchers  have  concen¬ 
trated  instead  on  how  knowledge  bases  should  be  spec¬ 
ified.  By  restricting  the  message  types  written  to  the 
global  blackboard,  the  schema  system  enforces  schema 
modularity in an attempt to make them easier to declare
and improve. The SPAM project at CMU went even farther,
developing a high-level language for describing objects
([12]). Work in Japan has involved both automatic
programming  efforts  and  higher-level  languages  for  spec¬ 
ifying  image  operations  ([11]). 

3  Vision  and  Learning 

For  the  last  two  years  we  have  taken  a  different  approach 
to  knowledge  base  development.  Instead  of  making 
the  knowledge  base  easier  to  program,  we  have  decided 
to  take  the  programmer  out  of  the  loop.  Knowledge- 
directed  vision  systems  should  learn  their  own  interpre¬ 
tation  strategies. 

As  a  first  step  toward  achieving  this  goal  the 
Schema Learning System (SLS; [6, 7]) has been designed.
The task of SLS is to learn interpretation strategies for the
different  object  classes  in  a  domain.  In  particular,  the 
goal  is  to  learn  a  strategy  that  minimizes  the  cost  of 
object  recognition,  subject  to  accuracy  constraints  sup¬ 
plied  by  the  user.  For  example,  a  user  might  request  a 
strategy  for  recognizing  the  (3D)  position  and  pose  of  a 
building,  accurate  to  within  5%.  SLS  would  respond  by 
attempting  to  learn  a  recognition  strategy  that  satisfies 
this  goal  and  could  be  invoked  whenever  the  user  needs 
to  locate  a  building. 

As implied by the scenario above, SLS’s operations
can be divided into two parts: a compile-time
(or “learning-time”) component in which SLS develops
recognition strategies, and a run-time component
in  which  the  interpretation  strategies  are  applied  to  new 
images.  In  general,  SLS  has  been  designed  to  optimize 
run-time  performance,  at  the  expense  of  compile-time 
(learning)  efficiency. 

SLS’s task is made easier by two simplifying assumptions.
First, SLS learns to recognize instances of
each object class independently. This is easier than
learning  concurrent,  cooperating  strategies.  Second,  SLS 
is  given  a  set  of  parameterized  knowledge  sources  from 
which  to  build  its  recognition  strategies.  Thus  SLS  is 
not  required  to  learn  new  knowledge  sources  or  3D  ob¬ 
ject  models,  but  rather  to  learn  control  strategies  for  ap¬ 
plying  knowledge  sources,  including  evidence  weighting 
schemes.  Implicit  in  this  task  statement  is  the  possibility 
that  the  object  model  may  not  be  completely  accurate, 
and  that  some  of  the  knowledge  sources  may  be  mislead¬ 
ing  or  irrelevant. 

3.1  Modeling  the  Interpretation  Process 

SLS, like the schema system before it, adopts the blackboard
model of interpretation and views object recognition
as a process of applying knowledge sources to hypotheses.
Hypotheses are proposed statements about the
image  or  its  interpretation,  whose  type  is  determined  by 
their  level  of  abstraction.  Common  levels  of  abstraction 
for  computer  vision  include:  image,  region,  image  line 
segment, and image segment group (all of which are 2D),
and world line segment, orientation vector, world segment
group, surface patch, face, volume (all of which are
3D).  An  “interpretation”  is  a  set  of  believed  hypotheses 
at  the  level  of  abstraction  requested  by  the  user.  Knowl¬ 
edge  sources  are  image  understanding  procedures  (e.g. 
region  segmentation,  line  extraction  or  vanishing  point 
analysis)  that  can  be  applied  to  one  or  more  hypotheses. 

SLS  refines  the  blackboard  model  of  interpretation 
by  constraining  knowledge  sources  to  fall  into  one  of  two 
classes. Generation knowledge sources (GKSs) create
new  hypotheses  at  a  given  level  of  abstraction.  For  ex¬ 
ample,  a  stereo  line-matching  algorithm  that  produces 
a  (3D)  world  line  segment  from  a  pair  of  (2D)  image 
line  segments  is  a  GKS.  Verification  knowledge  sources 
(VKSs)  return  discrete  evidence  values  about  hypotheses 
at  the  level  of  abstraction  they  apply  to.  An  example  of 
a  VKS  is  a  pattern  matching  algorithm  that  determines 
if  the  color  or  texture  of  an  image  region  matches  the 
expected  color  or  texture  of  the  object. 

3.2  Recognition  Graphs 

Interpretation  strategies  are  represented  in  SLS  as  gener¬ 
alized  multi-level  decision  trees  called  recognition  graphs 
that  direct  both  hypothesis  formation  and  hypothesis 
verification,  as  shown  in  Figure  1.  The  premise  behind 
the  formalism  is  that  object  recognition  is  a  series  of 
small verification tasks interleaved with representational
transformations.  Recognition  begins  with  trying  to  ver¬ 
ify  hypotheses  at  a  low  level  of  abstraction,  separating 
to  the  extent  possible  hypotheses  that  are  reliable  from 
those  that  are  not.  Verified  hypotheses  (or  at  least,  hy¬ 
potheses  that  have  not  been  rejected)  are  then  trans¬ 
formed  to  a  higher  level  of  abstraction,  where  a  new 
verification  process  takes  place.  The  cycle  of  verification 
followed  by  transformation  continues  until  hypotheses 
are  verified  at  the  goal  level  of  abstraction  (as  specified 
by  the  user),  or  until  all  hypotheses  have  been  rejected. 

The  structure  of  the  recognition  graph  reflects  the 
verification/transformation  cycle.  Each  level  of  the 
recognition  graph  is  a  decision  tree  that  controls  hypoth¬ 
esis  verification  at  one  level  of  abstraction  by  invoking 
VKSs  to  gather  support  for  or  against  each  hypothesis. 
When  the  decision  tree  determines  that  a  hypothesis  is 
reliable,  a  GKS  transforms  it  to  another  level  of  abstrac¬ 
tion,  where  the  process  repeats  itself. 

As  defined  in  the  field  of  operations  research,  deci¬ 
sion  trees  are  a  form  of  state-space  representation  com¬ 
posed  of  alternating  choice  states  and  chance  states. 
When  searching  for  a  path  from  the  start  state  to  a  goal 
state, an agent is only allowed to choose where to go next
from a choice state. If the current state is a chance state
the next state is selected probabilistically.¹ The search
process is therefore similar to using a game tree against
a probabilistic opponent.

¹ Operations research terminology is based on trees rather
than spaces, so it refers to choice nodes and chance nodes
rather than choice states and chance states, and to leaf nodes
and root nodes rather than goal states and start states.

Figure 1: A recognition graph. Levels of the graph are decision
trees that verify hypotheses using VKSs. Hypotheses
that reach a subgoal are transformed to the next level of abstraction
by a GKS.

In  SLS,  the  choice  states  are  hypothesis  knowledge 
states  as  represented  by  sets  of  hypothesis  feature  val¬ 
ues.  The  choice  to  be  made  at  each  knowledge  state  is 
which  VKS  (if  any)  to  execute  next.  Chance  states  in 
the  tree  represent  VKS  applications,  where  the  chance  is 
on  which  value  the  VKS  will  return.  Hypothesis  verifica¬ 
tion  is  an  alternating  cycle  in  which  the  control  strategy 
selects  which  VKS  to  invoke  next  (i.e.,  which  feature  to 
compute), and the VKS probabilistically returns a feature
value. Thus hypotheses advance from knowledge
states  to  VKS  application  states  and  then  on  to  new 
knowledge  states.  The  cycle  continues  for  each  hypoth¬ 
esis  until  it  reaches  a  subgoal  state,  indicating  that  it 
has  been  verified  and  should  be  transformed  to  a  higher 
level  of  abstraction,  or  a  failure  state,  indicating  that  the 
hypothesis  is  unreliable  and  should  be  rejected. 
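
The verification cycle at a single level of abstraction can be sketched in a few lines of Python. This is an illustrative reading of the description above, not SLS code: the function `verify`, the table-style `policy`, the two toy VKSs, and the feature thresholds are all hypothetical. A knowledge state is modeled as a frozen set of (VKS name, returned value) pairs.

```python
SUBGOAL, FAILURE = "subgoal", "failure"


def verify(hypothesis, policy, vkss):
    """Run one hypothesis through one level of a recognition graph.

    policy maps each knowledge state either to the name of the VKS to
    apply next or to a terminal marker (SUBGOAL or FAILURE).
    """
    state = frozenset()  # start state: no features computed yet
    while True:
        action = policy[state]
        if action in (SUBGOAL, FAILURE):
            return action, state
        value = vkss[action](hypothesis)    # chance step: the VKS returns a value
        state = state | {(action, value)}   # advance to the new knowledge state


# Two toy VKSs that return discrete evidence values.
vkss = {
    "color": lambda h: "good" if h["greenness"] > 0.5 else "poor",
    "texture": lambda h: "good" if h["roughness"] > 0.3 else "poor",
}

# A learned strategy written out as a table: one choice per knowledge state.
policy = {
    frozenset(): "color",
    frozenset({("color", "poor")}): FAILURE,
    frozenset({("color", "good")}): "texture",
    frozenset({("color", "good"), ("texture", "poor")}): FAILURE,
    frozenset({("color", "good"), ("texture", "good")}): SUBGOAL,
}

outcome, final_state = verify({"greenness": 0.8, "roughness": 0.6}, policy, vkss)
```

Because the policy fixes one VKS per knowledge state, no control decision is made at run time; a hypothesis that fails the color test is rejected without the texture VKS ever being run.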

In  general,  SLS  learns  in  advance  what  VKS  to 
choose  at  each  knowledge  state  in  order  to  avoid  mak¬ 
ing  run-time  control  decisions.  As  a  result,  when  SLS 
builds  a  recognition  graph  it  leaves  just  one  option  at 
each  choice  node.  Sometimes,  however,  the  readiness  of 
a VKS to be executed cannot be determined until run-time,
in which case SLS will leave several options at a
choice node, sorted in order of desirability.² At run-time
the system will choose the highest-ranking VKS that is
ready to be executed.

² This is just one of many complications that arise from
multiple-argument knowledge sources. In general, we will

3.3  The  Schema  Learning  System 

The  Schema  Learning  System  (SLS)  constructs  interpre¬ 
tation  strategies  represented  as  recognition  graphs.  SLS 
is  given  (1)  a  set  of  parameterized  knowledge  sources; 
(2)  a  set  of  user-interpreted  training  images;  and  (3)  a 
goal,  in  terms  of  a  target  representation  and  the  required 
accuracy.  It  produces  a  recognition  strategy  that  mini¬ 
mizes  the  expected  cost  of  achieving  the  goal. 

3.3.1  Exploration 

SLS learns recognition strategies through a three-step
process of exploration, learning from examples, and
optimization.  The  first  step,  exploration,  is  algorithmi¬ 
cally  the  least  interesting.  It  exhaustively  applies  all 
available  knowledge  sources  (both  VKSs  and  GKSs)  to 
the  training  images  in  order  to  estimate  the  expected 
cost  of  each  knowledge  source  (measured  as  execution 
time)  and  the  probability  of  each  VKS  result.  The  ex¬ 
haustive  exploration  phase  also  produces  as  many  cor¬ 
rect  interpretations  as  possible  with  the  existing  GKSs 
to  serve  as  examples  for  the  second  phase  of  learning.  At 
this  point  computational  efficiency  is  unimportant,  since 
the  goal  is  run-time  efficiency. 
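
The statistics gathered during exploration can be sketched as follows. This is a simplified illustration under stated assumptions: the function `summarize`, the log format, and the knowledge source names are hypothetical, and the result probabilities are estimated per VKS only, ignoring any dependence on the knowledge state in which the VKS is applied.

```python
from collections import Counter, defaultdict


def summarize(runs):
    """Estimate mean cost per KS and result frequencies per VKS.

    runs: iterable of (ks_name, seconds, result), where result is the
    discrete value returned by a VKS, or None for a GKS.
    """
    times = defaultdict(list)
    outcomes = defaultdict(Counter)
    for ks, seconds, result in runs:
        times[ks].append(seconds)
        if result is not None:  # only VKSs return an evidence value
            outcomes[ks][result] += 1

    cost = {ks: sum(ts) / len(ts) for ks, ts in times.items()}
    prob = {ks: {value: n / sum(counts.values())
                 for value, n in counts.items()}
            for ks, counts in outcomes.items()}
    return cost, prob


# A made-up exploration log over the training images.
runs = [
    ("color-match", 1.0, "good"),
    ("color-match", 3.0, "good"),
    ("color-match", 2.0, "poor"),
    ("color-match", 2.0, "good"),
    ("stereo-match", 10.0, None),
    ("stereo-match", 14.0, None),
]
cost, prob = summarize(runs)
```

The resulting `cost` and `prob` tables are the kind of inputs the later optimization step needs: an expected execution time for every knowledge source and a distribution over the values each VKS returns.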

3.3.2  Learning  from  Examples 

SLS’s second step looks at the correct interpretations
produced during exploration and infers from them
a scheme for generating good hypotheses while minimizing
the number of false hypotheses by tracing back the
GKSs employed to produce each good hypothesis. For
example,  a  correct  3D  pose  hypothesis  might  be  gener¬ 
ated  by  fitting  a  plane  to  a  set  of  3D  line  segments.  If 
so, the pose hypothesis is dependent on the plane fitting
GKS.  It  is  also  dependent  on  whatever  GKS  created  the 
3D  line  segments,  and  any  GKSs  needed  to  create  its 
arguments,  etc.  The  result  of  tracing  back  a  hypothe¬ 
sis’  dependencies  is  an  AND/OR  tree  like  the  one  shown 
in Figure 2. ‘AND’ nodes in the tree result from GKSs
that  require  multiple  arguments,  such  as  stereo  match¬ 
ing.  ‘OR’  nodes  in  the  tree  occur  when  a  hypothesis 
is  redundantly  generated  by  more  than  one  GKS  (or  a 
single  GKS  applied  to  alternate  hypotheses). 
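
Tracing back dependencies can be sketched as a short recursive function. The representation is an assumption made for illustration: `producers` is a hypothetical log recording which GKS applications created each hypothesis during exploration, hypotheses with no entry are treated as raw inputs, and the hypothesis and GKS names are invented. The log is assumed to be acyclic.

```python
def dependency_tree(hyp, producers):
    """Build the AND/OR dependency tree for one hypothesis.

    producers maps a hypothesis name to a list of (gks_name, arguments)
    pairs, one per GKS application that generated it. A leaf is a
    (gks_name, hypothesis) pair; an inner node is ("AND", children) or
    ("OR", children). Raw inputs such as images yield None.
    """
    options = []
    for gks, args in producers.get(hyp, []):
        leaf = (gks, hyp)  # this GKS application itself
        subtrees = [t for t in (dependency_tree(a, producers) for a in args)
                    if t is not None]
        # The application depends on itself AND on everything its arguments need.
        options.append(("AND", [leaf] + subtrees) if subtrees else leaf)
    if not options:
        return None
    # Several ways of producing the same hypothesis form an OR node.
    return options[0] if len(options) == 1 else ("OR", options)


producers = {
    "pose-10": [("plane-fit", ["3D-lineset-1"])],
    "3D-lineset-1": [("stereo-match", ["lines-left", "lines-right"])],
    "lines-left": [("line-extract", ["image-left"])],
    "lines-right": [("line-extract", ["image-right"])],
}
tree = dependency_tree("pose-10", producers)
```

Here the two-argument stereo matcher produces an AND over both line sets; an OR node would appear if the log recorded a second way of producing the same hypothesis.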

Each  dependency  tree  is  viewed  as  an  example  of 
how  correct  hypotheses  are  created.  The  example  is  gen¬ 
eralized  by  replacing  the  hypotheses  in  the  tree  with  their 
feature vectors. In other words, instead of viewing Figure 2
as showing how pose-10 was created during training,
we interpret it as an example showing how poses can
be created by a specific GKS (e.g. the GKS for fitting
planes to lines) when applied to hypotheses with specific
features (in this case, the feature values of 3D-lineset-1).

describe SLS as if all KSs took just one argument in order
to keep the description brief; see [7] for a more complete
description.

Figure 2: An example of a dependency tree showing the
different  ways  that  one  correct  pose  hypothesis  can  be  created 
during  training. 


Since  the  goal  is  to  learn  a  strategy  that  will  gener¬ 
ate  (and  later  verify)  all  instances  of  an  object  or  object 
class, SLS collects the dependency trees of all the correct
hypotheses into a single multi-sample tree by ANDing
their root nodes together. By definition, any set of
conditioned  GKSs  (i.e.  GKSs  with  specific  feature  val¬ 
ues  as  preconditions  to  the  arguments  derived  from  the 
training  set)  that  satisfies  this  tree  will  generate  all  the 
correct  hypotheses  over  the  training  images.  However, 
there  is  no  reason  to  believe  that  such  a  set  of  GKSs  will 
generate  only  correct  hypotheses;  it  will  generate  incor¬ 
rect  ones  as  well.  Therefore,  SLS’s  job  in  step  two  is  to 
find  a  set  of  (conditioned)  GKSs  that  satisfies  the  multi¬ 
sample  dependency  tree  while  minimizing  the  number  of 
incorrect  hypotheses  generated. 

SLS  finds  the  optimal  set  of  generation  knowledge 
sources  (GKSs)  by  converting  the  multi-sample  depen¬ 
dency  tree  into  disjunctive  normal  form  (DNF)  and  se¬ 
lecting  the  conjunctive  subterm  that  generates  the  fewest 
incorrect  hypotheses.  Because  of  the  way  the  tree  was 
constructed,  the  GKSs  in  any  subterm  are  sufficient  to 
generate  correct  goal-level  hypotheses  for  every  object 
instance  in  the  training  set. 

The  AND/OR  dependency  tree  is  converted  into 
DNF  by  a  standard  algorithm  that  first  converts  its  sub¬ 
trees  to  DNF  and  then  either  merges  the  subterms  (if  the 
root  is  an  OR  node)  or  takes  the  symbolic  cross-product 
of the subterms (if the root is an AND node). SLS, however,
is designed to find just the minimal term of the
resulting  DNF  expression;  as  a  result,  any  time  during 
the  conversion  process  that  a  DNF  has  two  subterms  one 
of  which  is  a  logical  superset  of  the  other,  the  superset 
term  can  be  pruned  from  the  expression. 
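
The conversion with superset pruning can be sketched in Python as below. The tree encoding (string literals, `("AND", children)` and `("OR", children)` tuples), the function names, and the additive false-alarm score used to pick a subterm are assumptions made for this example; the text says only that the chosen subterm is the one generating the fewest incorrect hypotheses.

```python
from itertools import product


def prune(terms):
    """Drop every term that is a strict superset of another term."""
    terms = set(terms)
    return {t for t in terms if not any(other < t for other in terms)}


def to_dnf(node):
    """Convert an AND/OR tree to DNF as a set of frozensets of literals."""
    if isinstance(node, str):
        return {frozenset([node])}
    op, children = node
    parts = [to_dnf(child) for child in children]
    if op == "OR":
        # Merge the subterms of the children.
        merged = set().union(*parts)
    elif op == "AND":
        # Symbolic cross-product: one subterm from each child, unioned.
        merged = {frozenset().union(*combo) for combo in product(*parts)}
    else:
        raise ValueError(f"unknown node type: {op!r}")
    return prune(merged)


def cheapest(terms, false_alarms):
    """Pick the subterm with the fewest false alarms (assumed additive)."""
    return min(terms, key=lambda term: sum(false_alarms[g] for g in term))


tree = ("AND", [("OR", ["stereo", ("AND", ["stereo", "group"])]), "plane-fit"])
dnf = to_dnf(tree)
```

Pruning at every level, rather than once at the end, keeps the intermediate expressions small. Under an additive score with non-negative counts, a superset term can never beat the term it contains, which is why discarding it is safe.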

Readers  may  note  that  converting  an  arbitrary 
AND/OR tree to DNF is an exponentially expensive process:
in the worst case, a tree with N literals can produce
a DNF expression with $2^N$ subterms. In the case of SLS,
the pruning condition reduces the worst-case complexity
to $\binom{N}{\lfloor N/2 \rfloor}$, but this is still exponential. The worst-case
analysis, however, is largely inappropriate because
it corresponds to random data. As long as the samples
in the training set are visually similar (and as long as not
all of the knowledge sources produce random results) the
worst case will never arise.
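
The pruned bound matches a standard counting result (Sperner's theorem): once no surviving subterm contains another, the subterms form an antichain of subsets of the N literals, and the largest antichain has one subset for each way of choosing half the literals. As a worked illustration, with N = 10 chosen only for this example:

```latex
2^{10} = 1024,
\qquad
\binom{10}{5} = 252,
\qquad
\binom{N}{\lfloor N/2 \rfloor} \sim \frac{2^{N}}{\sqrt{\pi N / 2}}
\quad (N \to \infty).
```

Pruning therefore lowers the worst case only by a factor on the order of the square root of N, which is why the bound remains exponential.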

3.3.3  Optimization 

As  was  stated  earlier,  recognition  graphs  interleave 
verification  and  transformation,  using  VKSs  to  gather 
evidence  to  verify  or  reject  hypotheses,  and  GKSs  to 
transform them to higher levels of abstraction. By building
a dependency tree from the training samples, converting
it to DNF and picking the minimal subterm
(measured  by  the  number  of  incorrect  hypotheses  gen¬ 
erated),  SLS  learned  which  GKSs  to  include  in  a  strat¬ 
egy  that  generates  correct  hypotheses  while  minimizing 
the  number  of  false  alarms  (and  presumably  cost).  Just 
as  important,  it  learned  what  evidence  to  require  of  a 
hypothesis  before  it  should  be  transformed.  The  sub¬ 
terms  of  the  DNF  expression  are  GKSs  constrained  to 
be  applied  to  hypotheses  with  specific  sets  of  features, 
and these feature sets are the subgoals of the recognition
process. 

In  the  third  step  of  the  algorithm,  SLS  optimizes 
recognition  by  building  decision  trees  for  each  level  of 
abstraction  that  minimize  the  expected  cost  of  reaching 
a  subgoal  or,  conversely,  of  deciding  that  a  hypothesis 
cannot  satisfy  any  subgoal  and  should  be  rejected.  This 
is  achieved  at  each  level  by  first  laying  out  the  graph 
of  all  possible  sequences  of  knowledge  states  and  VKS 
applications  and  then  pruning  it  to  leave  just  the  tree 
that  minimizes  the  expected  cost. 

For  each  level  of  abstraction,  the  initial  graph  lay¬ 
out  begins  with  a  start  state.  VKS  applications  are 
added  for  every  VKS  that  can  be  applied  to  a  hypothe¬ 
sis  in  the  start  state,  and  these  VKS  applications  lead  to 
new  knowledge  states,  which  in  turn  have  more  VKS  ap¬ 
plications  attached  to  them,  and  so  on.  The  expansion 
of  the  graph  continues  until  it  reaches  either  a  subgoal 
knowledge  state  or  a  knowledge  state  that  is  incompati¬ 
ble  with  every  remaining  subgoal  (i.e.  a  failure  state). 

Once  the  initial  graph  has  been  laid  out,  SLS  begins 
to  prune  it  by  working  backwards  from  the  subgoal  and 
failure  nodes  toward  the  start  state.  At  each  VKS  appli¬ 
cation  node  it  calculates  the  expected  cost  of  reaching  a 
subgoal or failure node from that particular application
node. At each knowledge state, it finds which of the possible
VKS applications has the lowest expected cost and
removes the other VKSs from the list of candidates (in
the event that the optimal VKS might not be executable
at run time, it sorts the remaining VKSs in order of least
to greatest expected cost rather than removing them).

More formally, we refer to the subgoal states and
the failure states at one level of a recognition graph as
the terminal states for that level. The cost of promoting
a hypothesis from knowledge state n to a terminal state
is called the Expected Decision Cost (EDC) of state n,
and the expected cost of reaching a terminal state from
state n using VKS k is the Expected Path Cost (EPC)
of n and k. Since verification KSs return discrete values,
we refer to the possible outcomes of a verification KS k
as $R(k)$, and the probability of a particular value $e$ being
returned as $P(e \mid n, k)$, $e \in R(k)$.

The EDCs of knowledge states can be calculated
starting  with  the  terminal  states  and  working  backwards 
through  the  recognition  graph.  Clearly,  the  EDC  of  a 
subgoal  or  failure  state  is  zero: 

$$EDC(n) = 0, \quad n \in \{\text{terminal states}\}.$$

The  expected  path  cost  of  reaching  a  terminal  state 
using  a  particular  VKS  is: 

$$EPC(n, k) = C(k) + \sum_{e \in R(k)} \bigl( P(e \mid n, k) \times EDC(n \cup e) \bigr)$$

where  n  is  the  knowledge  state  expressed  as  a  set  of 
feature  values,  n  U  e  is  the  knowledge  state  that  results 
from VKS k returning feature value e, and C(k) is the
estimated  cost  of  applying  k. 

The  EDC  of  a  knowledge  state,  then,  is  the  smallest 
EPC  of  the  knowledge  sources  that  can  be  executed  at 
that  state: 

$$EDC(n) = \min_{k \in KS(n)} EPC(n, k)$$

where KS(n) is the set of VKSs applicable at node n.

The  equations  above  establish  a  mutually  recursive 
definition  of  the  expected  decision  cost  of  a  knowledge 
state.  The  EDC  of  a  knowledge  state  is  the  EPC  of  the 
optimal  VKS  application  at  the  state;  the  EPC  of  a  VKS 
application  is  the  expected  cost  of  applying  the  VKS  plus 
the  expected  remaining  EDC  after  the  VKS  has  been 
applied.  The  recursion  bottoms  out  at  terminal  nodes, 
whose  EDC  is  zero.  Since  every  path  through  the  object 
recognition  graph  ends  at  either  a  subgoal  or  a  failure 
node,  the  recursion  is  well  defined.  Furthermore,  since 
the  EDC  of  a  level’s  start  state  estimates  the  expected 
cost  of  verifying  a  hypothesis  at  that  level  of  abstraction, 
the  EDCs  of  all  the  start  states  can  be  combined  with 
estimates  of  the  number  of  hypotheses  generated  at  each 
level  to  estimate  the  expected  run-time  of  the  strategy 
as  a  whole. 
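
The mutual recursion between EDC and EPC can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the two VKSs, their costs, their outcome probabilities and the subgoal are invented for the example and do not come from SLS, and the probabilities are taken to be independent of the knowledge state for simplicity.

```python
from functools import lru_cache

# Hypothetical toy problem (not from the paper): two VKSs, each measuring one
# feature and returning a discrete value with an assumed probability.
VKS = {
    "edges":   {"cost": 2.0, "outcomes": {"edges+": 0.7, "edges-": 0.3}},
    "texture": {"cost": 5.0, "outcomes": {"tex+": 0.9, "tex-": 0.1}},
}
SUBGOAL = frozenset({"edges+", "tex+"})


def is_terminal(state):
    """Terminal = subgoal reached, or a negative value rules the subgoal out."""
    return SUBGOAL <= state or any(value.endswith("-") for value in state)


def applicable(state):
    """KS(n): the VKSs whose feature has not been measured in this state."""
    return [k for k, spec in VKS.items() if not state & set(spec["outcomes"])]


def epc(state, k):
    """EPC(n, k) = C(k) + sum over e in R(k) of P(e | n, k) * EDC(n U e)."""
    spec = VKS[k]
    return spec["cost"] + sum(
        p * edc(state | {e}) for e, p in spec["outcomes"].items()
    )


@lru_cache(maxsize=None)
def edc(state):
    """EDC(n): zero at terminal states, otherwise the smallest EPC over KS(n)."""
    if is_terminal(state):
        return 0.0
    return min(epc(state, k) for k in applicable(state))


def ranked_vks(state):
    """Applicable VKSs ordered from least to greatest expected path cost."""
    return sorted(applicable(state), key=lambda k: epc(state, k))


start = frozenset()
print(edc(start), ranked_vks(start))
```

In this toy graph the cheap but less reliable "edges" VKS is ranked first, because a negative result ends the search before the expensive VKS is ever run. Memoizing `edc` plays the role of working backwards from the terminal states: each knowledge state is costed once.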

3.4  Experimental  Results 

The  previous  sections  give  a  simplified  description  of 
a  complex  system  that  has  only  recently  been  imple¬ 
mented.  Because  the  system  is  new,  complete  and  thor¬ 
ough  experiments  testing  its  success  both  as  a  knowledge 
engineering  tool  and  as  a  machine  learning  system  are 
only  now  underway;  in  this  section  we  report  the  results 
of  one  such  experiment. 


Figure  3:  The  first  of  twenty  pictures  taken  of  the  Marcus 
Engineering  building. 


The  goal  of  the  experiment  was  to  test  SLS  within 
the  scenario  of  learning  to  accurately  recognize  the  pose 
of  a  complex  object  from  an  approximately  known  view¬ 
point;  other  experiments  are  testing  its  ability  to  perform 
2D recognition and to perform 3D recognition from arbitrary
viewpoints. The image data for the experiment
were  twenty  images  of  the  Marcus  Engineering  building 
on  the  UMass  campus  like  the  ones  shown  in  Figures  3 
and  4,  taken  along  a  dirt  path  at  distances  ranging 
from  three  to  four  hundred  feet  from  the  building.  The 
pictures  were  taken  with  the  image’s  y-axis  parallel  to 
gravity  (i.e.  with  zero  tilt  and  roll),  however  there  were 
still  rotations  (pan)  from  one  image  to  the  next,  so  that 
the pose of the building has four free parameters, three
locational  and  one  rotational.  The  goal  was  to  learn  a 
strategy  that  could  identify  the  pose  of  the  building  to 
within  10°  rotation  (pan),  5%  depth  (scale)  and  1°  of 
the correct image angle (the angle from the focal point
of  the  camera  to  the  object). 

The  knowledge  sources  available  to  SLS  included 
a  geometric  matcher  for  comparing  wire-frame  models 
to image data ([1]), perspective analysis routines for estimating
orientations ([4, 10]), a line grouping system
([13]),  a  pattern  classification  technique  ([2])  and  tem¬ 
plate  matching  routines.  It  was  also  given  knowledge 
sources  for  checking  domain  constraints  such  as  distance 
from  an  object  to  the  camera  or  the  height  of  an  object 
above  (or  below)  the  camera  plane. 

Figure 4: The last Marcus Engineering image.

SLS was tested by a “leave one out” scheme in
which strategies were trained on nineteen images and
tested on the twentieth. SLS strategies were able to
generate correct 3D pose hypotheses in nineteen of the
twenty tests (in the twentieth case it generated no pose
hypotheses at all). It was easier to generate goal-level
hypotheses than to verify them, however, since none of the
available VKSs were sensitive to small changes in rotation
or scale. At the goal level, strategies compute probabilities
of correctness by sampling training hypotheses
rather  than  absolutely  accepting  or  rejecting  each  hy¬ 
pothesis;  Table  1  shows  the  errors  in  pan,  scale  and  im¬ 
age  angle  for  the  most  probable  pose  returned  for  each 
image,  while  Figure  5  shows  the  projection  of  the  most 
likely  pose  found  for  the  image  in  Figure  3.  (When  sev¬ 
eral  pose  hypotheses  were  tied  for  the  highest  confidence 
value  we  averaged  the  absolute  values  of  their  errors  in 
Table  1).  In  sixteen  of  the  twenty  tests,  the  most  prob¬ 
able  pose  was  within  the  tolerance  thresholds  set  by  the 
user.  In  three  of  the  tests,  the  most  probable  pose  was 
just outside the thresholds, by either a fraction of a degree
in pan (10.11° for image 2, 10.39° for image 16)
or  a  percent  in  scale  (6.06%  for  image  20).  As  stated 
above,  no  poses  at  all  were  generated  in  the  twentieth 
case  (image  6). 

4  Conclusion 

It  is  generally  accepted  that  the  timely  and  appropriate 
use  of  relevant  knowledge  can  substantially  reduce  the 
search  encountered  in  establishing  ’instance-of’  relation¬ 
ships  between  image  data  and  its  interpretation(s).  Un¬ 
fortunately,  the  problem  of  how  to  acquire  and  structure 
knowledge  has  limited  most  knowledge-based  vision  sys¬ 
tems  to  highly  constrained  domains.  The  schema  learn¬ 
ing  system  (SLS)  is  an  experimental  system  that  learns 
how  to  recognize  objects  from  training  images.  At  the 
moment, SLS still requires a human to supply it with parameterized
knowledge sources and a set of interpreted
training images. The previously time-consuming process
of supplying control knowledge, however, has been
replaced by a system that automatically learns control
strategies that minimize the expected cost of recognition.

 #    Prob    Pan(°)    Scale(%)    ImAngle(°)
 1    0.78     4.95      1.18        0.13
 2    0.82    10.11      2.08        0.03
 3    0.76     3.83      0.74        0.07
 4    0.77     1.76      1.76        0.14
 5    0.77     4.92      4.16        0.13
 6     -        -         -           -
 7    0.76     4.17      2.42        0.21
 8    0.76     2.95      0.60        0.09
 9    0.77     1.54      2.07        0.04
10    0.77     4.48      2.24        0.25
11    0.79     3.27      4.72        0.08
12    0.76     2.88      1.73        0.05
13    0.78     3.58      4.01        0.20
14    0.78     0.07      2.33        0.08
15    0.78     1.16      2.84        0.05
16    0.88    10.39      0.32        0.15
17    0.77     2.33      2.64        0.10
18    0.78     1.40      1.95        0.13
19    0.79     1.33      4.25        0.12
20    1.00     9.62      6.06        0.16

Table 1: The errors between the most probable pose hypothesis
and the true pose for each test image. Pan refers to difference
in rotation about the gravitational axis measured in
degrees, scale to the distance from the camera to the object
measured as a percentage of the true distance, and image
angle to the difference in angle between the rays from the
camera to the object. The user’s tolerance thresholds were
5% scale, 10° pan and 1° image angle.
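
The counts quoted in Section 3.4 can be checked against Table 1 with a short script. This is a reader's sanity check rather than part of SLS; it assumes that the two data rows printed beside the labels 5, 6 and 7 belong to images 5 and 7, since the text states that no pose was generated for image 6.

```python
# Errors from Table 1: image -> (pan in degrees, scale in %, image angle in degrees).
# Image 6 is absent because no pose hypothesis was generated for it.
ERRORS = {
    1: (4.95, 1.18, 0.13), 2: (10.11, 2.08, 0.03), 3: (3.83, 0.74, 0.07),
    4: (1.76, 1.76, 0.14), 5: (4.92, 4.16, 0.13), 7: (4.17, 2.42, 0.21),
    8: (2.95, 0.60, 0.09), 9: (1.54, 2.07, 0.04), 10: (4.48, 2.24, 0.25),
    11: (3.27, 4.72, 0.08), 12: (2.88, 1.73, 0.05), 13: (3.58, 4.01, 0.20),
    14: (0.07, 2.33, 0.08), 15: (1.16, 2.84, 0.05), 16: (10.39, 0.32, 0.15),
    17: (2.33, 2.64, 0.10), 18: (1.40, 1.95, 0.13), 19: (1.33, 4.25, 0.12),
    20: (9.62, 6.06, 0.16),
}
TOLERANCE = (10.0, 5.0, 1.0)  # pan, scale, image angle thresholds set by the user


def outside_tolerance(errors, tolerance):
    """Return the images whose most probable pose exceeds any threshold."""
    return sorted(
        image for image, err in errors.items()
        if any(value > limit for value, limit in zip(err, tolerance))
    )


outside = outside_tolerance(ERRORS, TOLERANCE)
within = len(ERRORS) - len(outside)
print(outside, within)
```

The script reports images 2, 16 and 20 as outside the thresholds and sixteen images within them, which agrees with the counts given in the text.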
Abstract

In  this  paper,  we  discuss  techniques  for  extending 
the sensor planning capabilities of the “MVP” (Machine
Vision Planning) system to include motion in
a well-known environment. In a typical work cell,
vision  sensors  are  needed  to  monitor  a  task  and  pro¬ 
vide  feedback  to  motion  control  programs  or  to  as¬ 
sess  task  completion  or  failure.  In  planning  sensor 
locations  and  parameters  for  such  a  work-cell,  all 
motion  in  the  environment  must  be  taken  into  ac¬ 
count  in  order  to  avoid  occlusions  of  desired  features 
by  moving  objects  and,  in  the  case  where  the  fea¬ 
tures  to  be  monitored  are  being  manipulated  by  the 
robot,  to  insure  that  the  features  are  always  within 
the  camera’s  view.  Several  different  sensor  locations 
(or  a  single,  movable  sensor)  may  be  required  in  or¬ 
der  to  view  the  features  of  interest  during  the  course 
of  the  task.  The  goal  is  to  minimize  the  number  of 
sensors  (or  to  minimize  the  motion  of  the  single  sen¬ 
sor)  while  guaranteeing  a  robust  view  at  all  times 
during  the  task,  where  a  robust  view  is  one  which  is 
unobstructed,  in  focus,  and  sufficiently  magnified. 
In  the  past,  sensor  planning  techniques  have  pri¬ 
marily  focused  on  static  environments.  We  present 
techniques  which  we  have  been  exploring  to  include 
knowledge  of  motion  in  the  sensor  planning  prob¬ 
lem.  Possible  directions  for  future  research  are  also 
presented. 

1  Introduction 

Recently,  there  has  been  much  research  in  the  field 
of  sensor  planning  [1,  2,  4,  5,  7,  8,  11,  12].  The  ba¬ 
sic  problem  is  that  in  setting  up  an  automated  sys¬ 
tem  for  monitoring  some  process,  the  effectiveness  of 

*This work was supported in part by DARPA contract
N00039-84-C-0166, NSF grants DMC-86-05065,
DCI-86-08845, CCR-86-12709, IRI-86-57151, IRI-88-1319, North
American Philips Laboratories, Siemens Corporation and
Rockwell International.


the  system  can  largely  be  determined  by  the  loca¬ 
tions,  types  and  configurations  of  the  sensors  used. 
To  manually  determine  these  parameters  on  a  case 
by  case  basis  may  not  be  cost  efficient  or  accurate, 
and  the  resulting  system  may  not  be  optimal  in  any 
sense. It may be better to have an automated system
for  determining  the  sensor  locations  and  parameters 
for  monitoring  a  given  task. 

To  that  end,  many  systems  have  been  and  are 
being  developed  which,  based  on  geometric  models 
of  an  environment  and  models  of  the  sensors,  can 
generate  sensor  locations  and  settings  which  provide 
a robust view of specific features so that the features
are detectable, recognizable, measurable, or
meet some other task constraints. In general, the
sensors are cameras and a robust view implies that
the  camera  must  have  an  unobstructed  view  of  the 
entire  feature  set,  which  must  lie  within  the  depth- 
of-field  of  the  camera  and  must  be  magnified  to  a 
given  specification.  Sensor  planning  systems  can 
then  generate  camera  locations,  orientations,  lens 
settings  (focus-ring  adjustment,  focal  length,  aper¬ 
ture),  and  in  some  cases  lighting  plans  to  insure  a 
robust  view  of  the  features  [10]. 

Most  of  the  methods  presented  to  date  have 
only  addressed  static  environments  such  as  would 
be  found  in  a  post-manufacturing  inspection  task 
(an exception is the VIO system of Niepold et al.
in [7]). The approach taken is to perform an off-line
analysis  of  the  geometric  and  optical  constraints  for 
a  static  environment  and,  via  a  generate-and-test  or 
a synthesis approach, give one or more sensor location
and parameter settings which are valid only
for  the  specific  static  environment  which  was  ana¬ 
lyzed.  When  objects  in  the  environment  need  to  be 
moved  for  some  reason,  new  sensor  locations  need 
to  be  computed  off-line.  This  works  well  for  quality- 
control  or  inspection  tasks  where,  for  example,  parts 
can  be  fed  to  a  specific  location  and  orientation  in 
the  environment  for  inspection. 

There  are  many  instances  where  moving  scenes 
may need to be monitored. Manufacturing or assembly
tasks can be visually guided or monitored. Telerobotic
operations need some form of sensor feedback.
For  these  and  other  tasks,  the  sensor  planning  tech¬ 
niques  presented  to  date  would  be  inadequate  due 
to  the  dynamic  nature  of  the  environments.  But  for 
these  applications,  it  would  still  be  better  to  have 
the  robotic  system  remain  in  control  of  the  sensor 
positions  and  settings  to  insure  reliable  monitoring 
rather  than  to  require  manual  control  over  the  sen¬ 
sors. 

In  this  paper,  we  describe  our  most  recent  re¬ 
search  in  extending  the  MVP  (Machine  Vision  Plan¬ 
ning)  System  [11,  12]  to  plan  sensor  locations  and 
settings  for  a  changing  environment.  After  a  brief 
overview  of  the  MVP  system,  the  remainder  of  this 
paper  will  focus  on  an  explanation  of  the  constraints 
we  impose  on  the  environment,  the  theory  behind 
our  method,  and  a  technique  which  helps  to  in¬ 
corporate  temporal  reasoning  into  spatial  problems. 
We  also  present  examples  which  show  this  tempo¬ 
ral  reasoning  working  in  conjunction  with  the  MVP 
system.  Finally,  we  present  an  overview  of  future 
work  in  this  area,  aimed  at  strengthening  our  ability 
to  reason  temporally  as  well  as  spatially  for  sensor 
planning  and  other  problems  in  robotics. 


2  Overview  of  MVP 

A  complete  description  of  the  MVP  system  is  be¬ 
yond  the  scope  of  this  paper.  In  brief,  MVP  takes 
a  constraint  based  description  of  the  vision  task  re¬ 
quirements  and  synthesizes  what  has  been  termed  a 
generalized  viewpoint,  which  is  an  eight-dimensional 
vector  incorporating  sensor  location,  orientation, 
and  lens  parameters  including  aperture  and  effec¬ 
tive  focal  length.  The  constraints  MVP  considers  in 
determining  viewpoints  are  depth-of-field,  field-of- 
view,  resolution,  and  unoccluded  visibility  [11,  12]. 

MVP contains analytical relationships for the
optical  task  constraints  (resolution,  focus,  field-of- 
view),  and  uses  3-D  solid  geometric  models  of  the  en¬ 
vironment  to  formulate  visibility  constraints.  (The 
geometric models are limited to general polyhedra,
both convex and concave; curved surfaces are
not  permitted.)  The  constraint  equations  can  be 
thought  of  as  defining  hypersurfaces  bounding  fea¬ 
sible  regions  in  the  8-dimensional  parameter  space 
of  the  generalized  viewpoint.  These  constraints 
are  combined  in  an  optimization  setting  to  produce 
a  generalized  viewpoint  which  meets  all  task  con¬ 
straints  with  as  much  margin  for  error  in  sensor 
placement  and  setting  as  possible  (i.e.,  as  far  away 
from all hypersurfaces as possible). Figures 1 and 2
show  how  MVP,  from  a  CAD  description  of  the  ob¬ 
ject  to  be  viewed  and  its  environment,  can  generate 
the  visibility  region  for  viewing  the  desired  features. 
This  region  is  calculated  to  be  the  total  volume  in 
space  from  which  the  features  are  viewable  without 
obstruction.  This  volume  is  used  in  the  optimization 
stage  of  MVP  for  finding  the  best  viewpoint. 

While  MVP’s  synthesis  approach  makes  it  better 
suited  to  the  extensions  we  present  here  than  other 
less  analytical  approaches,  there  is  no  reason  why 
the  temporal  extensions  discussed  here  can  not  be 
applied  to  other  sensor  planning  algorithms.  In  fact, 
it  is  our  hope  that  these  methods  can  be  applied  to 
3-D  planning  problems  in  general. 

3  The  Introduction  of  Motion 

In  sensor  planning  problems,  there  is  normally  a 
well-defined  set  of  target  features  which  need  to  be 
monitored.  These  might  correspond  to  a  section  of 
a  part  which  has  just  come  off  the  assembly  line 
which  the  vision  system  might  need  to  examine  for 
defects.  In  an  active  environment,  the  feature  set 
might  correspond  to  a  section  of  a  larger  assembly 
being  operated  on,  which  we  might  need  to  monitor 
during  the  operation.  For  this  work,  we  restrict  our¬ 
selves  to  consider  only  the  motion  of  obstacles  in  the 
environment,  and  not  the  motion  of  the  target.  We 
also  assume  complete  knowledge  of  the  environment 
in  the  form  of  3-D  geometric  models.  Finally,  we 
assume  knowledge  of  the  motion  of  the  obstacles  in 
advance;  unplanned  motion  may  not  take  place. 

The  most  important  result  of  these  limitations  is 
that once we have a viewpoint* which is valid for a
given instant in time t_n, it is guaranteed to be valid
with respect to all optical constraints for all times
t_m for m > n. This is fairly simple to show. Once
we  have  a  properly  magnified  and  focused  view,  if 
neither  the  target  nor  the  camera  move,  the  object 
remains  in  proper  focus  with  an  unvarying  magnifi¬ 
cation.  The  movement  of  obstacles  means  that  only 
occlusion  needs  to  be  detected  at  later  times. 

One  other  point  worth  mentioning  is  that  the 
static  sensor  planning  problem  does  not  require  the 
camera  to  be  attached  to  a  robot.  For  the  dynamic 
sensor  planning  problem,  the  camera  must  be  mov¬ 
able  from  one  viewpoint  to  another  during  the  course 
of  the  task  being  monitored  in  order  to  maintain  a 
robust  view  of  the  target. 

*Here,  and  elsewhere  in  this  paper,  when  we  refer  to  a 
viewpoint  we  are  actually  referring  to  the  generalized  view¬ 
point  mentioned  earlier. 
To summarize, the exact problem we are dealing
with  is  one  in  which  an  accurately  movable  camera 
is  being  used  to  monitor  a  task.  In  this  task,  the 
actual  target  we  are  monitoring  does  not  move,  but 
other  objects  in  the  environment,  such  as  a  robot 
arm,  or  other  mechanical  parts,  move  in  a  way  which 
is  known  a  priori.  The  problem  is  to  find  where  to 
place  the  camera,  and  when  and  where  to  move  the 
camera,  so  that  at  all  times  during  the  task,  we  have 
a  good  viewpoint  for  monitoring  the  task. 

As  an  example,  a  spot-welding  robot  may  be 
working  on  a  stationary  object  (such  as  a  car  body) 
in  a  factory.  There  might  be  a  vision  system  used  to 
monitor the weld for defects and for accuracy. The
sensor  would  need  to  be  placed  in  such  a  way  as  to 
avoid  occlusion  by  the  many  moving  obstacles  in  the 
environment,  such  as  the  welding  arm  itself  and  any 
moving  peripherals  needed  for  this  task.  The  target 
itself,  that  is  the  area  to  be  welded,  would  remain 
stationary  for  this  task. 


4 Relation to Motion Planning

There  is  a  very  close  relationship  between  the  prob¬ 
lem  of  sensor  planning  with  moving  obstacles  and 
that  of  motion  planning  with  moving  obstacles.  The 
essence  of  both  problems  is  to  find  a  representa¬ 
tion  which  relates  the  temporal  and  spatial  aspects 
of  an  object’s  motion.  The  essential  difference  is 
that  while  path  planning  needs  to  find  a  path  which 
moves through a time-varying environment to reach
a goal without hitting any obstacles, sensor planning
is searching for one or more generalized viewpoints
which remain valid (unoccluded) for the duration of
the  task.  The  implications  of  this  difference  are  that 
we  can  not  envision  planning  a  point  that  travels 
through  configuration  space  or  configuration  space- 
time  [3,  6].  This  is  because  our  “point”  is  actually 
a  cone  with  its  apex  at  the  sensor  and  its  base  at 
the  target  polygon.  We  need  to  detect  when  obsta¬ 
cles  will  breach  this  viewing  cone,  not  just  detect 
the  collision  of  a  single  point  with  a  configuration 
space  obstacle.  The  second  difference  is  that  in  mo¬ 
tion  planning,  we  search  for  an  unobstructed  path. 
In  sensor  planning,  the  preference  is  to  remain  sta¬ 
tionary.  We  are  in  search  of  a  single  viewpoint,  if 
possible,  which  is  valid  during  the  entire  operation, 
or,  if  that  is  not  possible,  as  few  viewpoints  as  pos¬ 
sible. 


5  The  Naive  Approach 

One  way  to  handle  added  complexities  to  a  problem 
is  to  ignore  them  until  they  become  an  issue.  Taking 
this  approach  to  motion  in  sensor  planning  yields  the 
following naive algorithm:

1. Compute a viewpoint for the initial state of the
system, considering all obstacles in the environment
as they are before any motion takes place.

2. At every time interval Δt, test the current viewpoint
against the model of the changed environment.

3. If, at some instant t_n, the viewpoint is found
to be invalid due to the movement of obstacles,
compute a new viewpoint based on the current
state of the model, and go back to step 2.

The  algorithm  can  be  run  in  advance,  off-line, 
since  the  changes  the  environment  goes  through  over 
time  are  known,  so  it  would  not  even  be  necessary  to 
have  a  function  which  evaluates  or  generates  view¬ 
points  in  real  time.  The  entire  problem  can  be  simu¬ 
lated,  and  the  time  intervals  where  the  sensor  needs 
to  be  moved  and  reset  can  be  noted.  During  the  pro¬ 
cess,  the  robotic  system  can  pause,  reposition  the 
camera,  and  then  continue  its  operation. 
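
The off-line simulation of the naive algorithm might look like the following sketch. The functions `plan` and `is_valid` are hypothetical stand-ins for the viewpoint planner and the viewpoint evaluation function, and the five-slot environment is an invented toy used only to make the example runnable.

```python
def naive_schedule(plan, is_valid, num_steps, dt):
    """Simulate the naive algorithm off-line.

    plan(t) returns a viewpoint for the environment at time t;
    is_valid(viewpoint, t) tests a viewpoint against the environment at time t.
    Both are hypothetical stand-ins for the planner and the evaluation function.
    Returns the list of (time, viewpoint) pairs at which the sensor is placed.
    """
    viewpoint = plan(0.0)                      # step 1: plan for the initial state
    schedule = [(0.0, viewpoint)]
    for i in range(1, num_steps + 1):
        t = i * dt                             # step 2: sample every dt
        if not is_valid(viewpoint, t):
            viewpoint = plan(t)                # step 3: replan from the current state
            schedule.append((t, viewpoint))
    return schedule


# Toy environment (an assumption for illustration): the camera can sit in one
# of five slots, and a moving obstacle blocks slot int(t) % 5 at time t.
SLOTS = range(5)


def blocked_slot(t):
    return int(t) % 5


def is_valid(viewpoint, t):
    return viewpoint != blocked_slot(t)


def plan(t):
    return next(slot for slot in SLOTS if slot != blocked_slot(t))


schedule = naive_schedule(plan, is_valid, num_steps=4, dt=1.0)
print(schedule)
```

In this toy run the viewpoint chosen for the initial state is already blocked at the first sample, so the sensor has to be moved after a single interval. Nothing in the loop looks ahead, so a planner that ignores the known future motion has no way to avoid such short-lived choices.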

This  approach  has  several  major  drawbacks: 

• This algorithm is clearly not optimal in its use
of the viewpoint evaluation function. In a task
which takes M × Δt time to complete, the viewpoint
needs to be evaluated exactly M times.

• The algorithm makes no attempt to reduce the
number of sensor replacements required.

• A viewpoint is used up until the moment after
it has become invalid, or at least up until the
point at which the margin for error becomes
very small. As an example, say the algorithm
determines that at time t_n, the initial viewpoint
is no longer acceptable. Although the initial
viewpoint was chosen to have a
large margin for error, this error margin only
existed at time t_0; at some time before t_n, due
to errors in sensor placement, etc., the viewpoint
may already be bad.

• The accuracy of this method is dependent upon
the time interval Δt used. Δt is essentially the
sampling interval with which we test the environment.
The viewpoint may become invalid
between two samples and yet be valid at each
sample. This behavior is not desirable.

• Knowledge of future positions of the obstacles
is not used in this method. Given that the planning
and evaluation functions are fast enough,
there is no benefit to computing any information
in advance. It is conceivable that at each
time t_n, when we plan a new viewpoint,
the very next motion (or the sequence of motions
in the near future) makes the new viewpoint
invalid very quickly, perhaps even at time
t_{n+1}. This is very bad; the sensor may need to
be repositioned far too frequently.

The  problem  specification  indicates  that  we  have 
more  information  at  our  disposal  than  this  solution 
uses.  By  using  all  available  knowledge,  that  is,  our 
knowledge  of  the  temporal  as  well  as  the  spatial  as¬ 
pects  of  the  environment,  we  can  hope  to  generate 
viewpoints  which  are  valid  (by  a  larger  margin)  for 
longer  time  intervals. 


6  Temporal  Considerations 

In  order  to  make  use  of  our  knowledge  of  motion  and 
time  in  the  environment,  we  define  a  concept  which 
relates  the  geometric  orientation  of  an  object,  and 
its  motion  through  space  over  a  given  time  interval. 
The  structure  which  embodies  this  relationship  for 
a  given  object  over  a  given  time  period  is  called  a 
minimal  temporal  object.  To  illustrate  this  concept, 
note  the  polyhedron  in  figure  3.  It  is  shown  with  a 
vector  indicating  its  linear  trajectory. 

Definition 1 A minimal temporal object is the set
of all points through which a given object O passes
during its motion over a given time interval T. The
minimal temporal object representing the motion of
O over T is notated as $\hat{\mathcal{T}}(T, O)$.

Figure 4 shows the minimal temporal object for
the polyhedron in figure 3 moving in the direction
indicated. For a given object, and its trajectory and
velocity over a given time period, there is one unique
$\hat{\mathcal{T}}(T, O)$. The task of computing $\hat{\mathcal{T}}(T, O)$ is equivalent
to computing the complete volume swept by
O during its motion over T. For general objects
moving in arbitrary paths, $\hat{\mathcal{T}}(T, O)$ may be exceedingly
expensive to compute. This is why, in working
within the sensor planning framework, we have been
dealing with approximations to $\hat{\mathcal{T}}(T, O)$. We restrict
ourselves to approximations of $\hat{\mathcal{T}}(T, O)$ which meet
the following definition:


Definition 2 A temporal object is defined as any
volume which contains the set of all points through
which a given object O passes during its motion over
a given time interval T. A temporal object representing
the motion of O over time interval T is notated
as $\mathcal{T}(T, O)$.

The most important consequence of this definition
is that $\hat{\mathcal{T}}(T, O)$ is necessarily contained within
$\mathcal{T}(T, O)$. Rough methods can be developed to compute
a $\mathcal{T}(T, O)$ and use it as an approximation to
$\hat{\mathcal{T}}(T, O)$ for planning. Note that for any object
moving through a given path, $\mathcal{T}(T, O)$ is not unique
while $\hat{\mathcal{T}}(T, O)$ is.

Due to a result by Weld and Leu in [13], the volume
formed by sweeping a polyhedral object along
an arbitrary path is equivalent to the volume formed
by sweeping each face along the same path and
unioning these swept volumes together. The expensive
portion of this algorithm (in the case of translational
motion only) is in the unioning. A polyhedron
of n faces requires n boolean unions to be swept via
this method. Since, in the Temporal Sensor Planning
algorithm (which follows), it may be necessary
to compute swept volumes often, we have opted for
a faster method of computing a $\mathcal{T}(T, O)$ as opposed
to an accurate, but slower method for computing
$\hat{\mathcal{T}}(T, O)$.

We use a simple algorithm to compute a $\mathcal{T}(T, O)$
given that O moves through a linear path. This is
not particularly restrictive, since any arbitrary path
can be approximated by a piecewise linear path, and
the algorithm can be repeated over each linear segment
of a path to compute the $\mathcal{T}(T, O)$ for the whole
interval. Here, we describe the algorithm for computing
$\mathcal{T}(T, O)$ when O is a polyhedron known to
move in a linear path of length n in the v direction.

Generation of $\mathcal{T}(T, O)$

1. Calculate plane P, which is the plane defined by
the unit vector v and the point p of O in the extreme
-v direction. That is, as O moves in the
v direction, p is the point “furthest back,” and
P is the plane perpendicular to the trajectory
of O, and containing p.

2. Project all vertices of O onto P and take the
convex hull of these points, creating a polygon
s on P.

3. $\mathcal{T}(T, O)$ is the right generalized cylinder with
a linear axis parallel to v, a constant cross-section,
with a base on s and a height of n plus
h, where h is the overall length of O in the v
direction.
Figure 3: Polyhedron floating in space with direction vector


Figure  4:  The  unique  minimal  temporal  object 


This algorithm is illustrated in figure 5.
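
The three generation steps can be sketched in plain Python as follows. This is an illustrative reading of the steps rather than the authors' implementation: the function names, the dictionary used to represent the resulting prism, and the `contains` helper are all assumptions, and degenerate inputs (for example, vertices that project onto a single line) are not handled.

```python
import math


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def cross(a, b):
    return (a[1] * b[2] - a[2] * b[1],
            a[2] * b[0] - a[0] * b[2],
            a[0] * b[1] - a[1] * b[0])


def unit(a):
    norm = math.sqrt(dot(a, a))
    return tuple(x / norm for x in a)


def plane_basis(v):
    """Two unit vectors spanning the plane perpendicular to unit vector v."""
    helper = (1.0, 0.0, 0.0) if abs(v[0]) < 0.9 else (0.0, 1.0, 0.0)
    u = unit(cross(v, helper))
    return u, cross(v, u)


def convex_hull(points):
    """Andrew's monotone chain; returns the hull in counter-clockwise order."""
    pts = sorted(set(points))
    if len(pts) <= 2:
        return pts

    def turn(o, a, b):
        return (a[0] - o[0]) * (b[1] - o[1]) - (a[1] - o[1]) * (b[0] - o[0])

    def half(seq):
        chain = []
        for p in seq:
            while len(chain) >= 2 and turn(chain[-2], chain[-1], p) <= 0:
                chain.pop()
            chain.append(p)
        return chain[:-1]

    return half(pts) + half(reversed(pts))


def temporal_object(vertices, direction, length):
    """Prism enclosing a polyhedron translated by `length` along `direction`."""
    v = unit(direction)
    u, w = plane_basis(v)
    depths = [dot(p, v) for p in vertices]
    back = min(depths)                      # step 1: plane P through the rearmost point
    h = max(depths) - back                  # overall length of O along v
    base = convex_hull([(dot(p, u), dot(p, w)) for p in vertices])  # step 2
    return {"v": v, "u": u, "w": w, "back": back,
            "height": length + h, "base": base}                     # step 3


def contains(prism, point, eps=1e-9):
    """True if `point` lies inside the prism (boundary included)."""
    depth = dot(point, prism["v"]) - prism["back"]
    if depth < -eps or depth > prism["height"] + eps:
        return False
    q = (dot(point, prism["u"]), dot(point, prism["w"]))
    s = prism["base"]
    for a, b in zip(s, s[1:] + s[:1]):
        if (b[0] - a[0]) * (q[1] - a[1]) - (b[1] - a[1]) * (q[0] - a[0]) < -eps:
            return False
    return True
```

For a unit cube moving three units along the x axis, `temporal_object` yields a prism of height four whose base is the unit square. Every vertex of the cube stays inside it at every point along the path, which is the containment property the approximation is meant to guarantee.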

Theorem 1 Any volume created by the above algorithm
necessarily contains $\hat{\mathcal{T}}(T, O)$.

Proof:

1. By step 3 above, all perpendicular cross sections
of $\mathcal{T}(T, O)$ are the same.

2. By step 2, the cross sections must be polygons
wholly containing any perpendicular cross section
of $\hat{\mathcal{T}}(T, O)$.

3. The overall length of $\mathcal{T}(T, O)$ is at least as long
as that of $\hat{\mathcal{T}}(T, O)$ by the final step.

4. These three conditions show the total inclusion
of $\hat{\mathcal{T}}(T, O)$ by $\mathcal{T}(T, O)$.

The key to using temporal objects for sensor
planning (or, in fact, for any collision avoidance
problem) is that in planning around an obstacle
given by $\mathcal{T}(T, O)$, you guarantee that you have
avoided the actual obstacle O at any instant in interval
T.


7  Temporal  Objects  in  Sensor 
Planning 

The  essence  of  our  approach  to  sensor  planning 
around  moving  objects  is  to  plan  around  the  tempo¬ 
ral  objects  generated  from  the  original  objects  and 
their  motion.  The  temporal  objects,  once  calculated 
as  regions  in  space,  are  treated  as  stationary  objects 
for  a  static  planning  problem.  In  many  cases,  this  is 
too  restrictive.  There  may  be  one  viewpoint  which 
is  valid  for  one  portion  of  the  time  interval,  and  an¬ 
other  viewpoint  which  is  valid  for  another  portion, 
yet  there  is  no  viewpoint  which  is  valid  for  the  en¬ 
tire  interval.  In  cases  such  as  this,  our  algorithm 
subdivides  the  interval  in  half  whenever  faced  with 
a failure, and replans for the two halves independently.

More formally, this algorithm is described as follows.
Assume we have a polygonal target r which we
wish to monitor during the time interval T = [t_0, t_n].
During T, there is a set of known obstacles O_0
through O_m, which move in known paths. The goal
is  to  plan  a  single  viewpoint  valid  for  the  entire  inter¬ 
val,  if  such  a  point  exists,  or  to  determine  a  sequence 
of  viewpoints. 


Figure 5: Generation of a Temporal Object. Panel labels: original moving polyhedron; plane perpendicular to the motion vector containing the “rearmost” point; convex hull of the projection onto plane P; outline of the temporal object overlaid on the real object; the temporal object.


Temporal  Sensor  Planning  (TSP) 

1. Compute $T(T, O_i)$ for each of the m obstacles.

2.  Use  MVP  to  compute  a  viewpoint  using  the  set 
of  temporal  objects  spanning  time  interval  T  as 
the  potential  occluding  bodies. 

3.  If  MVP  can  successfully  find  a  viewpoint  which 
is  valid  in  the  presence  of  all  of  the  temporal 
obstacles,  it  is  guaranteed  to  be  valid  for  any 
instant  in  T.  If  such  a  point  is  found  with  MVP, 
the  algorithm  terminates  with  a  successful  re¬ 
sult:  it  has  planned  a  single  viewpoint. 

4. If no such viewpoint is obtainable, we divide
T into $T_1 = [t_0, t_{n/2}]$ and
$T_2 = [t_{n/2}, t_n]$, and run the Temporal
Sensor Planning algorithm on each subinterval.

The binary partitioning of the last step continues
until we have found a valid viewpoint for all of T,
or until we have divided into time intervals of some
minimal preset length ε. If sub-intervals that are
too small are reached, the determination is that the
motion is too complex for this method to provide
meaningful results. Typically, one chooses ε to be
large enough so that it is feasible to stop and
reset the sensor every ε interval, since, in the
worst case, that is what might happen. If the TSP
algorithm determines that the motion in the workcell
is too complex to plan viewpoints, it might be an
indication that the activity in the workcell is too
complex to be monitored and that the task itself
should be replanned.
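
The recursion described above can be sketched in a
few lines of Python. This is a minimal sketch, not the
authors' implementation: `plan_viewpoint` is a
hypothetical callback standing in for steps 1 and 2
(building the temporal objects for an interval and
running MVP on them), and treating "too small" as
"a half-interval shorter than `min_length`" is an
assumption about how the preset minimum is applied.

```python
from typing import Callable, List, Optional, Tuple

Interval = Tuple[float, float]
Viewpoint = Tuple[float, float, float]
Plan = List[Tuple[Interval, Viewpoint]]

def temporal_sensor_plan(
    interval: Interval,
    plan_viewpoint: Callable[[Interval], Optional[Viewpoint]],
    min_length: float,
) -> Optional[Plan]:
    """Return (interval, viewpoint) pairs covering `interval`, or None.

    None means the motion was judged too complex: some sub-interval would
    have to be shorter than `min_length` to admit a viewpoint.
    """
    t0, tn = interval
    viewpoint = plan_viewpoint(interval)  # stand-in for steps 1-2 (hypothetical)
    if viewpoint is not None:
        return [(interval, viewpoint)]    # step 3: one viewpoint suffices
    half = (tn - t0) / 2.0
    if half < min_length:
        return None                       # refuse to split below the minimum
    mid = t0 + half
    first = temporal_sensor_plan((t0, mid), plan_viewpoint, min_length)
    if first is None:
        return None
    second = temporal_sensor_plan((mid, tn), plan_viewpoint, min_length)
    if second is None:
        return None
    return first + second                 # step 4: sequence of viewpoints

def toy_planner(interval: Interval) -> Optional[Viewpoint]:
    # Toy stand-in: pretend a viewpoint exists only for short intervals.
    t0, tn = interval
    return (t0, 0.0, 1.0) if tn - t0 <= 2.5 else None

plan = temporal_sensor_plan((0.0, 10.0), toy_planner, min_length=1.0)
too_complex = temporal_sensor_plan((0.0, 10.0), toy_planner, min_length=3.0)
print(plan)
print(too_complex)
```

With the toy planner, the ten-unit interval is split
twice before every piece admits a viewpoint; raising
the minimum length makes the same problem report
failure instead.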

It is important to note that in step 2 of the TSP
algorithm, we rely on MVP to identify the fact that
it has been unable to find a valid sensor viewpoint.
MVP itself relies on a nonlinear constrained
optimization to compute a viewpoint. The failure of
MVP to find a viewpoint does not guarantee that one
does not exist. However, when a valid viewpoint is
so well-hidden within the hypersurfaces of
8-dimensional parameter space that our optimization
routine cannot locate it, there is a very good
chance that the viewpoint does not provide much of
a margin for error. Therefore MVP’s failure to find
a viewpoint, while not a guarantee of the
nonexistence of a valid viewpoint, is an excellent
indication that there is no good viewpoint.

8  Experimental  Results 

We have implemented TSP in conjunction with the
MVP system, and have produced simulated results
showing the effectiveness of TSP as a method of
extending 3-dimensional planning algorithms to
include the time domain. While only the visibility
constraint of MVP is of concern to TSP, it is
important to note that the other constraints do play
a role in whether or not TSP can find a valid
viewpoint. In these experiments, the sensor modeled
is a typical CCD camera, and a resolution constraint
of one pixel per 0.1 inches was used. The geometric
models of the objects and environment as well as
those of the temporal objects were calculated using
the ACIS solid modeler [9].


Figure 6: Simulation of TSP and MVP

[Panel labels: (a) target and obstacle (obstacle’s
trajectory shown); (b) temporal obstacle overlaid on
original; (c) volume of occlusion generated from
temporal obstacle; (d) volume of occlusion
subtracted from work envelope of robot to yield
visibility volume; (e) visibility volume showing
target and planned viewpoint.]

In  the  first  example,  a  polyhedral  object  C  is  the 
obstacle  which  moves  in  a  linear  trajectory  above 
an  octagonal  target  at  a  constant  velocity  over  time 
interval  T  (figure  6a).  Next,  using  the  algorithm 
presented  earlier,  T(T,C)  is  generated  (figure  6b). 
The  corresponding  volume  of  occlusion  is  also  gener¬ 
ated  using  the  methods  presented  in  [12]  (figure  6c). 
Note  that  this  volume  represents  an  overapproxima¬ 
tion  to  the  volume  in  space  from  which,  at  some 
point  in  the  interval  T,  the  view  of  the  target  would 
be  blocked  due  to  C. 

The  volume  of  occlusion  is  subtracted  from  the 


work  envelope  of  the  robot  carrying  the  camera  to 
yield  the  visibility  volume  (figure  6d).  Note  that 
any  point  in  this  volume  is  now  guaranteed  to  have 
an  occlusion-free  view  of  the  target  region  for  the 
entire  time  interval  T.  This  volume  is  used  in  the 
optimization  portion  of  MVP  to  generate  the  shown 
viewpoint  which  is  clearly  valid  for  the  entire  motion 
of C. Note that in Figure 6e, the planned viewpoint
is clearly within the visibility volume; the line of
sight from the viewpoint to all points on the target
does not violate the temporal obstacle.

In  the  next  example,  we  lowered  C  so  that  MVP 
would  be  unable  to  find  a  viewpoint  which  main¬ 
tained  both  visibility  and  focus  over  the  entire  time 
interval.  The  partition  of  the  time  interval  produced 
two  sub-intervals,  both  of  which  had  valid  view¬ 
points easily computed using MVP (see Figure 7).


Figure 7: Simulation of TSP and MVP showing Binary
Partition

[Surviving panel labels: planned viewpoint for time
interval 0; (d) visibility volume and solution for
the first interval.]


9  Conclusions 


This  paper  has  presented  two  main  ideas.  First,  we 
have  presented  work  in  extending  geometric  plan¬ 
ning  problems  to  include  time  and  motion.  The  use 
of  the  temporal  object  makes  the  temporal  compo¬ 
nents  of  the  problem  invisible  to  any  underlying  ge¬ 
ometric  solver.  This  makes  it  a  very  useful  notion; 
many  other  geometric  planning  problems  may  ben¬ 
efit  from  this  concept. 

Second, we have successfully extended our MVP
system to plan sensor locations in a time-varying
environment. This is notable in that, to the best of
our knowledge, motion has not been widely addressed
in the sensor planning literature. This particular
planning task is quite important to us and we plan
on concentrating future work on improving the
performance of MVP under various conditions,
especially under more weakly constrained motion
conditions.

For  example,  while  the  methods  outlined  above 
work  well  for  the  translational  motion  of  known  ob¬ 
stacles,  they  have  yet  to  be  used  to  model  the  motion 
of  the  target  as  well  as  the  obstacles.  The  fact  that 
we  have  constrained  ourselves  to  moving  obstacles 
has  allowed  us  to  ignore  the  effects  of  motion  on  all 
constraints  other  than  visibility.  Once  we  allow  the 
target  to  move,  resolution,  focus,  and  field-of-view 
constraints  must  be  dealt  with  in  an  environment 
which changes over time. Generating a temporal
object for the target seems to be a reasonable
approach, but it may not be sufficient. The problem
of  a  moving  target  is  inherently  more  complex  than 
that  of  a  moving  obstacle. 

Note that in sweeping a polyhedral obstacle along
a linear trajectory, as we have done, the result is
another polyhedral obstacle. Now note that a target
consisting of a single point, when swept along a
linear path, yields a line; a linear target swept
along a linear path yields a polygon; a polygonal
target swept linearly yields a polyhedron. Future
work will explore how the planning algorithm can
plan to view this temporal target, which is one
dimension higher than the original target.

There  may  be  other  methods  which  may  be  used 
to  plan  sensor  viewpoints  around  moving  obstacles 
which  take  advantage  of  the  fact  that  the  visibility 
volume  is  the  only  constraint  which  changes  through 
time.  We  may  be  able  to  examine  how  the  actual  ob¬ 
stacles  move  in  relation  to  the  current  sensor’s  line- 
of-sight  to  the  target  to  determine  when  replacement 
may  be  needed,  for  example. 

Allowing motion in the sensor planning problem
opens it up to many more subproblems which we hope
to explore in future work. For example, it may be
desirable to examine ways of ensuring that the
series of viewpoints lies along a path which is
accessible. We may also explore ways of generating
a continuous path along which to move the sensor
which maintains a particular view of a target.


Abstract

As  mobile  robots  attempt  more  difficult  tasks  in 
more  complex  environments,  they  are  faced  with 
combinatorially  harder  perceptual  problems.  In 
fact,  computation  costs  for  perception  can  easily 
dominate  the  costs  for  planning  in  a  mobile  robot. 
Existing  perception  systems  on  mobile  robots  are 
potentially  many  orders  of  magnitude  too  slow  for 
real-world  domains.  In  this  paper  we  show  that 
active  vision  at  the  system  level  can  make 
perception  more  tractable.  We  describe  how  our 
planning  system  for  a  complex  domain,  tactical 
driving,  makes  specific  perceptual  requests  to  find 
objects  of  interest.  The  perception  system  then 
scans  the  scene  using  routines  to  search  for  these 
objects  in  limited  areas.  This  selective  vision  is 
based  on  an  understanding  and  analysis  of  the 
driving  task.  We  illustrate  the  effectiveness  of 
request-driven  routines  by  comparing  the 
computational  cost  of  general  scene  analysis  with 
that  of  selective  vision  in  simulated  driving 
situations. 


1.  Introduction 

Today’s  mobile  robots  can  drive  down  hallways 
and  roads  and  across  fields  without  getting  stuck  or 
colliding with obstacles. As these robots attempt
more  challenging  tasks,  they  will  need  better 
perception  and  planning  systems.  In  the  past, 
planning  and  perception  have  been  independent 
components  of  the  robot  reasoning  system.  Figure 
1-1  shows  that  in  traditional  systems  the  planning 
component  works  on  a  symbolic  world  model,  which 
it  naively  assumes  the  perception  component  keeps 
up  to  date.  The  perception  component  must 
therefore  find  and  understand  everything  of 
potential  interest  to  the  planner.  This  arrangement 
cannot  be  used  in  robots  in  the  real  world  because 
scenes  are  too  complex  and  change  too  fast  for 
unguided,  exhaustive  interpretation. 

Consider the driving scene shown in Figure 1-2.
It contains several vehicles. The vehicles are at
different distances and directions from the
observer, and so appear in different locations in
the image. A general perception system attempting
to find all vehicles in the scene would have to
search all possible locations, ranges, and poses.
For each of these possibilities, the perception
system would also have to consider variations in
vehicle shape, color, and illumination. The driving
scene changes from moment to moment, with vehicles
moving, disappearing, reappearing, and becoming
partly occluded. The combinatorics of perception in
the real world are enormous.


Figure 1-1: A traditional robot control system.


Figure 1-2: A driver can see traffic objects in
many places.

Figure 1-2 also illustrates that even if traffic
objects can be found in various poses throughout a
scene, it is wasteful to look for them because they
are not all important to the robot. We propose to
reduce perceptual complexity by using system-level
active vision: using the planner to limit what a
robot has to look for, and where it has to look.¹
Figure 1-3 shows the resulting architecture. A robot
driver can then use its perception effectively, as
shown in Figure 1-4.


Figure 1-3: An active-vision robot control system.


Figure 1-4: Desirable visual search constraints.


In  the  next  section  we  introduce  the  domain  of 
driving  in  traffic,  and  describe  how  we  define  it 
using  an  environment  simulator  and  computational 
model.  We  then  estimate  the  computational  cost  of 
general  sensing  in  the  driving  domain.  Such  an 
estimate  demonstrates  the  futility  of  a  naive 
approach  to  perception  for  a  real  problem,  and 
provides  a  basis  for  evaluating  our  active  vision 
techniques.  Section  3  describes  two  such 
techniques.  Section  4  illustrates  how  selective 
perception  techniques  work  for  driving.  We  discuss 
how  we  have  implemented  our  driving  model  in  a 
computer program so that we can study perceptual
issues  in  simulation.  Next  we  demonstrate  how 
perceptual  routines  are  used  in  a  driving  situation. 
Finally,  the  section  shows  the  results  of  simulating 
a  driving  situation  and  compares  the  computational 
cost  to  the  naive  approach.  We  conclude  in  Section 
5  that  general  perception  is  hopelessly  intractable 
and that selective perception is necessary for a task
such  as  driving. 


¹This paper emphasizes demand-driven perception and
routines, but does not address other reductions in
search made possible by modeling the world better.
Our current work addresses those issues.


2.  Driving  with  General  Perception 

Driving  in  traffic  is  an  example  of  a  complex  task 
in  a  complex,  dynamic  environment.  This  paper 
addresses  tactical  driving  [12],  which  is  the 
selection of speed and steering maneuvers. Tactical
driving requires features of both real-time
servo-control systems and symbolic reasoning
systems. Decisions must be made dynamically in
response to changing traffic situations. Tactical
driving also involves reasoning about road
configurations and traffic control devices to
figure out right-of-way puzzles. The environment is
visually complex
because  of  the  many  objects  and  types  of 
objects — cars,  roads,  markings,  signs,  signals,  etc. 
These  objects  vary  in  appearance  under  different 
conditions.  Previous  work  in  autonomous  road 
following  and  car  and  sign  recognition  [3, 4,  5, 8, 10] 
has  shown  that  perceiving  individual  traffic  objects 
in  constrained  situations  is  computationally 
expensive;  finding  all  objects  in  all  situations  in  real 
time  is  effectively  intractable.  In  this  section  we 
estimate  just  how  difficult  this  perception  problem 
is. 


2.1.  Domain  Assumptions  That  Affect 
Perception 

The environment. Driving is a complex human
activity that is difficult to describe precisely.
However, in order to program an autonomous robot to
perform such an activity, we must be able to
describe exactly what it is the robot has to do.
First of all, we must describe the important
aspects of the environment. In our research, we did
this by developing a model of the driving
environment and implementing it in a microscopic
traffic simulator called PHAROS [13]. PHAROS
contains detailed representations of roads, lanes,
intersections, signs, signals, markings, and cars.
Although PHAROS provides a rich setting for driving
experiments, it also makes important abstractions
that simplify and limit the driving problem for our
work. For example, all roads are structured with
lanes; also, there are no pedestrians or bicyclists
in PHAROS.

Task  requirements.  Once  we  have  created  a 
representation  of  the  world,  we  must  specify  what  it 
means  to  drive.  Although  driving  laws  are 
published  in  books  [9],  and  the  driving  task  has 
been  thoroughly  analyzed  in  isolated  situations  [11], 
there  are  no  driving  descriptions  that  actually 
specify  what  actions  to  take  at  any  time.  We  have 
developed  a  computational  model  of  tactical  driving 
called  Ulysses  [14]  that  does  encode  what  action  to 
take  in  any  situation  in  the  PHAROS  world. 
Ulysses  is  a  sophisticated  model  that  incorporates 
knowledge of speed limits, car following, lane
changing, traffic control devices, right-of-way rules,
and  simple  vehicle  dynamics.  We  use  a 
programmed  implementation  of  Ulysses  to  drive  one 
vehicle  in  the  PHAROS  world.  Ulysses  is  our 
definition  of  the  driving  task  for  this  research. 

A  general  perception  system  has  to  find  all  traffic 
objects  of  potential  interest  to  the  driving  planner. 
In  our  driving  model,  these  objects  include  road 
regions,  road  markings,  vehicles,  signs  and  signals. 
By  characterizing  the  size  of  these  objects,  we  can 
determine  the  maximum  angular  resolution  needed 
to  detect  them.  Traffic  objects  are  characterized  in 
more  detail  in  the  technical  report  version  of  this 
paper  [15]. 

Since  it  is  not  practical  to  consider  perceiving  the 
entire  world,  we  arbitrarily  set  a  range  limit  on  the 
system.  In  some  driving  situations  it  may  be 
desirable  to  see  long  distances  ahead;  for  example, 
signals are supposed to be visible from 218m away
on a road with 100kph traffic [6, pg. 4B-11], and
the sight distance needed for passing is given as
305m at this speed [1, pg. 147]. While Ulysses is
capable of driving on simulated highways, for this
work we have concentrated on arterial urban streets
where the visual environment is more diverse. Since
streets have lower speeds than highways, we have
chosen 150m as the perceptual range limit.

Sensors.  Since  the  environment  is  so  varied  in 
appearance,  we  assume  that  analysis  must  be  based 
on  several  types  of  sensor  data.  We  assume  the 
robot  has  cameras  and  laser  rangefinders  whose 
images  can  be  registered.  This  combination  is 
complementary in that a rangefinder is almost
immune to the illumination changes across an
object which confuse color-based segmentation. A
camera, on the other hand, can discern markings
and other important regions on uniform surfaces
like signs [7]. Since the robot must find objects in
all  directions,  we  assume  it  has  sensors  pointing  in 
several  directions.  We  thus  avoid  issues  such  as 
sensor  aiming  time.  Finally,  we  assume  that  the 
driving  robot  has  sensors  with  enough  resolution  to 
discern  the  smallest  traffic  objects  at  the  maximum 
range.  In  our  model  of  the  environment,  the 
smallest  feature  is  a  lane  marking  that  is 
perpendicular to the line of sight. At 150m, such
lines  require  pixels  of  about  2.SxlO~^  degrees. 


2.2.  Cost  Estimates 

Interpreting general traffic scenes will be very
difficult. We would expect the general perception
system to extract many features from the image,
including regions, boundaries, lines, corners, etc.
The complexity of the environment will in general
prevent perception from using single, uniform
features to uniquely distinguish objects. Extraction
will use intensity, color, optical flow, range, and
reflectance data. Features must be grown,
characterized, and merged. Scene features will be
matched against features of traffic object models to
identify traffic objects. This matching will be done
in two stages; for example, sign surfaces will first
be located before the sign message is examined at
higher resolution. Our analysis is explained in
detail in [15].

Based on work at CMU on robot driving, and on
an examination of the literature, we estimate that
the cost of general perception in this domain is

$$\mathrm{Cost} = 1.1 \times 10^{4} P + 1.1 P^{2} + 1.7 \times 10^{-5} P^{3} \qquad (1)$$

where P is the number of pixels in the image. The
unit of cost is an arithmetic operation on a data
value or pixel. The linear term reflects the
computations to find colors, edges, optical flow,
etc. at each pixel; the squared term comes from
pixel clustering and comparing pairs of features to
each other; and the cubic term is our estimate of
the cost of matching, using constraints to prune
the search space. Traffic objects are small enough
to require about 1cm resolution, which translates
to about $8 \times 10^{7}$ pixels. The net cost is then

$$\mathrm{Cost} = 8.9 \times 10^{11} + 7.2 \times 10^{15} + 9.0 \times 10^{18} \approx 9.0 \times 10^{18}.$$

This means that a computer that could perform $10^{8}$
operations in the 100ms Ulysses decision cycle
would be almost 11 orders of magnitude too slow to
analyze the scene. Even if our estimates are off by
several orders of magnitude, it is clear that general
perception is intractable.
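
The arithmetic behind this estimate can be checked
with a short script. It is a sketch under stated
assumptions: the exponents are read as $10^{4}$ and
$10^{-5}$ in equation (1), the pixel count as roughly
$8 \times 10^{7}$ (8.1e7 is used here because it
reproduces the quoted three terms), and the computer
budget as $10^{8}$ operations per 100ms cycle. These
readings are the ones that make the quoted numbers
mutually consistent; the function and constant names
are hypothetical.

```python
import math

def general_perception_cost(pixels):
    """Return the linear, squared, and cubic terms of equation (1)."""
    linear = 1.1e4 * pixels        # per-pixel feature extraction
    quadratic = 1.1 * pixels ** 2  # clustering and pairwise feature comparison
    cubic = 1.7e-5 * pixels ** 3   # model matching with pruning
    return linear, quadratic, cubic

PIXELS = 8.1e7       # assumed pixel count, about 8 x 10^7
OPS_PER_CYCLE = 1e8  # assumed operations available in one 100ms cycle

terms = general_perception_cost(PIXELS)
total = sum(terms)
# Orders of magnitude by which the machine falls short of the demand.
shortfall = math.log10(total / OPS_PER_CYCLE)

print(["%.1e" % t for t in terms])
print("%.1e" % total, round(shortfall, 2))
```

The cubic matching term dominates, so under these
assumptions the shortfall is set almost entirely by
the cost of matching scene features against models.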


3.  A  Model  of  Active  Vision  for  Driving 


3.1. Using Task Knowledge to Constrain
Search

We  limit  visual  search  in  our  driving  system  by 
having  the  planner  direct  the  actions  of  perception. 
Perception  is  demand-driven  rather  than  forward¬ 
flowing.  The  driving  planner  uses  task-specific 
knowledge  to  determine  what  objects  are  relevant  to 
the  current  driving  situation. 

Since perception only acts upon the request of the
planner, Ulysses starts its decision process with
almost no information. At the beginning of each
decision cycle, perception finds one key object:
the road in front of the robot. Ulysses then applies
appropriate driving knowledge to interpret the
situation and generate action constraints. This
knowledge generally involves other objects related
to the key object. For example, Ulysses looks for
cars on and signs next to the road in front of the
robot. Thus the program generates demands for the
perception system to look for these cars and signs.
When objects are found (especially objects
indicating an intersection), additional rules may
become relevant and more objects may need to be
found. Ulysses’ demands thus incrementally
determine what the perception system must look for
in the scene.

Ulysses  also  controls  visual  search  by 
constraining  where  the  perception  system  has  to 
look  in  the  image.  As  Figure  1-4  shows,  not  all 
traffic  objects  are  important  to  the  robot;  only  the 
ones  with  specific  relations  to  the  robot  are 
important.  For  example,  the  Stop  sign  in  the 
middle  of  the  figure  affects  the  robot  because  it  is 
along  the  robot’s  intended  path.  The  Stop  sign  to 
the  far  right  is  not.  Ulysses  can  thus  specify  where 
to  look  for  objects — not  directly  in  terms  of  azimuth 
and  elevation  ranges,  but  in  terms  of  previously 
observed  objects.  Ulysses  conveys  this  information 
by  making  perceptual  requests  in  the  form  of 
perceptual routines.


3.2.  Perceptual  Routines 

Perceptual routines are sequences of image
processing operations. Ullman describes how visual
routines² may be used in the human vision system
to determine object properties and spatial
relations [16]. While the input to such routines is
an image processed from the bottom up, the
routines themselves are invoked from the top down
when needed in different tasks. Agre and Chapman
used visual routines in their video game-playing
system, Pengi [2]. The Pengi planner executed
visual routines just like any other actions in order
to get information about the world state.

The perception system for Ulysses was inspired
by Pengi. As we described above, the rules for
driving also have spatial relations incorporated in
them. Ulysses uses routines to directly find roads
related to the robot, cars related to the roads,
signals related to particular intersections, etc.
Figure 3-1 lists the routines currently available to
Ulysses. The routines in Ulysses assume more
domain-dependent processes are available to the
routines than is the case in Pengi; for example, the
routines must understand how to trace roads
instead of just contours, and how to stop scanning
when a sign is recognized. Most of the routines in
the figure mark objects and locations when they
finish, so Ulysses can continue finding objects later.
For example, track-lane stops when an intersection
is encountered (indicated by the change in lane lines
in the well-marked PHAROS world), but marks the
intersection so that find-path-in-intersection,
find-signal, etc. can find objects relative to that
intersection. The next section illustrates how
Ulysses uses routines.


²We use the term "perceptual routines" instead of
"visual routines" in our work to emphasize that they
may include higher levels than just low-level vision,
and that they may include other sensing
technologies.


Figure 3-1: Perceptual routines for Ulysses.


Agre  and  Chapman  point  out  two  advantages  of 
routines  in  terms  of  knowledge  representation. 
First,  only  relevant  objects  in  the  world  need  to  be 
represented  explicitly  in  the  agent’s  internal  model. 
Pengi,  for  example,  does  not  have  to  model  all  of  the 
blocks  on  the  video  screen.  Second,  when  the 
planner  needs  to  know  if  there  is  an  object  with  a 
certain  relation  to  a  reference  object,  it  does  not 
need to check all visible objects to find out.

These  representational  advantages  have 
perceptual  analogs  which  are  better  illustrated  by 
Ulysses  because  it  addresses  perception  more 
realistically. First, not only do perceptual routines
avoid  having  to  represent  all  world  objects 
internally,  they  avoid  having  to  look  for  all  the 
objects.  In  Pengi  perception  was  abstract  and 
essentially  cost-free;  for  a  real  driving  robot,  looking 
for  things  is  extremely  expensive  and  any 
reductions  in  sensing  requirements  are  important. 
Second,  checking  the  relationship  between  two 
objects  is  more  complicated  than,  say,  looking  at  a 
value  in  a  property  list.  Computing  arbitrary 
geometric  relations  may  be  difficult,  and  it  is  much 
better  to  do  it  once  and  find  the  appropriate  object 
directly.  For  example,  finding  a  car  on  a  road  could 
involve  making  a  region-containment  test  on  every 
visible  car;  it  would  be  better  to  scan  the  road  region 
until the car was found.


4. Driving with Perceptual Routines


4.1.  Implementing  Ulysses 

Execution Architecture. Ulysses has a simple
architecture that makes it almost completely
reactive. In other words, at each decision time
(every 100ms), Ulysses makes a fresh analysis of the
situation without using any information from the
past (except for a few bits of state). Ulysses thus
does not actually make and execute plans, but
continuously decides what the robot should do now.
We are assuming for now that any number of
perceptual routines can run in one decision cycle.
This scheme allows Ulysses to respond quickly to any
new situation.

The  reactive  system  may  seem  unrealistic 
because  it  requires  the  perception  system  to  do  all  of 
its  work  in  a  short  time  interval.  However,  time  can 
be  traded  off  against  the  number  of  cameras  and 
computational  resources,  so  we  can  still  use  this 
model  by  interpreting  cost  in  terms  of  resources 
rather  than  time.  Future  work  with  Ulysses  will 
investigate  architectures  that  make  use  of  a  larger, 
persistent  world  model. 

Simulated Perception. Most of the cars in
PHAROS  (which  we  call  "zombies")  do  not  perform 
any  perception  or  scene  interpretation  tasks. 
However,  the  car  controlled  by  Ulysses  gets  all  of  its 
information  about  the  world  through  a  restricted 
interface  to  PHAROS.  PHAROS  simulates  the 
execution  of  perceptual  routines  and  passes 
symbolic  information  back  to  Ulysses.  By 
monitoring the interface between the two programs,
we  can  determine  exactly  what  information  Ulysses 
requires  at  any  time.  Although  the  PHAROS  world 
is  artificial  and  devoid  of  real-world  features,  we 
examine  perception  costs  as  if  the  robot  were  in  a 
more  realistic  environment. 


4.2.  Use  of  Routines  at  an  Intersection 

We  will  now  illustrate  the  use  of  our  perceptual 
routines  in  a  simulated  driving  scenario.  The 
situation,  shown  in  Figure  4-1,  is  that  the  robot  is 
approaching  a  left-side  road  and  must  turn  left. 
There are a few speed limit signs, a STOP sign, a
"left  turn  only"  sign  overhead  and  a  "left  turn  only" 
arrow  on  the  pavement.  The  robot’s  road  has  four 
lanes,  while  the  side  road  has  only  two. 

Since Ulysses is reactive, it uses the perceptual
routines over and over again as it drives the robot.
Later we will show how the perceptual cost varies
during the scenario; here we will show only one
decision cycle. At this point in the scenario there
is a car approaching from the through direction, one
in the adjacent lane behind, and one in the oncoming
lanes (having turned from the side road).


Figure 4-1: Partway through the left side road
scenario.

The first thing the robot does is find the lane
immediately in front of it. It then tracks the lane
forward  (stopping  at  the  intersection),  looks  for  cars 
in  the  lane,  and  looks  for  signs  along  the  side  of  the 
road.  Figure  4-2  shows  the  areas  scanned  by  these 
four  routines.  The  lane  tracking  routine  leaves  a 
marker  at  the  entrance  to  the  intersection. 
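
The way these four routines chain together through a
marker can be sketched on a toy, one-dimensional lane.
Everything here is hypothetical and only for
illustration: the `ToyLane` structure, the cell-based
geometry, and the function names are assumptions, not
the interface between Ulysses and PHAROS.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Set, Tuple

@dataclass
class ToyLane:
    """A lane ahead of the robot, divided into numbered cells."""
    length: int                                          # number of cells ahead
    intersection_at: int                                 # cell where the intersection begins
    cars: Set[int] = field(default_factory=set)          # cells occupied by cars
    signs: Dict[int, str] = field(default_factory=dict)  # cell -> roadside sign text

def find_lane_ahead(lane: ToyLane) -> int:
    """Locate the key object: the lane cell directly in front of the robot."""
    return 0

def track_lane(lane: ToyLane, start: int) -> int:
    """Follow the lane forward and return a marker at the intersection entrance."""
    for cell in range(start, lane.length):
        if cell == lane.intersection_at:
            return cell
    return lane.length - 1  # no intersection seen: mark the end of the lane

def find_cars_in_lane(lane: ToyLane, start: int, marker: int) -> List[int]:
    """Look for cars only between the robot and the marker."""
    return [c for c in range(start, marker + 1) if c in lane.cars]

def find_signs_beside_lane(lane: ToyLane, start: int, marker: int) -> List[Tuple[int, str]]:
    """Look for roadside signs only between the robot and the marker."""
    return [(c, lane.signs[c]) for c in range(start, marker + 1) if c in lane.signs]

lane = ToyLane(
    length=150,
    intersection_at=60,
    cars={25, 110},
    signs={40: "STOP", 130: "SPEED LIMIT"},
)
start = find_lane_ahead(lane)
marker = track_lane(lane, start)
cars = find_cars_in_lane(lane, start, marker)
signs = find_signs_beside_lane(lane, start, marker)
print(marker, cars, signs)
```

In this toy, the car and the sign beyond the
intersection are never examined: later routines search
only the region bounded by the marker that lane
tracking left behind.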


Figure  4-2:  The  robot  scans  for  the  lane,  cars, 
and  signs  ahead. 


Figure  4-3  depicts  the  process  of  determining  the 
channelization  of  the  robot's  lane  (i.e.,  whether  it 
can  turn  left  from  the  lane).  Ulysses  scans  the
width  of  the  road  (that  is,  profiles  it)  at  the  intersection,
at  the  marker  left  earlier,  to  determine  the  position
of  the  robot’s  lane  at  the  intersection.  The  robot 
also  looks  for  signs  over  the  lane  and  markings  in 
the  lane  to  indicate  turn  restrictions.  These  scans 
are  repeated  until  the  routines  have  reached  the 
end  of  the  lane.  Markers  indicate  where  one  leaves
off  (at  an  observed  sign  or  marking)  and  the  next
one  starts.


Figure  4-3:  The  robot  checks  lane  position  and
looks  for  channelizing  signs  and
marks.


Next  Ulysses  analyzes  the  intersection.  This
requires  searching  the  intersection  area  for  a  path
through  the  intersection,  lead  cars  on  this  path,
crossing  or  obstructing  cars,  traffic  signals  around
the  intersection,  and  other  approach  roads.
Information  Ulysses  already  has  allows  it  to
determine  whether  there  is  a  traffic  control  sign
facing  the  robot,  and  how  many  lanes  the  robot’s
road  has.  Lane  counts  help  determine  which  road  is
bigger  and  are  part  of  the  right-of-way
determination  process  [14].

Figure  4-4  shows  areas  scanned  to  check  for
approaching  traffic  on  each  road,  and  for  STOP  or
YIELD  signs  facing  this  traffic.  Ulysses  first  finds
the  lanes  on  the  approach  roads,  and  then  looks  for
the  closest  car  to  the  intersection  in  each  lane.


Figure  4-4:  The  robot  checks  the  intersection
approaches  for  traffic  and  signs.


In  Figure  4-5  the  robot  is  looking  beyond  the
intersection  to  its  downstream  road.  As  with  the
first  lane,  Ulysses  first  finds  the  lane,  then  looks  for
cars  and  signs.  In  this  case  one  sign  is  found.


Figure  4-5:  The  robot  looks  for  a  lane,  cars,
and  signs  downstream  of  the
intersection.


At  this  point  Ulysses  has  examined  everything
necessary  to  constrain  the  robot’s  speed.  The 
remaining  steps  are  performed  to  select  a  lane. 
Ulysses  considers  lane  selection  even  as  the  robot 
approaches  the  intersection  because  in  some 
situations  it  may  be  desirable  or  necessary  to 
change  lanes  to  move  around  a  slow  car.  Figure  4-6 
shows  that  Ulysses  finds  the  adjacent  lane  (to  the 
right  only,  in  this  case)  and  analyzes  it  much  as  the 
robot’s  lane.  Ulysses  tracks  the  lane  ahead  to  the 
intersection,  and  looks  for  cars,  overhead  signs 
(signs  to  the  right  have  already  been  found), 
markings,  and  lane  position  at  the  intersection. 
The  robot  also  looks  in  the  lane  behind  to  find  the 
car  behind. 


Figure  4-6:  The  robot  looks  forward  and 
backward  for  constraints  affecting 
the  adjacent  lane,  and  traffic  in 
that  lane. 

4.3.  The  Cost  of  Using  Routines

The  perceptual  routines  are  subject  to  the  same 
environmental  conditions  as  the  general  perception 
system,  so  must  pay  the  same  feature  extraction 
costs  in  general.  However,  routines  can  reduce 
perception  costs  in  several  ways. 

•  Reduced  search  area.  The  azimuth 
and  elevation  angles  swept  by  the 
routine  are  much  smaller  than  the  area 
searched  by  the  general  perception 
system. 

•  Range-limited  resolution.  The 
maximum  depth  reached  in  the 
routine’s  search  area  limits  the 
resolution  required. 

•  Object-limited  resolution.  The 
routines  look  for  only  one  type  of  object 
at  a  time;  resolution  is  determined  by 
the  size  of  that  type,  not  by  the  smallest 
of  all  types. 

•  Limited  features.  Since  the  routines 
look  for  only  one  type  of  object  at  a  time, 
they  may  be  able  to  use  simpler 
techniques  to  extract  a  more  limited 
variety  of  features  from  the  image. 

The  effectiveness  of  these  reductions  depends  on  the 
situation;  for  example,  if  the  robot  used  several
routines  to  search  the  same  area  for  different 
objects,  it  would  not  get  the  benefit  of  limited 
features  or  object-limited  resolution. 

In  some  cases  recognition  algorithms  used  by  the 
routines  could  be  simpler  than  those  used  in  the 
general  system.  For  example,  when  looking  for  a 
car  on  a  road  ahead,  it  is  only  necessary  to  detect  a
large  solid  blob.  The  routine  is  looking  only  on  a 
road,  so  any  blob  there  may  reasonably  be  assumed 
to  be  a  car.  Fewer  features  are  needed  to  recognize 
blobs  than  are  needed  to  recognize  cars.  However, 
for  the  purposes  of  this  paper  we  assume  that  the 
recognition  method  is  the  same  for  routines  as  it  is 
for  the  general  system. 

We  calculate  the  cost  of  using  routines  using  the 
same  formula  we  used  for  the  general  system 
(Equation  1).  However,  we  use  a  feature  size 
particular  to  the  routine,  and  the  particular  area 
searched  by  the  routine  (as  illustrated  by  Figures
4-2  through  4-6  above)  to  calculate  the  number  of
pixels.  Resolution  is  also  allowed  to  decrease  as  the 
search  area  gets  closer  to  the  robot  (as  it  is  in  the 
general  perception  case).  We  describe  the  cost 
calculations  for  routines  in  greater  detail  in  [15]. 
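
The savings listed above can be illustrated with a rough pixel-count sketch. This is not Equation 1, which is not reproduced in this section. It assumes, purely for illustration, that the number of pixels is the angular search window divided by the square of the angular resolution, and that the resolution is set so the smallest feature of interest at the maximum range spans a fixed number of pixels. The function names, the `pixels_per_feature` parameter, and all numeric values are hypothetical, and the sketch ignores the relaxation of resolution for nearby parts of the search area.

```python
import math


def angular_resolution(feature_size_m, max_range_m, pixels_per_feature=2.0):
    """Radians per pixel so that a feature of the given size at the maximum
    range spans `pixels_per_feature` pixels (small-angle approximation)."""
    return feature_size_m / (max_range_m * pixels_per_feature)


def pixel_count(azimuth_deg, elevation_deg, feature_size_m, max_range_m):
    """Approximate number of pixels needed to cover an angular window."""
    res = angular_resolution(feature_size_m, max_range_m)
    return (math.radians(azimuth_deg) / res) * (math.radians(elevation_deg) / res)


# Hypothetical numbers: a general system sweeps a wide window at a resolution
# set by the smallest object of any type; a routine sweeps a narrow window at
# a resolution set by the one object type it is looking for.
general = pixel_count(360.0, 30.0, feature_size_m=0.05, max_range_m=150.0)
routine = pixel_count(10.0, 3.0, feature_size_m=1.5, max_range_m=150.0)
ratio = general / routine
```

Under these assumed numbers the reduction comes from two independent factors: the ratio of the angular windows and the square of the ratio of the feature sizes.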

The  perception  interface  in  PHAROS  computes
the  cost  of  Ulysses’  requests  automatically.  This
allows  us  to  experiment  with  various  scenarios  and 
sensing  strategies  to  see  how  they  affect  perceptual 
cost.  Figure  4-7  shows  the  computed  cost  for  the 
robot  during  the  entire  left  side-road  scenario.  The 
situation  illustrated  in  Figure  4-1  occurs  sometime 
between  points  (3)  and  (4).  We  can  see  that  the 
perceptual  cost  of  using  routines  is  three  orders  of
magnitude  less  than  that  of  general  perception.

Figure  4-7:  Perceptual  cost  of  using  routines  during  the
left  side-road  scenario.  Notes:  Before  (1)  the
robot  is  driving  on  a  4-lane  road;  at  (1)  the  robot
gets  to  within  sensor  range  (150m)  of  the
intersection  and  starts  a  lane  change;  at  (2)  the
robot  gets  to  within  sensor  range  of  the  roads
leaving  the  intersection;  at  (3)  the  robot  finishes
the  lane  change;  at  (4)  the  robot  enters  the
intersection  and  ceases  searching  for
approaching  traffic,  traffic  control  devices,  etc.;
and  at  (6)  the  robot  begins  driving  on  the  2-lane
road.


5.  Conclusions 

We  have  described  how  tactical  driving  is  a 
complex  task  in  a  complex  environment.  While 
clever  bottom-up  image  processing  techniques  will 
be  necessary  for  a  robot  to  handle  the  perceptual 
load  in  this  domain,  they  will  not  be  sufficient.  A 
robot  must  also  use  task  knowledge  to  constrain  the 
perception  problem.  We  have  designed  a  driving 
model  and  used  it  to  control  perception  for 
driving,  that  is,  to  specify  what  the  robot  should  look  for
and  where  it  should  look  for  it.  With  these
constraints,  the  perception  task  is  simpler  and
can  be  performed  more  quickly  by  special  purpose 
routines.  A  general  analysis  of  the  complexity  of 
general  perception  and  simulation  of  driving  using 
routines  shows  that  routines  reduce  computational
cost  by  several  orders  of  magnitude.

We  are  continuing  to  develop  our  driving  model 
and  the  perceptual  routines  for  driving.  We  are  also 
exploring  additional  ways  of  reducing  perceptual 
needs,  such  as  using  models  of  the  coherence  of  the 
world  over  time.  This  research  will  help  to  discover 
ways  that  a  robot  can  perform  interesting  tasks  in 
perceptually  difficult  environments. 


Abstract

Artificial  neural  networks  are  capable  of  performing
the  reactive  aspects  of  autonomous  driving,  such  as 
staying  on  the  road  and  avoiding  obstacles.  This  pa¬ 
per  describes  an  efficient  technique  for  training  in¬ 
dividual  networks  to  perform  these  reactive  driving 
tasks.  But  driving  requires  more  than  a  collection
of  isolated  capabilities.  To  achieve  true  autonomy,
a  system  must  determine  which  capabilities  should
be  employed  in  the  current  situation  to  achieve  its 
objectives.  Such  goal  directed  behavior  is  diffi¬ 
cult  to  implement  in  an  entirely  connectionist  sys¬ 
tem.  This  paper  describes  a  rule-based  technique 
for  combining  multiple  artificial  neural  networks 
with  map-based  symbolic  reasoning  to  achieve  high
level  behaviors.  The  resulting  system  is  not  only
able  to  stay  on  the  road,  it  is  able  to  follow  a  route  to  a
predetermined  destination,  turning  appropriately  at
intersections  and  stopping  when  it  has  reached  its 
goal. 

1  Introduction 

Artificial  neural  networks  are  commonly  employed  as  mono¬ 
lithic  non-linear  classifiers.  The  technique,  often  used  in 
domains  such  as  speech,  character  and  target  recognition,  is
to  train  a  single  network  to  classify  input  patterns  by  showing
it  many  examples  from  numerous  classes.  The  mapping
function  from  inputs  to  outputs  in  these  classification  tasks
can  be  extremely  complex,  resulting  in  slow  learning  and 
unintelligible  internal  representations. 

However,  there  is  an  alternative  to  this  monolithic  network
approach.  By  training  multiple  networks  on  different  aspects
of  the  task,  each  can  learn  relatively  quickly  to  become  an 
expert  in  its  sub-domain.  This  paper  describes  a  technique  we 
have  developed  to  quickly  train  expert  networks  for  vision- 
based  autonomous  vehicle  control.  Using  this  technique,  spe¬ 
cialized  networks  can  be  trained  in  under  five  minutes  to  drive 
in  situations  such  as  single-lane  road  driving,  highway  driv¬ 
ing,  and  collision  avoidance. 

Achieving  full  autonomy  requires  not  only  the  ability  to
train  individual  expert  networks,  but  also  the  ability  to  integrate
their  responses.  This  paper  focuses  on  rule-based
arbitration  techniques  for  combining  multiple  driving  experts
into  a  system  that  is  capable  of  guiding  a  vehicle  in  a  variety
of  circumstances.  These  techniques  are  compared  with  other
neural  network  integration  schemes  and  shown  to  have  a  distinct
advantage  in  domains  where  symbolic  knowledge  and
techniques  can  be  employed  in  the  arbitration  process.

*The  principal  support  for  the  Navlab  has  come  from  DARPA,
under  contracts  DACA76-85-C-0019,  DACA76-85-C-0003  and
DACA76-85-C-0002.  This  research  was  also  funded  in  part  by  a
grant  from  Fujitsu  Corporation.

2  Driving  Module  Architecture 

The  architecture  for  an  individual  ALVINN  driving  module 
is  shown  in  Figure  1.  The  input  layer  consists  of  a  single 
30x32  unit  “retina”  onto  which  a  sensor  image  from  either  the 
video  camera  or  the  laser  range  finder  is  projected.  Each  of 
the  960  input  units  is  fully  connected  to  the  hidden  layer  of  5
units,  which  is  in  turn  fully  connected  to  the  output  layer.  The
30  unit  output  layer  is  a  linear  representation  of  the  currently
appropriate  steering  direction.  The  centermost  output  unit
represents  the  “travel  straight  ahead”  condition,  while  units  to 
the  left  and  right  of  center  represent  successively  sharper  left 
and  right  turns.  The  steering  direction  dictated  by  the  network 
may  serve  to  keep  the  vehicle  on  the  road  or  to  prevent  it  from 
colliding  with  nearby  obstacles,  depending  on  the  type  of 
sensor  input  and  the  driving  situation  the  network  has  been 
trained  to  handle. 
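
As a minimal sketch of the layer sizes described above, the following pure-Python code builds a 960-5-30 fully connected network and runs one forward pass. The sigmoid activation, the bias terms, the random weight range, and all function names are assumptions made for illustration; the text does not specify them.

```python
import math
import random

N_INPUT, N_HIDDEN, N_OUTPUT = 30 * 32, 5, 30  # sizes taken from the text


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def make_layer(n_in, n_out, rng):
    """One fully connected layer: n_out rows of n_in weights plus one bias."""
    return [[rng.uniform(-0.1, 0.1) for _ in range(n_in + 1)]
            for _ in range(n_out)]


def layer_forward(weights, inputs):
    """Weighted sum plus bias for every unit, squashed by the sigmoid."""
    outputs = []
    for row in weights:
        total = row[-1] + sum(w * x for w, x in zip(row, inputs))
        outputs.append(sigmoid(total))
    return outputs


def forward(hidden_w, output_w, retina):
    """Propagate a flattened 30x32 retina through hidden and output layers."""
    return layer_forward(output_w, layer_forward(hidden_w, retina))


rng = random.Random(0)
hidden_w = make_layer(N_INPUT, N_HIDDEN, rng)
output_w = make_layer(N_HIDDEN, N_OUTPUT, rng)
retina = [rng.random() for _ in range(N_INPUT)]  # stand-in for a sensor image
steering_profile = forward(hidden_w, output_w, retina)
```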

To  drive  the  Navlab,  an  image  from  the  appropriate  sensor
is  reduced  to  30  x  32  pixels  and  projected  onto  the  input  layer. 
After  propagating  activation  through  the  network,  the  output 
layer’s  activation  profile  is  translated  into  a  vehicle  steering 
command.  The  steering  direction  dictated  by  the  network 
is  taken  to  be  the  center  of  mass  of  the  “hill”  of  activation 
surrounding  the  output  unit  with  the  highest  activation  level. 
Using  the  center  of  mass  of  activation  instead  of  the  most 
active  output  unit  when  determining  the  direction  to  steer 
permits  finer  steering  corrections,  thus  improving  ALVINN’s
driving  accuracy. 
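
The center-of-mass decoding described above can be sketched as follows. The width of the "hill" (`window`) and the mapping from a fractional unit index to a signed turn fraction are assumptions for illustration; the text gives neither.

```python
def steering_from_activations(activations, window=4):
    """Return a fractional output-unit index: the center of mass of the
    activation "hill" around the most active unit.

    `window` (units on each side of the peak) is a hypothetical parameter.
    """
    peak = max(range(len(activations)), key=lambda i: activations[i])
    lo = max(0, peak - window)
    hi = min(len(activations), peak + window + 1)
    mass = sum(activations[lo:hi])
    if mass == 0.0:
        return float(peak)
    return sum(i * activations[i] for i in range(lo, hi)) / mass


def to_turn_fraction(unit_index, n_units=30):
    """Map a unit index to [-1, 1]; negative means left of center.
    This linear mapping is an assumption, not the paper's calibration."""
    center = (n_units - 1) / 2.0
    return (unit_index - center) / center


# A hill centered between units 10 and 11: picking the single most active
# unit would give 10, while the center of mass gives a finer value.
profile = [0.0] * 30
profile[9], profile[10], profile[11], profile[12] = 0.4, 0.8, 0.8, 0.4
unit = steering_from_activations(profile)
turn = to_turn_fraction(unit)
```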

Figure  1:  The  architecture  for  an  individual  ALVINN  driving
module


3  Individual  Driving  Module  Training  and
Performance

We  have  developed  a  scheme  called  training  “on-the-fly”  to
quickly  teach  individual  modules  to  imitate  the  driving  reactions
of  a  person.  As  a  person  drives,  the  network  is  trained
with  back-propagation  using  the  latest  video  camera  image  as
input  and  the  person’s  current  steering  direction  as  the  desired
output.  To  facilitate  generalization  to  new  situations,  additional
variety  is  added  to  the  training  exemplars  by  shifting
and  rotating  the  original  camera  image  in  software  to  make
it  appear  that  the  vehicle  is  situated  differently  relative  to  the
environment  (see  Figure  2).  The  correct  steering  direction  as
dictated  by  the  driver  for  the  original  image  is  altered  for  each
of  the  transformed  images  to  account  for  the  altered  vehicle
placement.  Adding  these  transformed  patterns  to  the  training
set  allows  the  network  to  learn  to  recover  from  driving  mistakes,
without  requiring  the  human  trainer  to  explicitly  stray
from  the  road  center  and  then  return.  For  more  details  about
the  technique  and  purpose  of  training  on-the-fly,  see
[Pomerleau,  1991].
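
The idea of generating extra exemplars from one image can be sketched in a heavily simplified form. The code below only shifts image rows sideways by whole pixels and adjusts the steering label linearly; it does not rotate the image or account for perspective, and the gain `units_per_pixel` is a hypothetical constant. The actual transformation is the one described in [Pomerleau, 1991].

```python
def shift_row(row, shift, fill=0.0):
    """Shift one image row right by `shift` pixels (left if negative),
    padding the vacated pixels with `fill`."""
    n = len(row)
    out = [fill] * n
    for x, value in enumerate(row):
        nx = x + shift
        if 0 <= nx < n:
            out[nx] = value
    return out


def shifted_exemplar(image, steering_unit, shift,
                     units_per_pixel=0.5, n_units=30):
    """Build one extra training exemplar from an image and its label.

    image         -- list of rows of pixel values
    steering_unit -- driver's steering direction as an output-unit index
    shift         -- horizontal shift in pixels (positive = scene moves right)
    """
    new_image = [shift_row(row, shift) for row in image]
    # If the scene moves right in the image, the road lies further to the
    # right of the vehicle, so the corrected label steers further right
    # (a higher output-unit index).
    new_unit = steering_unit + units_per_pixel * shift
    new_unit = min(max(new_unit, 0.0), n_units - 1.0)
    return new_image, new_unit


image = [[0, 1, 0, 0],
         [0, 0, 1, 0]]
right_image, right_unit = shifted_exemplar(image, 14.5, 1)
left_image, left_unit = shifted_exemplar(image, 14.5, -2)
```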

Running  on  three  Sun-4  Sparcstations,  training  on-the-fly
requires  about  five  minutes  during  which  a  person  drives  the
Navlab  at  about  six  miles  per  hour  over  a  1/4  to  1/2  mile
stretch  of  training  road.  Once  it  has  learned,  the  network
can  accurately  traverse  the  length  of  road  used  for  training
and  also  generalize  to  drive  along  parts  of  the  road  it  has
never  encountered  under  a  variety  of  weather  conditions.  In
addition,  since  determining  the  steering  direction  from  the
input  image  merely  involves  a  forward  sweep  through  the
network,  the  system  is  able  to  process  20  images  per  second,
allowing  it  to  drive  at  up  to  the  Navlab’s  maximum  speed  of  20
miles  per  hour¹.  This  is  over  twice  as  fast  as  any  other
sensor-based  autonomous  system  has  driven  the  Navlab  [Kluge  and
Thorpe,  1990,  Crisman  and  Thorpe,  1990].


Figure  2:  The  single  original  video  image  is  shifted  and  rotated
to  create  multiple  training  exemplars  in  which  the  vehicle
appears  to  be  at  different  locations  relative  to  the  road.


Figure  3:  Video  images  taken  on  three  of  the  roads  ALVINN
modules  have  been  trained  to  handle.  They  are,  from  left
to  right,  a  single-lane  dirt  access  road,  a  single-lane  paved
bicycle  path,  and  a  lined  two-lane  highway.

The  flexibility  provided  by  training  on-the-fly  has  facili¬ 
tated  the  development  of  individual  driving  networks  to  han¬ 
dle  numerous  situations.  Using  video  camera  images  as  in¬ 
put,  networks  have  been  trained  to  drive  on  single-lane  dirt 
roads,  single-lane  paved  roads,  two-lane  suburban  neighbor¬ 
hood  streets,  and  lined  two-lane  highways  (See  Figure  3). 

By  replacing  the  video  input  with  alternative  sensor  modal¬ 
ities,  ALVINN  has  learned  other  interesting  behaviors.  One 
such  sensor  onboard  the  Navlab  is  a  scanning  laser  range 
finder.  The  range  finder  provides  images  in  which  pixel  val¬ 
ues  represent  the  distance  from  the  range  finder  to  the  corre¬ 
sponding  area  in  the  scene.  Obstacles  such  as  trees  and  cars 
appear  as  discontinuities  in  depth,  as  can  be  seen  in  the  sim¬ 
ulated  range  finder  image  at  the  bottom  of  Figure  4.  Using 
this  sensor,  separate  ALVINN  modules  have  been  trained  to 
avoid  collisions  in  obstacle-rich  environments  and  to  follow 
alongside  rows  of  parked  cars. 

A  third  type  of  image  used  as  input  to  ALVINN  modules 
comes  from  a  laser  reflectance  sensor.  In  this  type  of  image,  a
pixel’s  value  corresponds  to  the  amount  of  laser  light  that  reflects
off  the  corresponding  point  in  the  scene  and  back  to  the
sensor.  The  road  and  off-road  regions  reflect  differently,  mak¬ 
ing  them  distinguishable  in  the  image  (see  Figure  4).  Laser 
reflectance  images  in  many  ways  resemble  black  and  white 
video  images,  but  have  the  advantage  of  being  independent 
of  ambient  lighting  conditions.  Using  this  sensor  modality, 
we  have  trained  a  network  to  follow  single-lane  roads  in  total 
darkness. 

¹The  Navlab  has  a  hydraulic  drive  system  that  allows  for  very
precise  speed  control,  but  that  prevents  the  vehicle  from  driving  over
20  miles  per  hour.


Figure  4:  Images  taken  of  a  scene  using  the  three  sensor
modalities  the  system  employs  as  input.  From  left  to  right
they  are  a  video  image,  a  laser  range  finder  image  and  a  laser
reflectance  image.  Obstacles  like  trees  appear  as  discontinuities
in  laser  range  images.  The  road  and  the  grass  reflect
different  amounts  of  laser  light,  making  them  distinguishable
in  laser  reflectance  images.


4  Symbolic  Knowledge  and  Reasoning

Despite  the  variety  of  capabilities  exhibited  by  individual  driving
networks,  until  recently  the  system  has  been  far  from
truly  autonomous.  First,  the  one  driving  network  architecture
shown  in  Figure  1  was  capable  of  driving  only  on  the
type  of  road  on  which  it  was  trained.  If  the  road  character¬ 
istics  changed,  ALVINN  would  often  become  confused  and 
stray  from  the  road.  In  addition,  a  real  autonomous  system 
needs  to  be  capable  of  planning  and  traversing  a  route  to  a 
goal.  The  neural  network  driving  modules  are  good  at  re¬ 
active  tasks  such  as  road  following  and  obstacle  avoidance, 
but  the  networks  have  a  limited  capability  for  the  symbolic 
tasks  necessary  for  an  autonomous  mission.  The  system  of 
networks  cannot  decide  to  turn  left  at  an  intersection  in  order 
to  reach  a  goal.  After  making  a  turn  from  a  one  lane  road 
to  a  two  lane  road,  the  system  does  not  know  that  it  should 
stop  listening  to  one  network  and  start  listening  to  another. 
Just  as  a  human  needs  symbolic  reasoning  to  guide  reactive 
processes,  the  networks  need  a  source  of  symbolic  knowledge 
to  plan  and  execute  a  mission. 

Ideally,  the  symbolic  knowledge  source  would  reason  like 
a  person.  It  would  use  its  knowledge  of  the  world  to  plan 
a  sequence  of  observations  and  corresponding  actions  to  tra¬ 
verse  the  route.  For  instance,  to  achieve  the  goal  of  reaching  a 
friend’s  house,  the  mission  description  might  be  a  sequence
like,  “Drive  until  the  sign  for  Seneca  Road  is  seen,  and  turn 
left  at  that  intersection.  Then  drive  until  the  third  house  on 
the  left  is  seen,  and  stop  in  front  of  it.” 

In  this  ideal  system,  once  the  mission  is  planned,  the  sym¬ 
bolic  knowledge  source  would  rely  entirely  on  perception  to 
control  the  execution  of  the  mission.  In  other  words,  the  sym¬ 
bolic  resource  module  would  be  able  to  recognize  events  and 
use  what  it  sees  to  guide  the  interaction  of  the  networks.  The 
symbolic  resource  module  would  be  capable  of  reading  the 
street  sign  at  an  intersection  and  making  the  appropriate  turn 
to  continue  on  to  its  destination.  It  would  also  be  able  to  iden¬ 
tify  the  new  road  type  and  choose  the  appropriate  network  for 
driving  on  that  kind  of  road.  Unfortunately,  the  perception 
capabilities  required  by  such  a  module  are  beyond  the  current 
state  of  the  art. 

In  order  to  bridge  the  gap  between  mission  requirements 
and  perception  capabilities,  we  use  additional  geometric  and 
symbolic  information  stored  in  an  “annotated  map”.  An  an¬ 
notated  map  is  a  two  dimensional  data  structure  containing 
geometrical  information  about  the  area  to  be  traversed,  such 
as  the  locations  of  roads  and  landmarks.  In  addition,  each 
object  in  the  map  can  be  annotated  with  extra  information  to 
be  interpreted  by  the  clients  that  access  the  map.  For  example,
as  far  as  the  annotated  map  is  concerned,  a  mailbox  is  simply
a  two  dimensional  polygon  at  a  particular  location  with  some
extra  bits  associated  with  it.  The  “extra  bits”  might  represent
the  three  dimensional  shape  of  the  mailbox,  or  even  the
name  of  the  person  who  owns  it.  The  module  which  manages
the  annotated  map  does  not  interpret  this  extra  information, 
but  rather  provides  a  mechanism  for  client  modules  to  access 
the  annotations.  This  reduces  the  knowledge  bottleneck  that
can  develop  in  large,  completely  centralized  systems. 
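
A minimal sketch of this separation of concerns, assuming hypothetical class and field names: the map stores geometry plus an opaque byte string, and only the client decides what the bytes mean.

```python
from dataclasses import dataclass, field


@dataclass
class MapObject:
    name: str
    polygon: list            # (x, y) vertices in map coordinates
    annotation: bytes = b""  # "extra bits"; never interpreted by the map


@dataclass
class AnnotatedMap:
    objects: dict = field(default_factory=dict)

    def add(self, obj):
        self.objects[obj.name] = obj

    def annotation_of(self, name):
        """Hand back the raw annotation; interpreting it is the client's job."""
        return self.objects[name].annotation


def owner_from_annotation(raw):
    """One hypothetical client's interpretation: the bytes are an owner name."""
    return raw.decode("utf-8")


amap = AnnotatedMap()
amap.add(MapObject("mailbox-1",
                   [(0.0, 0.0), (0.3, 0.0), (0.3, 0.3), (0.0, 0.3)],
                   b"example owner"))
owner = owner_from_annotation(amap.annotation_of("mailbox-1"))
```

Another client could store and decode a shape description in the same field without any change to the map code, which is the point of keeping the annotation opaque.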

The  annotated  map  is  not  just  a  passive  geometric  database,
but  instead  is  an  active  part  of  our  system.  Besides  having  a 
2D  representation  of  the  physical  objects  in  a  region,  anno¬ 
tated  maps  can  contain  what  are  called  alarms.  Alarms  are 
conceptual  objects  in  the  map,  and  can  be  lines,  circles,  or 
regions.  Each  alarm  is  annotated  with  a  list  of  client  mod¬ 
ules  to  notify  and  the  information  to  send  to  each  when  the 
alarm  is  triggered.  When  the  annotated  map  manager  notices 
that  the  vehicle  is  crossing  an  alarm  on  the  map,  it  sends  the
information  to  the  pertinent  modules.  Once  again,  the  map 
manager  does  not  interpret  the  information:  that  is  up  to  the 
client  modules. 

Alarms  can  be  thought  of  as  positionally  based  production 
rules.  Instead  of  using  perception  based  production  rules  like,
“If  A  is  observed,  then  perform  action  B”,  an  annotated  map
based  system  has  rules  of  the  form,  “If  location  A  is  reached,
then  perform  action  B”.  Thus  we  reduce  the  problem  of  making
high  level  decisions  from  the  difficult  task  of  perceiving
and  reacting  to  external  events  to  the  relatively  simple  task  of 
monitoring  and  updating  the  vehicle’s  position. 
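
The "if location A is reached, then perform action B" rule can be sketched for the simplest alarm shape, a line across the road. The class names, the callback-based notification, and the payload format are assumptions for illustration; circles and regions are not handled here.

```python
def _orient(p, q, r):
    """Sign of the cross product (q - p) x (r - p)."""
    v = (q[0] - p[0]) * (r[1] - p[1]) - (q[1] - p[1]) * (r[0] - p[0])
    return (v > 0) - (v < 0)


def crosses(p1, p2, a, b):
    """True if the motion segment p1->p2 properly crosses segment a-b."""
    return (_orient(p1, p2, a) * _orient(p1, p2, b) < 0 and
            _orient(a, b, p1) * _orient(a, b, p2) < 0)


class AlarmLine:
    def __init__(self, a, b, notifications):
        self.a, self.b = a, b
        # List of (client_callback, payload); payloads stay uninterpreted.
        self.notifications = notifications


class MapManager:
    def __init__(self):
        self.alarms = []
        self.last_position = None

    def update_position(self, position):
        """Fire every alarm whose line was crossed since the last update."""
        if self.last_position is not None:
            for alarm in self.alarms:
                if crosses(self.last_position, position, alarm.a, alarm.b):
                    for client, payload in alarm.notifications:
                        client(payload)
        self.last_position = position


received = []
manager = MapManager()
manager.alarms.append(
    AlarmLine((-5.0, 10.0), (5.0, 10.0), [(received.append, b"slow-down")]))
for pos in [(0.0, 0.0), (0.0, 8.0), (0.0, 12.0), (0.0, 20.0)]:
    manager.update_position(pos)
```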

The  first  step  in  building  an  annotated  map  is  collecting 
geometric  information  about  the  environment.  We  build  our 
maps  by  driving  the  vehicle  over  roads  and  linking  the  road 
segments  together  at  intersections.  At  the  same  time,  a  laser 
range  finder  is  used  to  record  the  positions  of  landmarks  such 
as  mailboxes  and  telephone  poles.  Planning  a  particular  mission
requires  adding  specific  instructions  to  the  map  in  the
form  of  “trigger  annotations”.  This  is  currently  a  process  per¬ 
formed  by  the  person  planning  the  mission.  For  example,  the 
human  expert  knows  that  when  approaching  an  intersection, 
the  vehicle  should  slow  down,  so  the  expert  chooses  the  ap¬ 
propriate  location  to  put  the  trigger  line.  The  trigger  line  goes 
across  the  road  at  that  point,  and  is  annotated  with  a  string 
of  bits  that  represents  the  new  speed  of  the  vehicle.  During 
the  run,  when  the  vehicle  crosses  the  trigger  line,  the  map
manager  sends  the  string  of  bits  to  a  module  that  interprets 
the  information  and  slows  the  vehicle  to  the  desired  speed.  In 
the  current  system,  alarms  are  interpreted  as  commands,  but 
there  is  no  predefined  “correct”  way  for  a  module  to  react  to 
an  alarm.  Depending  on  its  content,  an  alarm  could  also  be 
interpreted  as  a  wakeup  call,  or  even  as  simply  advice. 

Figure  5:  The  components  of  the  annotated  map  system  and
the  interaction  between  them.  The  annotated  map  system
keeps  track  of  the  vehicle’s  position  on  a  map.  It  provides  the
arbitrator  with  symbolic  information  concerning  the  direction
to  steer  to  follow  the  preplanned  route  and  the  terrain  the
vehicle  is  currently  encountering.  The  neural  network  driving
modules  are  condensed  for  simplicity  into  a  single  block
labeled  perceptual  neural  networks.


Because  position  information  is  so  critical  to  an  annotated
map  system,  we  use  multiple  techniques  to  determine  the  vehicle’s
current  location.  We  use  an  Inertial  Navigation  System
(INS)  which  can  determine  the  vehicle's  location  with  an  error
of  approximately  1%  of  distance  traveled  [Amidi  &  Thorpe,
1990].  To  eliminate  positioning  error  that  accumulates  over
time  in  the  INS  data,  the  annotated  map  system  also  uses  information
from  perception  modules.  For  example,  since  the
driving  networks  presumably  keep  the  vehicle  on  the  road,
lateral  error  in  the  vehicle  positioning  system  relative  to  the
road  can  be  identified  and  eliminated.  In  addition,  a  module
using  the  laser  range  finder  compares  the  landmarks  it  sees  to
the  landmarks  collected  when  the  map  was  built,  and  triangulates
the  vehicle’s  position  on  the  map.  These  techniques
allow  perception  modules  to  provide  useful  positioning  information
without  requiring  them  to  explicitly  recognize  and
interpret  particular  objects  such  as  street  signs.  The  position
corrections  provided  by  perception  modules  are  interpreted  as
a  change  in  the  transform  between  the  location  that  the  INS
reports  and  the  real  vehicle  position  on  the  map.  A  separate
module,  called  the  navigator,  is  in  charge  of  maintaining  and
distributing  this  position  transform.
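
A minimal sketch of such a position transform, assuming it is a pure 2D translation (rotation is ignored) and using hypothetical class and method names. One method applies a full position fix, as a landmark-based module might supply; the other removes only the error component perpendicular to the road, on the assumption that the vehicle is on the road centerline.

```python
import math


class Navigator:
    """Keeps the offset that maps INS coordinates onto map coordinates."""

    def __init__(self):
        self.offset = (0.0, 0.0)

    def map_position(self, ins_position):
        return (ins_position[0] + self.offset[0],
                ins_position[1] + self.offset[1])

    def apply_correction(self, ins_position, observed_map_position):
        """Full fix: make the current INS reading map onto the observed
        position."""
        self.offset = (observed_map_position[0] - ins_position[0],
                       observed_map_position[1] - ins_position[1])

    def apply_lateral_correction(self, ins_position, road_point,
                                 road_heading_rad):
        """Remove only the error perpendicular to the road direction."""
        x, y = self.map_position(ins_position)
        nx, ny = -math.sin(road_heading_rad), math.cos(road_heading_rad)
        lateral = (x - road_point[0]) * nx + (y - road_point[1]) * ny
        self.offset = (self.offset[0] - lateral * nx,
                       self.offset[1] - lateral * ny)


nav = Navigator()
# Road runs along the x axis; the INS has drifted 0.8 m sideways.
nav.apply_lateral_correction((10.0, 0.8), (0.0, 0.0), 0.0)
after_lateral = nav.map_position((10.0, 0.8))
# A later full fix also removes 0.5 m of along-road error.
nav.apply_correction((20.0, 0.8), (19.5, 0.0))
after_fix = nav.map_position((20.0, 0.8))
```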

Annotated  maps  provide  the  system  with  the  symbolic  information
and  control  knowledge  necessary  for  a  fully  autonomous
mission.  Since  the  control  knowledge  is  geometrically
based,  and  since  planning  is  done  before  the  mission
starts,  runtime  control  comes  at  a  low  computational  cost.
Figure  5  shows  the  structure  and  interaction  of  the  annotated
map  system’s  components.  It  also  illustrates  the  annotated
map  system’s  interaction  with  the  other  parts  of  the  system,
including  the  perceptual  neural  networks  and  the  arbitrator
(discussed  below).  Figure  6  shows  a  map  and  annotations  for
a  mission  segment.

Figure  6:  A  section  of  a  map  created  and  maintained  by  the  annotated
map  system.  The  map  shows  the  vehicle  traversing  an
intersection  between  a  single-  and  a  two-lane  road.  The  lines
across  the  roads  are  alarms  which  are  triggered  when  crossed
by  the  vehicle.  Triggering  an  alarm  results  in  a  message  being
passed  from  the  map  manager  to  the  arbitrator  indicating
a  change  in  terrain  type.  The  circles  on  the  map  represent  the
positions  of  landmarks,  such  as  trees  and  mailboxes.  The  annotated
map  system  uses  the  locations  of  known  landmarks  to
correct  for  vehicle  positioning  errors  which  accumulate  over
time.


5  Rule-based  Driving  Module  Integration


We  use  the  symbolic  knowledge  provided  by  the  annotated
map  system  to  help  guide  the  interaction  of  the  reactive  driving
neural  networks.  Figure  7  shows  the  system  architecture
with  emphasis  on  the  neural  networks.  Whereas  Figure  5
subsumed  the  neural  network  systems  into  one  unit  labeled
“perceptual  neural  networks”,  Figure  7  subsumes  the  annotated
map  system  into  one  package.  In  this  diagram,  each
box  represents  a  separate  process  running  in  parallel.  Images
from  the  three  onboard  sensors  are  provided  to  the  five  driving
networks  shown  in  the  second  row  of  the  diagram.  The
driving  networks  propagate  activation  forward  through  their
weights,  with  each  determining  what  it  considers  to  be  the
correct  steering  direction.  These  steering  directions  are  sent
to  the  arbitrator,  which  has  the  job  of  deciding  which  network
to  attend  to  and  therefore  how  to  steer.


Figure  7:  The  integrated  ALVINN  architecture.  The  arbitrator
uses  the  terrain  information  provided  by  the  annotated  map
system  as  well  as  symbolic  models  of  the  driving  networks’
capabilities  and  priorities  to  determine  the  appropriate  module
for  controlling  the  vehicle  in  the  current  situation.

The  arbitrator  makes  use  of  both  the  geometric  and  control
information  provided  by  the  annotated  map  system  to  perform
a  mission  autonomously.  First,  the  route  following  module
within  the  annotated  map  system  uses  the  geometric  information
in  the  annotated  map  to  recommend  a  vehicle  steering
direction.  The  direction  recommended  by  the  route  follower
is  the  direction  it  thinks  the  vehicle  should  steer  in  order  to
follow  the  preplanned  route.  When  the  vehicle  is  driving
down  a  road,  the  route  follower  queries  the  annotated  map  for
the  position  of  the  road  ahead  of  the  vehicle.  The  route  follower
uses  this  geometric  information  to  generate  a  steering
direction.

The  annotated  map  system  also  provides  the  arbitrator  with 
information  about  the  current  driving  situation,  including  what 
type  of  road  the  vehicle  is  on,  and  whether  there  is  an  inter¬ 
section  or  dangerous  permanent  obstacle  ahead.  For  example, 
suppose  during  the  planning  phase  the  human  expert  notices 
that  at  a  particular  point  the  road  changes  from  one  lane  to 
two.  The  expert  would  set  a  trigger  line  at  the  corresponding 
point  on  the  map  and  annotate  it  with  a  message  that  will  tell 
the  arbitrator  to  stop  listening  to  the  one  lane  road  following 
network  and  start  listening  to  the  two  lane  road  following 
network.  When  the  alarm  is  triggered  during  the  run,  the  ar¬ 
bitrator  combines  the  advice  from  the  annotated  map  system 
with  the  steering  directions  of  the  neural  network  modules 
using  a  technique  called  relevancy  arbitration. 
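The trigger-line mechanism described above can be illustrated with a small sketch. It assumes, purely for illustration, that a trigger line is a 2D segment on the map and that its alarm fires when the vehicle's path between two successive position estimates crosses that segment. The names `TriggerLine`, `crossed` and `fired_messages` are hypothetical and are not taken from the annotated map system.

```python
from dataclasses import dataclass
from typing import List, Tuple

Point = Tuple[float, float]


@dataclass
class TriggerLine:
    """A map annotation: a line segment plus the message it carries (hypothetical layout)."""
    a: Point
    b: Point
    message: str


def _side(p: Point, q: Point, r: Point) -> float:
    """Signed area test: > 0 if r lies left of the directed line p->q, < 0 if right."""
    return (q[0] - p[0]) * (r[1] - p[1]) - (q[1] - p[1]) * (r[0] - p[0])


def crossed(line: TriggerLine, prev: Point, curr: Point) -> bool:
    """True if the motion prev->curr strictly crosses the trigger segment."""
    d1 = _side(line.a, line.b, prev)
    d2 = _side(line.a, line.b, curr)
    d3 = _side(prev, curr, line.a)
    d4 = _side(prev, curr, line.b)
    return d1 * d2 < 0 and d3 * d4 < 0


def fired_messages(lines: List[TriggerLine], prev: Point, curr: Point) -> List[str]:
    """Messages of all trigger lines crossed during one position update."""
    return [ln.message for ln in lines if crossed(ln, prev, curr)]
```

In such a scheme the arbitrator would receive the returned messages (for example, "switch to the two-lane network") only at the moment the line is crossed, so the annotation costs nothing while the vehicle is elsewhere on the map.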

Relevancy arbitration is a straightforward idea. If the annotated
map system indicates the vehicle is on a two-lane road,
the arbitrator will steer in the direction dictated by the two-lane
road driving network, since it is the relevant module for
the current situation. If the annotated map system indicates
the vehicle is approaching an intersection, the arbitrator will
choose to steer in the direction dictated by the annotated map
system, since it is the module that knows which way to go in
order to head towards the destination. In short, the arbitrator
combines symbolic knowledge of driving module capabilities
with knowledge of the present terrain to determine the
relevant module for the current circumstances.

The  relevancy  of  a  module  need  not  be  based  solely  on 
the  current  terrain  information  provided  by  the  annotated  map 
system.  Instead,  the  arbitrator  also  employs  rules  for  deter¬ 
mining  a  module's  relevancy  from  the  content  of  the  module’s 
message.  The  obstacle  avoidance  network  has  one  such  rule 
associated  with  it.  The  obstacle  avoidance  network  is  trained 
to  steer  straight  when  the  terrain  ahead  is  clear  and  to  swerve 
to  prevent  collisions  when  confronted  with  obstacles.  The 
arbitrator  gives  low  relevancy  to  the  obstacle  avoidance  net¬ 
work  when  it  suggests  a  straight  steering  direction,  since  the 
arbitrator realizes it is not an applicable knowledge source
in  this  situation.  But  when  it  suggests  a  sharp  turn,  indi¬ 
cating  there  is  an  obstacle  in  the  vehicle’s  path,  the  urgency 
of  avoiding  a  collision  takes  precedence  over  other  possible 
actions,  and  the  steering  direction  is  determined  by  the  obsta¬ 
cle  avoidance  network.  This  priority  arbitration  is  similar  in 
many  ways  to  the  subsumption  architecture  [Brooks,  1986], 
although the most common interaction between behaviors in
Brooks’  systems  is  for  higher  level  behaviors  to  override  less 
sophisticated,  instinctual  ones. 
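The two kinds of relevancy rule described above (rules keyed on the map's situation label, and rules keyed on the content of a module's own message) can be sketched as follows. This is a minimal illustration under stated assumptions, not the system's implementation: the module names, the situation labels, the signed steering convention and the `SHARP_TURN` threshold are all hypothetical.

```python
from dataclasses import dataclass
from typing import Dict


@dataclass
class Advice:
    module: str      # name of the module whose advice is followed
    steering: float  # steering direction: negative = left, positive = right


# Hypothetical threshold: how sharp an obstacle-avoidance turn must be
# before it is treated as urgent. Not a value from the paper.
SHARP_TURN = 0.5


def arbitrate(situation: str, advice: Dict[str, float]) -> Advice:
    """Pick the relevant module for the current situation.

    situation: label supplied by the annotated map, e.g. "one-lane",
               "two-lane" or "intersection" (illustrative labels).
    advice:    mapping from module name to its proposed steering direction.
    """
    # Content-based rule: a sharp turn from the obstacle avoidance network
    # signals an obstacle and takes precedence over every other module.
    obstacle = advice.get("obstacle-avoidance")
    if obstacle is not None and abs(obstacle) >= SHARP_TURN:
        return Advice("obstacle-avoidance", obstacle)

    # Map-based rules: the situation label determines the relevant module.
    if situation == "intersection":
        chosen = "route-follower"
    else:
        chosen = situation + "-network"
    return Advice(chosen, advice[chosen])
```

A straight-ahead suggestion from the obstacle avoidance network falls below the threshold and is simply ignored, which mirrors the low relevancy the arbitrator assigns to that network when the terrain is clear.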

By combining map-related knowledge about the current
driving situation with knowledge about abilities and priorities
of individual driving modules, the integrated architecture
provides the system with capabilities that far exceed those of
individual driving modules alone. Using this architecture, the
system has successfully followed a 1/2 mile path through a
suburban neighborhood from one specific house to another. In
navigating the route, the system was required to drive through
three intersections onto three different roads while swerving
to avoid parked cars along the way. At the end, the vehicle
came to rest one meter from its destination.

6  Analysis  and  Discussion 

Rule-based  integration  of  multiple  expert  networks  has  sig¬ 
nificant  advantages  over  previously  developed  connectionist 
arbitration  schemes.  One  such  advantage  is  the  ease  of  adding 
new  modules  to  the  system.  Using  rule-based  arbitration,  the 
new  module  can  be  trained  in  isolation  to  become  an  expert 
in a new domain, and then integrated by writing rules for
the  arbitrator  which  specify  the  new  module’s  area  of  exper¬ 
tise  and  its  priority.  This  is  in  contrast  to  other  connectionist 
expert  integration  techniques,  such  as  the  task  decomposition 
architecture [Jacobs et al., 1990], connectionist glue [Waibel,
1989]  and  the  meta-pi  architecture  [Hampshire  and  Waibel, 
1989].  To  combine  experts  using  these  techniques  requires  the 
training  of  additional  neural  network  structures,  either  simul¬ 
taneously  with  the  training  of  the  experts  in  the  case  of  the  task 
decomposition  architecture,  or  after  expert  training  in  the  case 
of  the  connectionist  glue  and  meta-pi  architectures.  Adding 
a  new  expert  using  these  techniques  requires  retraining  the 
entire  integrating  structure  from  scratch,  which  involves  pre¬ 
senting  the  system  patterns  from  each  of  the  experts’  domains, 
not just the new one. This large scale retraining is particularly
difficult in a task like autonomous navigation because
it  requires  either  driving  over  all  the  experts’  domains  again, 
or  storing  a  large  number  of  domain-specific  images  for  later 
reuse. 

Another  significant  advantage  of  rule-based  arbitration  is 
the  ease  with  which  non-neural  network  knowledge  sources 
can  be  integrated  into  the  system.  Symbolic  tasks  such  as 
planning  and  reasoning  about  a  map  are  currently  difficult  to 
implement using neural networks. In the future, it should be
possible  to  implement  more  and  more  symbolic  processing 
using  connectionist  techniques,  but  until  then,  rule-based  ar¬ 
bitration  provides  a  way  of  bridging  the  gap  between  neural 
networks and traditional AI systems.

The technique is not without shortcomings, however. The
current implementation relies too heavily on the accuracy of
the annotated map system, particularly for negotiating intersections.
The question might be asked, why is the mapping
system required for intersection traversal in the first place?
Why can’t the driving networks handle intersections? The
answer is that when approaching an intersection, an individual
driving network will often provide ambiguous steering
commands, since there are multiple possible roads to follow.
If  left  on  its  own,  a  road-following  network  will  often  alter¬ 
nately  steer  towards  one  or  the  other  road  choices,  causing  the 
vehicle  to  oscillate  and  eventually  drive  off  the  road.  In  ad¬ 
dition,  even  if  the  network  could  learn  to  definitively  choose 
one  of  the  branches  to  follow,  it  still  wouldn’t  know  which 
is the appropriate branch to choose in order to head toward
the  destination.  In  short,  the  mapping  modules  can  be  viewed 
both  as  a  useful  source  of  high  level  symbolic  knowledge, 
and  as  an  interim  solution  to  the  difficult  perceptual  task  of 
intersection  navigation. 

The  annotated  map  system  as  currently  implemented  is 
not  a  perfect  solution  to  the  problem  of  high  level  guidance 
because  it  requires  both  detailed  knowledge  of  the  route,  and 
an  accurate  idea  of  the  current  vehicle  position.  In  certain 
controlled  circumstances,  such  as  rural  mail  delivery,  the  same 
route is followed repeatedly, making an accurate map of the
domain feasible. However, a system capable of following
less  precise  directions,  like  “go  about  a  half  mile  and  turn 
left  on  Seneca  Road”,  is  clearly  desirable.  Such  a  system 
would  require  more  reliance  on  observations  from  perception 
modules  and  less  reliance  on  knowledge  of  the  vehicle’s  exact 
position  when  making  high  level  decisions. 

Conceptually, this shift towards reliance on perception for
high  level  guidance  could  be  done  in  two  ways.  First,  observa¬ 
tions  of  objects  like  the  Seneca  Road  street  sign,  could  be  used 
to update the vehicle’s position on the map. In fact, position
updates  based  on  perceptual  observations  are  currently  em¬ 
ployed  by  the  annotated  map  system  when  it  triangulates  the 
vehicle’s  location  based  on  the  positions  of  known  landmarks 
in  laser  range  images.  But  position  updates  are  only  helpful 
when  the  observations  are  location  specific.  For  observations 
of  objects  like  stop  lights,  or  arbitrarily  located  objects  like 
“road  construction  ahead”  signs,  the  system’s  response  should 
be  independent  of  the  vehicle’s  location. 

These location independent observations could be modeled
as positionless alarms in the annotated map. When a perception
module sees an object like a “road construction ahead”
sign, it would notify the map manager. The map manager
would treat the sighting as an alarm, distributing the information
associated with the alarm to the pertinent modules.
Perception triggered alarms would allow the system to transition
between its current perceptual abilities and future, more
advanced capabilities.

Although the system is not yet capable of identifying and
reading individual signs, we have had preliminary success in
using neural network perceptual observations to help guide
high level reasoning. The technique relies on the fact that
when the vehicle reaches an intersection, the output of the
driving network becomes ambiguous. This ambiguity manifests
itself as an output vector with more than one active steering
direction corresponding to the multiple possible branches
to follow. This output ambiguity has been successfully employed
to update the vehicle’s position and to follow coarse
directions. As the vehicle approaches an intersection, the annotated
map system signals the arbitrator that an intersection
is coming up and that the vehicle should follow the right-hand
branch in order to head towards the goal. This level of detail
does not require either a highly accurate map or precise
knowledge of the vehicle’s current position. The arbitrator
takes the annotated map system’s message as a signal to watch
the output of the current driving network carefully. When the
driving network’s output becomes ambiguous, the arbitrator
signals the annotated map system that the vehicle has reached
the intersection and to update the vehicle’s position accordingly.
The arbitrator also uses the “turn right” portion of the
annotated map system’s message in order to choose the correct
steering direction from the driving network’s ambiguous
output vector. This closer interaction between the perception
networks and the annotated map allows the system to use perception
for intersection traversal, instead of relying solely on
knowledge from the map for guidance.
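The ambiguity test described above can be sketched in a few lines. The sketch assumes that the driving network's output is a list of activations, one per steering direction, ordered from left to right, and that a direction counts as "active" when its activation reaches a threshold. The threshold value and the function names are assumptions made for illustration only.

```python
def active_directions(output, threshold=0.5):
    """Return one unit index per contiguous run of units at or above `threshold`.

    Each run is represented by the index of its strongest unit.
    """
    peaks, run = [], []
    for i, value in enumerate(output):
        if value >= threshold:
            run.append(i)
        elif run:
            peaks.append(max(run, key=lambda j: output[j]))
            run = []
    if run:
        peaks.append(max(run, key=lambda j: output[j]))
    return peaks


def is_ambiguous(output, threshold=0.5):
    """More than one active steering direction suggests an intersection."""
    return len(active_directions(output, threshold)) > 1


def choose_branch(output, turn, threshold=0.5):
    """Select the active direction that matches the map's coarse advice.

    turn: "left" or "right", as carried by the annotated map's message.
    """
    peaks = active_directions(output, threshold)
    if not peaks:
        raise ValueError("no active steering direction")
    return peaks[-1] if turn == "right" else peaks[0]
```

With this structure the map only has to supply "an intersection is near" and "take the right-hand branch"; the moment at which `is_ambiguous` first becomes true plays the role of the position update.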

Another shortcoming of rule-based arbitration as currently
implemented is its binary nature. Currently, a module is deemed
by the annotated map system as either appropriate or inappropriate
for the current road type. This binary decision does
not address the question of intelligently combining modules
trained for the same domain, such as the video-based single-lane
driving network and the laser reflectance-based single-lane
driving network. There are obviously some situations,
such as night driving, when one network is better suited than
the other. To take more subtle circumstances into account
when weighting the steering directions dictated by multiple
networks, we are developing augmented arbitration rules that
consider more context than just the current road type. We
are also currently working on connectionist techniques that
can determine a network’s reliability directly from its output
alone. Preliminary results in this area look very promising.

One  final  drawback  of  the  current  system  is  the  need  for  a 
human  expert  to  preplan  the  mission  by  providing  map  anno¬ 
tations.  In  the  future,  we  will  replace  the  human  expert  with 
an  expert  system  capable  of  annotating  the  map  appropriately. 
We  understand  the  techniques  the  human  expert  uses  to  find 
the  shortest  route  and  to  annotate  the  map,  so  automating  the 
process  should  not  be  difficult. 

In  conclusion,  a  modular  architecture  permits  rapid  de¬ 
velopment  of  expert  neural  networks  for  complex  domains 
like  autonomous  navigation.  Rule-based  arbitration  is  a  sim¬ 
ple  and  efficient  method  for  combining  these  experts  when 
symbolic knowledge is available for reasoning about their appropriateness.
Rule-based arbitration also permits the combination
of neural network experts with non-neural network
processing techniques such as planning, which are difficult to
integrate using other arbitration schemes.
Abstract

A new model for vehicle-type motion is proposed,
which assumes that the motion is a rotation around an axis
through the vehicle center followed by a forward translation
along the main axis of the vehicle. The contribution of this
paper is threefold: (1) When the rotation and the amplitude of
translation are constant, this type of motion is shown to be
equivalent to a constant camera-centered motion. This indicates
that a constant motion in the conventional camera-centered
model, which is commonly considered artificial, can in fact be a
reasonable model in real life. (2) We show that a constant
vehicle-type motion can be interpreted as a constant screw
motion. (3) A linear algorithm for estimating constant vehicle-type
motion is presented and experiments using real scene
images are included.

I.  Introduction 

An important class of rigid object motion is that of
vehicles such as cars, ships and aircraft. We shall call such
motion vehicle-type motion. Using conventional camera-centered
motion to model vehicle-type motion is not natural for
many applications. Recently, several models for three-dimensional
(3D) motion have been suggested, among which
are those of Shariat and Price [1], and Weng, Huang and Ahuja
[2]. In the former, a motion is modeled as that of a rolling
wheel. In the latter, a 3D motion is described as a cone rolling
without slip around another fixed cone. These models are not
particularly suitable for vehicle-type motion estimation.

In this paper, a new model for vehicle-type motion is
proposed. This model represents a 3D motion as a rotation
around an axis through the vehicle center followed by a forward
translation. Several advantages arise with this model.
First, this model is much closer to real vehicle motion than
existing motion models, especially when the motion is small.
Second, in the case of constant motion, for which the rotation
and amplitude of translation are assumed invariant, as we shall
see later, a vehicle-type motion is equivalent to a constant
camera-centered motion. This gives rise to several benefits: (i)
constant camera-centered motion, which has been used by many
researchers, is commonly thought to be artificial, i.e., few
cases of object motion in real life could fit the assumption. The
research of this paper, however, discovers that constant
camera-centered motion is in fact a good model for vehicle
motion; (ii) since a constant vehicle-type motion can be converted
into a constant camera-centered motion, it can be
estimated by many existing two-view algorithms of motion
estimation, e.g., Zhuang, Huang and Haralick [3]. Furthermore,
we shall show that a constant vehicle motion can be interpreted
as a constant screw motion. Finally, for constant vehicle-type
motion, a long image sequence can be used to combat noise.

The motivation behind this research is to discover a
proper  model  for  vehicle  motion  estimation  using  long  image 
sequences.  Since  a  general  vehicle  motion,  if  smooth,  can  be 
approximated  by  a  piecewise  constant  motion,  the  constant 
vehicle-type  motion  model  has  wide  applications. 

II. Model for Vehicle-type Motion

In most existing techniques of motion estimation, a 3D
motion is usually modeled as a camera-centered motion, i.e.,
the camera is assumed to be fixed at the origin of the world
coordinate system with optical axis of the camera coinciding
with the z-direction (or negative z-direction), and the motion is
represented by a rotation around an axis through the camera
center followed by a translation. Assume that $\vec{p}_i$ is a point on a
3D object at time instant $t_i$; $\vec{p}_{i+1}$ the same point at time instant
$t_{i+1}$; $R_i'$ and $\vec{T}_i'$ the rotation and translation, respectively, in
this model from time instant $t_i$ to $t_{i+1}$. The motion is described
by

$$\vec{p}_{i+1} = R_i'\,\vec{p}_i + \vec{T}_i' \qquad (1)$$

Although  it  is  mathematically  convenient,  this  model  appears  to 
lack  practical  meaning  since  there  are  few  cases  in  real  life 
where  an  object  moves  in  such  manner.  Especially,  in  the  case 
of  constant  motion,  the  camera-centered  model  is  commonly 
thought  to  be  artificial. 

1.  Vehicle-type  motion 

Suppose there is a vehicle moving in the world coordinate
system with the camera center at the origin. Consider a
3D point on a moving vehicle at position $\vec{p}_i$ at time instant $t_i$
and at position $\vec{p}_{i+1}$ at time instant $t_{i+1}$. Let $R_i$ be the rotation of
the point around an axis through the object center $\vec{c}_i$ at time
instant $t_i$, and $\vec{T}_i$ be the translation along the main axis of the object
from time instant $t_i$ to $t_{i+1}$. See Figure 1. Then the motion of
this point in the model of vehicle-type motion is described as

$$\vec{p}_{i+1} = R_i(\vec{p}_i - \vec{c}_i) + \vec{T}_i + \vec{c}_i, \quad i = 0, 1, \ldots, N-1 \qquad (2)$$

with the constraints

$$\vec{T}_i = k_i R_i \vec{T}_{i-1}, \quad i = 1, 2, \ldots, N-1 \qquad (3)$$

$$\vec{c}_i = \vec{c}_{i-1} + \vec{T}_{i-1}, \quad i = 1, 2, \ldots, N \qquad (4)$$

where $k_i$ is a scale parameter for translation amplitude. Note
that in this model, the center position of an object is determined
only by the preceding center position and the translation, but
not the rotation.

Figure 1. A model of vehicle-type motion

2. Converting vehicle-type motion to
camera-centered motion

In the motion described by Eq. (2), the rotations and the
amplitudes of translations for different time instant pairs may
have different values. We now consider the special case of constant
motion, where the rotations and the amplitudes of translations
are assumed constant. This implies that $R_i = R$ and $k_i = 1$
for all i, and the formulas for a vehicle motion become

$$\vec{p}_{i+1} = R(\vec{p}_i - \vec{c}_i) + \vec{T}_i + \vec{c}_i, \quad i = 0, 1, \ldots, N-1 \qquad (5)$$

$$\vec{T}_i = R\,\vec{T}_{i-1}, \quad i = 1, 2, \ldots, N-1 \qquad (6)$$

$$\vec{c}_i = \vec{c}_{i-1} + \vec{T}_{i-1}, \quad i = 1, 2, \ldots, N \qquad (7)$$

Substituting Eqs. (6) and (7) into Eq. (5), one obtains

$$\vec{p}_{i+1} = R(\vec{p}_i - \vec{c}_{i-1} - \vec{T}_{i-1}) + R\,\vec{T}_{i-1} + \vec{c}_{i-1} + \vec{T}_{i-1}$$

$$\phantom{\vec{p}_{i+1}} = R(\vec{p}_i - \vec{c}_{i-1}) + \vec{T}_{i-1} + \vec{c}_{i-1}.$$

By recursively substituting, the motion equation can finally be
simplified to

$$\vec{p}_{i+1} = R(\vec{p}_i - \vec{c}_0) + \vec{T}_0 + \vec{c}_0 \qquad (8)$$

Noticing that $\vec{T}_0$ and $\vec{c}_0$ remain constant, one can recognize that
the motion of Eq. (8) is actually equivalent to that of a
camera-centered model. To make it clear, let

$$\vec{T}' = (I - R)\,\vec{c}_0 + \vec{T}_0 \qquad (9)$$

where I is a 3 x 3 unit matrix. Equation (8) can be then written
as

$$\vec{p}_{i+1} = R\,\vec{p}_i + \vec{T}', \quad i = 0, 1, \ldots, N-1 \qquad (10)$$

Equation (10) has the same form as Eq. (1). It describes
a motion as a rotation R around an axis through the camera
center followed by a translation $\vec{T}'$. There are many existing
techniques that may be used to solve this problem, e.g., Ref.
[3]. If every image point is visible in all frames, the minimum
numbers of points as a function of the number of frames are
listed in Table 1.

Table 1. Minimum number of image points
as a function of the number of frames

number of frames     minimum number of points
2                    8
3                    4
4                    3
5, 6, 7, 8           2
≥ 9                  1

3. Determining motion model parameters

After having calculated R and $\vec{T}'$, we proceed to determine
the motion model parameters in Eq. (5). Since the real
object center is very hard to estimate from images, we take as
object center the centroid of the 3D points whose images are
used for motion estimation. The 3D structure of image points
can be estimated from the coordinates of the image points and
the motion parameters R and $\vec{T}'$ found in the preceding section.
Using the formulas in Ref. [3], the 3D structure of the image
points are


where $(x, y, z)^T$ is the 3D coordinates of the image point
$(X, Y, 1)^T$; R and $\vec{T}'$ are the rotation matrix and the translation
vector, respectively, subject to $\|\vec{T}'\| = 1$, described in the
camera centered model. Once the coordinates of the 3D points
on the vehicle are estimated, the position of the object center of
the vehicle can be computed. Let the object center at time
instant i be $\vec{s}_i$. The rotation center of the vehicle is determined
by minimizing the distances of the rotation centers and
object centers at each instant subject to the constant motion
constraints of Eqs. (6) and (7). Let $\vec{c}_i$ be the rotation center of
the vehicle at time instant i; then it can be estimated by

$$\sum_{i=0}^{N} \|\vec{s}_i - \vec{c}_i\|^2 = \min \qquad (12)$$

subject to

$$\|\vec{c}_{i+1} - (I + R)\,\vec{c}_i + R\,\vec{c}_{i-1}\| = 0, \quad i = 1, 2, \ldots, N-1$$

After the rotation centers $\vec{c}_i$ are computed, the translation
vectors of the object centers at each time instant are easily
determined.

$$\vec{T}_0 = \vec{T}' + (R - I)\,\vec{c}_0 \qquad (13)$$

$$\vec{T}_i = R\,\vec{T}_{i-1}, \quad i = 1, 2, \ldots, N-1 \qquad (14)$$

Thus, the motion is completely described by the rotation
R, the rotation centers $\vec{c}_i$, and translation vectors $\vec{T}_i$ in the model
of object centered-motion.
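The equivalence between the constant vehicle-type model of Eqs. (5)-(7) and the camera-centered form of Eq. (10), together with the recovery of the first translation by Eq. (13), can be checked numerically. The sketch below is only an illustration: the rotation (about the vertical y-axis), the starting point, the initial center and the initial translation are arbitrary test values rather than data from the experiment, and the helper names are hypothetical.

```python
import math


def mat_vec(m, v):
    """Product of a 3x3 matrix (list of rows) with a 3-vector."""
    return [sum(m[r][k] * v[k] for k in range(3)) for r in range(3)]


def add(u, v):
    return [a + b for a, b in zip(u, v)]


def sub(u, v):
    return [a - b for a, b in zip(u, v)]


def rot_y(deg):
    """Rotation about the vertical (y) axis by `deg` degrees."""
    c, s = math.cos(math.radians(deg)), math.sin(math.radians(deg))
    return [[c, 0.0, s], [0.0, 1.0, 0.0], [-s, 0.0, c]]


def vehicle_track(p0, c0, t0, rot, steps):
    """Iterate Eqs. (5)-(7): rotate about the moving center, then translate."""
    p, c, t = p0, c0, t0
    track = [p]
    for _ in range(steps):
        p = add(add(mat_vec(rot, sub(p, c)), t), c)  # Eq. (5)
        c = add(c, t)                                # Eq. (7), with the current T
        t = mat_vec(rot, t)                          # Eq. (6)
        track.append(p)
    return track


def camera_translation(c0, t0, rot):
    """Eq. (9): T' = (I - R) c0 + T0."""
    return add(sub(c0, mat_vec(rot, c0)), t0)


def camera_track(p0, t_prime, rot, steps):
    """Iterate Eq. (10): p_{i+1} = R p_i + T'."""
    p = p0
    track = [p]
    for _ in range(steps):
        p = add(mat_vec(rot, p), t_prime)
        track.append(p)
    return track


def recover_t0(t_prime, c0, rot):
    """Eq. (13): T0 = T' + (R - I) c0."""
    return add(t_prime, sub(mat_vec(rot, c0), c0))
```

Running both iterations from the same starting point gives identical trajectories up to rounding, which is the content of Eq. (8); it also makes visible that the single constant vector T' absorbs both the initial center and the initial translation.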

III. Interpreting Vehicle-Type Motion
by Screw Decomposition

A 3D motion may have various interpretations depending
on the constraints on the descriptions of rotation or translation,
e.g., the choice of a rotation center. In this section, we show
that a constant vehicle-type motion can be interpreted as a constant
screw motion if the rotation centers at all time instants are
constrained to lie on a straight line called the screw axis. The
model of a screw motion has intuitive appeal. In particular, the
vehicle-type motion of a ground vehicle can be considered as
motion along a circle around a fixed point. The parameters
describing a screw motion are the screw axis, the rotation
radius, the rotation angle and the translation vector along the
screw axis.

If the screw axis is written in parameter form

$$\vec{x} = \vec{x}_0 + t\,\vec{n} \qquad (15)$$

where $\vec{x}$ is a general 3D point on the screw axis; $\vec{n}$ is the direction
of the screw axis determined from R; t is a parameter; $\vec{x}_0$
is a point the screw axis passes through, which can be computed by

r  -0 

Once the rotation axis is found, the screw radius, which
is defined as the radius of the screw cylinder, can be estimated
relatively to within a scale factor, just like the translations.

£*  I  detlf?  I I  '  (17) 

"  /•o 

where  0  >>  [i,  SPf,  and  i,  y  and  f  are  the  unit  vectors  in  x-, 
y-, and z-directions of the 3D coordinate system. Also, the rotation
angle between any two adjacent frames can be easily computed
from the rotation matrix.
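As an illustration of how the rotation angle and the translation along the screw axis follow from R and the camera-centered translation, the sketch below uses two standard identities: the angle satisfies cos(theta) = (trace(R) - 1) / 2, and the axis direction is proportional to the skew-symmetric part of R. It is a sketch under assumptions, not necessarily the computation used in the paper. The function names are hypothetical, and the test values are the rotation axis, angle and translation as read from Table 3 of the experiment, which is partly garbled in this copy.

```python
import math


def rodrigues(axis, angle_deg):
    """Rotation matrix for a rotation of `angle_deg` degrees about `axis`."""
    norm = math.sqrt(sum(a * a for a in axis))
    nx, ny, nz = (a / norm for a in axis)
    th = math.radians(angle_deg)
    c, s, v = math.cos(th), math.sin(th), 1.0 - math.cos(th)
    return [
        [c + v * nx * nx, v * nx * ny - s * nz, v * nx * nz + s * ny],
        [v * ny * nx + s * nz, c + v * ny * ny, v * ny * nz - s * nx],
        [v * nz * nx - s * ny, v * nz * ny + s * nx, c + v * nz * nz],
    ]


def rotation_angle_deg(rot):
    """Rotation angle from the trace: cos(theta) = (trace(R) - 1) / 2."""
    trace = rot[0][0] + rot[1][1] + rot[2][2]
    cos_th = max(-1.0, min(1.0, (trace - 1.0) / 2.0))
    return math.degrees(math.acos(cos_th))


def rotation_axis(rot):
    """Unit axis from the skew-symmetric part of R (valid for 0 < angle < 180)."""
    x = rot[2][1] - rot[1][2]
    y = rot[0][2] - rot[2][0]
    z = rot[1][0] - rot[0][1]
    norm = math.sqrt(x * x + y * y + z * z)
    return [x / norm, y / norm, z / norm]


def screw_pitch(rot, translation):
    """Signed component of the translation along the rotation axis."""
    n = rotation_axis(rot)
    return sum(a * b for a, b in zip(n, translation))


# Values as read from Table 3 (ground coordinate system).
axis = (-0.32761, 0.93642, 0.12548)
t_prime = (0.92369, 0.09288, -0.37171)
R = rodrigues(axis, 7.06906)
print(round(rotation_angle_deg(R), 5), round(abs(screw_pitch(R, t_prime)), 4))
```

With these inputs the magnitude of the axial component comes out close to the "translation along screw axis" entry of Table 6, which suggests, without proving, that the table entry is this projection.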

IV. Experiment Results

The images used in the experiment were of an outdoor
scene containing a moving ground vehicle in five image frames
(L13 - L17). Several target points had been marked on the vehicle;
four of them were visible in those specific frames. The 2D
coordinates of these target points on the image plane were used
as input data for motion estimation. The details of the image
data can be found in Ref. [3].

The world coordinate system used in Ref. [3] is a right-handed
system fixed on the ground with its x-axis and z-axis on the
horizontal ground plane and y-axis vertically upward. The camera
is mounted on the ground such that the optical axis of the
camera is in the negative z-direction of the ground coordinate
system. The camera coordinate system, which has its origin at
the camera center with the x- and y-axes parallel to those of the
image coordinates, z-axis to the negative optical axis, relates to
the world coordinate system by a rotation matrix M [3] followed
by a translation. The camera used to take the images is a
fixed focus AMJ / Bronica SQ-AM metric 700 mm camera
with a lens of 40 mm nominal focal length. The image size is
55.6 x 55.6 mm². The target point positions on the image planes
are measured by a one micrometer mono-comparator. The coordinates
of these target points can be found in Appendix B of
Ref. [3].

Figure 2. A typical frame of intensity image

The image target points obtained by the mono-comparator
have to be processed before they can be used for
motion estimation. The major stages of the processing include:
(1) lens distortion correction, (2) camera alignment. In the
experiment, the positions and orientations of the camera are
accurately measured, and then used to align the camera. By
the linear algorithm of Zhuang et al., the results of motion estimation
for the image of Figure 2 are listed in Tables 2-6. Table
7 gives a comparison between computed and measured distances
between target points. The motion trajectories of the
vehicle for the three different motion models are also shown in
Figure 3.

Table 2. Estimated motion using camera-centered model

rotation axis (n1, n2, n3)          [illegible]   0.94163   [illegible]
rotation angle (degree)             7.06906
translation (x, y, z)               0.97218   0.07776   -0.22095

Table 3. The motion parameters measured in
camera-centered ground coordinate system

rotation axis (n1, n2, n3)          -0.32761   0.93642   0.12548
rotation angle (degree)             7.06906
translation (x, y, z)               0.92369   0.09288   -0.37171

Table 4. Structural object centers at each time instant

time instant   x             y             z
13             -1.38055      [illegible]   [illegible]
14             -1.55930      -0.17120      [illegible]
15             -1.72131      [illegible]   [illegible]
16             -1.86464      -0.19004      [illegible]
17             [illegible]   [illegible]   [illegible]

Table 5. Estimated motion using object-centered model

time       rotation axis                  rotation angle   translation vector
interval   n1        n2       n3         (degree)         x             y             z
13-14      -0.3277   0.9364   0.1255     7.0691           [illegible]   [illegible]   0.1307
14-15      -0.3277   0.9364   0.1255     7.0691           -0.1620       [illegible]   0.1494
15-16      -0.3277   0.9364   0.1255     7.0691           -0.1433       [illegible]   [illegible]
16-17      -0.3277   0.9364   0.1255     7.0691           -0.1230       -0.0074       0.1794

Table 6. Estimated motion described as screw motion

screw axis direction (n1, n2, n3)       -0.32761   0.93642   0.12548
screw axis point (x, y, z)              -1.45794   -0.13349   -7.68458
radius                                  2.25946
rotation angle (in degree)              7.06906
translation along screw axis            0.26234

Table 7. The computed and the measured distances
between target points (measured in mm)

Rows: time instants 13 to 17. Columns: computed target distance and
measured target distance for the point pairs 21-20, 22-21 and 20-22.
[entries illegible]

Figure 3. Motion trajectories in three different models


V.  Conclusion 

A model for vehicle-type motion is proposed in this
paper, which may be applied to the motion of many man-made
vehicles. Under the constant motion assumption, we show that:

(1) A constant camera-centered motion is equivalent to a
constant vehicle-type motion. In the past, constant
camera-centered motion has commonly been considered
fictitious. This paper has shown that constant camera-centered
motion can be a good model for the motion of
vehicles.

(2) A constant vehicle-type motion is a constant screw
motion. This gives vehicle motion a very vivid interpretation.
When a moving object is a ground vehicle, the
motion trajectory is a circle.

(3) If a vehicle motion is constant or almost constant, this
model makes it possible to use long image sequences for
motion estimation and thus make the algorithm robust. A
linear algorithm based on existing two-view methods is
presented, and experimental results on a carefully calibrated
image sequence are included.

Although this paper describes mainly the case of constant
motion, the model proposed may be used to approximate a
long sequence of general vehicle motion piecewise if the
motion is smooth. We are currently investigating this, as well
as more complicated special cases of vehicle-type motion.
Abstract

In  this  paper  I  present  a  simple  stereo- 
based  proximity  detector  and  derive 
asymptotic  bounds  on  its  sensitivity  us¬ 
ing  a  Bayesian  analysis  of  its  performance. 

The  system  can  be  considered  to  be  a  filter 
tuned  to  a  region  of  three-space.  It  is  ex¬ 
tremely  fast,  running  in  real  time  on  a  per¬ 
sonal  computer,  and  is  easily  parallelized. 

Thus  it  is  an  efficient  alternative  to  deriv¬ 
ing  a  full  depth  map  when  proximity  infor¬ 
mation  is  needed.  Since  most  of  its  opera¬ 
tions  must  be  performed  by  any  edge-based 
stereo  system,  it  can  be  used  in  conjunction 
with  more  complicated,  higher  latency  sys¬ 
tems at little or no additional cost.¹

1  Introduction 

The construction of fully-general visual systems has
proven  very  difficult.  Even  simple  modules  for  com¬ 
putations  such  as  stereo  matching  or  structure-from- 
motion  have  proven  very  difficult  to  make  robust  and 
efficient.  Recently,  there  have  been  a  number  of  calls 
for  examining  qualitative  information  [10]  [1]  or  for 
basing  vision  research  on  concrete  tasks  [5][1][2][6]. 
One  approach  is  to  examine  simple  activities  which 
agents  must  commonly  perform  such  as  navigation 
and  manipulation.  By  examining  these  activities  we 
can  determine  what  information  is  needed  in  order 
to  perform  them.  We  can  then  look  at  the  problem 
of  extracting  these  different  pieces  of  information  as 
well-defined  computational  problems. 

In  this  paper,  I  will  examine  a  particular  type  of 
information,  proximity,  which  is  useful  for  tasks  such 

*Support for this research was provided in part by
the  University  Research  Initiative  under  Office  of  Naval 
Research  contract  N00014-86-K-0685,  and  in  part  by 
the  Advanced  Research  Projects  Agency  under  Office  of 
Naval  Research  contract  N00014-85-K-0124. 


as  navigation,  and  present  a  simple  stereo-based 
proximity  detector. 

Proximity  detection  is  the  problem  of  determin¬ 
ing  whether  there  is  an  object  within  a  given  dis¬ 
tance  from  the  agent.  The  most  obvious  application 
of  proximity  detection  is  collision  avoidance.  What 
is  important  is  what  proximity  detection  does  not 
involve:  it  does  not  require  reporting  the  exact  dis¬ 
tance  or  shape  of  the  object,  or  how  many  objects, 
or  what  their  boundaries  are — it  is  a  one-bit  answer. 
This  reduced  information  requirement  allows  the  use 
of dramatically simpler machinery*.

Formally,  we’ll  define  proximity  detection  to  be 
the  problem  of  determining  whether  there  is  an  ob¬ 
ject  within  some  region  of  3-space,  R,  called  the 
sensitive region of the detector. Here R is defined
relative  to,  and  moves  with,  the  agent’s  body.  The 
system  presented  in  this  paper  limits  R  to  a  one- 
parameter  family  of  regions  for  a  given  camera  con¬ 
figuration  and  choice  of  resolution  (see  below). 

2  Algorithm  description 

A  typical  edge-based  stereo  system  extracts  edges 
from both images, finds possible matches between
edges  in  the  left  and  right  images  and  then  uses 
some  sort  of  optimization  procedure  to  resolve  con¬ 
flicts  between  possible  matches,  so  as  to  determine 
for  each  edge  point  its  disparity  and  hence  its  depth 
[3].  For  proximity  detection,  we  are  concerned  only 
with  determining  when  an  object  comes  within  a 
certain  distance  of  the  viewer.  Thus  it  would  be 
sufficient  simply  to  determine  whether  any  points 
had  disparities  corresponding  to  the  correct  depth 
range.  A  further  simplification  is  to  note  only  when 
objects  come  into  range.  This  requires  only  deter¬ 
mining  if  there  are  any  edge  points  at  the  disparity 
corresponding  to  the  outer  boundary  of  the  sensitive 
region  of  the  proximity  detector. 

*For another example of a simplified stereo system
see  [9]. 
2.1 A simple algorithm

If we knew for some reason that each edge point had
an unambiguous match, we could omit the optimization
step mentioned above. Since we are only concerned
with a particular disparity d, we would only
have  to  match  at  that  one  disparity.  This  could  be 
accomplished  with  the  following  procedure: 

match-frames(left, right, d)
    le = find-edges(left)
    lr = find-edges(right)
    matches = 0

    for each row i and column j
        if le(i,j) and lr(i,j+d) are
                compatible edge points
            then increment matches
    return matches

where by “compatible edge points” I mean edge
points  which,  for  example,  have  the  same  sign.  If 
the  algorithm  returns  a  non-zero  value,  then  there 
must  be  an  object  at  the  specified  disparity. 
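
The procedure above can be sketched in Python roughly as follows. This is an illustrative sketch, not the implementation used in the experiments: `find_edges` is a stand-in that marks the sign of a thresholded horizontal intensity difference, the `threshold` parameter is an assumption, and "compatible" is taken to mean "same sign", as suggested in the text.

```python
def find_edges(image, threshold=10):
    """Hypothetical edge detector: mark each pixel +1 or -1 according to the
    sign of the horizontal central difference when its magnitude reaches
    `threshold`, and 0 otherwise. `image` is a list of rows of intensities."""
    edges = []
    for row in image:
        out = [0] * len(row)
        for j in range(1, len(row) - 1):
            dx = row[j + 1] - row[j - 1]
            if abs(dx) >= threshold:
                out[j] = 1 if dx > 0 else -1
        edges.append(out)
    return edges


def match_frames(left, right, d, threshold=10):
    """Count edge points in the left image that have a same-sign edge point
    in the right image at column offset d (a single disparity)."""
    le = find_edges(left, threshold)
    lr = find_edges(right, threshold)
    matches = 0
    for i in range(len(le)):
        width = len(le[i])
        for j in range(width):
            k = j + d
            # Skip columns that fall outside the right image.
            if 0 <= k < width and le[i][j] != 0 and le[i][j] == lr[i][k]:
                matches += 1
    return matches
```

A proximity decision would then compare the returned count against a threshold, as discussed in the following section.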

The  disadvantage  of  the  algorithm  is  that  matches 
between  frames  are  rarely  unambiguous  or  perfect. 
However,  this  algorithm  has  the  advantages  that  (1) 
it  is  very  simple  to  implement,  (2)  it  does  not  re¬ 
quire  high  resolution  to  help  disambiguate  matches, 
(3)  it  is  very  fast  and  can  therefore  detect  collisions 
with  very  low  latency,  (4)  it  is  easy  to  parallelize  on 
mesh-type  hardware,  and  (5)  the  basic  tasks  of  edge 
detection,  shifting  and  matching  at  a  given  disparity 
are  likely  to  be  required  for  any  edge-based  stereo  al¬ 
gorithm,  thus  the  algorithm  could  easily  be  run  in 
parallel  with  a  more  complicated,  higher  latency  al¬ 
gorithm  at  effectively  no  cost*. 

The  algorithm  can  be  considered  to  be  a  spatial 
filter, but a filter tuned in three-space, not in the
normal  image-plane  spatial  domain.  It  is  sensitive 
to  a  region  of  bounded  disparity  whose  thickness  is 
determined  by  the  quantisation  of  disparities.  In 
particular,  it  is  the  set  of  points  in  3-space  which 
project  to  points  in  the  images  with  disparity  in  the 
interval $[d - \frac{1}{2}, d + \frac{1}{2})$.
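
To get a feel for the size of this region, the disparity interval can be converted to a depth interval under an idealized model. The sketch below assumes parallel pinhole cameras, for which depth equals focal length times baseline divided by disparity. That relation is standard stereo geometry but is not stated in the text, and a real wide-angle lens will deviate from it. The function name and parameters are hypothetical.

```python
import math


def depth_band_mm(d, width_px, fov_deg, baseline_mm):
    """Near and far depth limits (in mm) of the region with disparity in
    [d - 1/2, d + 1/2), assuming ideal parallel pinhole cameras.
    Requires d > 0.5 so that the far limit is finite."""
    # Focal length in pixels from the horizontal field of view.
    f_px = (width_px / 2) / math.tan(math.radians(fov_deg) / 2)
    near = f_px * baseline_mm / (d + 0.5)
    far = f_px * baseline_mm / (d - 0.5)
    return near, far


near, far = depth_band_mm(d=2, width_px=64, fov_deg=110, baseline_mm=65)
```

With the configuration reported in Section 4 (64-pixel-wide images, 110 degree field of view, 65 mm baseline) and disparity 2, this idealized model gives a band of roughly 0.58 m to 0.97 m, or about 1.9 to 3.2 feet.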

2.2  How  bad  is  the  simple  algorithm? 

Let’s  assume  that  our  edge  detector  is  good  so  that 
we  can  ignore  matching  errors  due  to  noise  in  our 
edge detector. If we place an object in the sensitive
region  of  the  filter  we  will  get  some  response  r  which 
is  due  to  three  components:  a  noise  term  n  which  is 
the  total  number  of  false  matches  when  the  object  is 
not  present  in  the  sensitive  region  (i.e.  the  response 
of the filter to the background), a cross-section term

*For example, it could be run in conjunction with the
Marr-Poggio algorithm [7] simply by tapping one
layer  of  the  disparity-selective  cells  and  summing  their 
outputs. 


c  which  is  the  total  number  of  matched  edge  points 
on  the  object,  and  a  final  term  o  which  is  the  number 
of  false-positive  matches  which  are  occluded  by  the 
object,  that  is,  which  would  have  contributed  to  the 
response  of  the  filter  if  the  object  were  not  present. 
Thus  total  response  r  is 

$$r = n + c - o$$

If  we  neglect  the  o  term,  then  we  are  left  with  a  sim¬ 
ple  signal  which  we  would  like  to  estimate,  corrupted 
with  additive,  but  not  necessarily  gaussian,  noise: 

$$r = n + c$$

Suppose  we  want  to  make  a  simple  threshold- 
based  detector  for  the  filter  which  answers  that  there 
is  an  object  present  in  the  sensitive  region  iff  r  is 
greater  than  some  threshold  T.  We  can  scale  the 
variables  r,  n,  and  c  to  the  interval  [0,1]  so  that 
they  each  represent  a  percentage  of  matched  pixels, 
and treat r, c and n as random variables. If we assume
that c and n are independent with probability
densities C and N respectively*, then we can determine
the a priori probabilities of false positives and
false  negatives  for  various  settings  of  the  threshold 
T. 

Let α be the a priori probability of there being an
object in the sensitive region. For a false positive the
reading is entirely due to noise. Since the conditional
probability density function of r given that there is
no object in the sensitive region is simply N, and the
a priori probability of there being no object in the
sensitive region is 1 - α, we have that the a priori
probability $P_{FP}$ of a false positive with threshold T
is

$$P_{FP} = (1 - \alpha) \int_T^1 N(r)\,dr \qquad (1)$$

Since c and n are independent, we have that the
probability of getting a response $r_0$ given that there
is an object in the sensitive region is

$$P(r = r_0 \mid \text{positive}) = \int_0^{r_0} C(x)\,N(r_0 - x)\,dx \qquad (2)$$

Thus the a priori probability $P_{FN}$ of a false negative
given threshold T is

$$P_{FN} = \alpha \int_0^T \int_0^r C(x)\,N(r - x)\,dx\,dr \qquad (3)$$

Beyond this, nothing can be said without assuming
something about the forms of C and N. If we
assume that the noise is bounded by some noise level
$n_0$, that is that $N(x) = 0$ for $x > n_0$, then we can

*This  is  for  explanatory  purposes  only.  Such  distri¬ 
butions  are  not  normally  known. 


always unambiguously detect objects with cross sections
greater than $n_0$. If we are only concerned with
detecting objects with large cross sections then the
simple algorithm with a threshold $T > n_0$ will suffice.
Alternatively, if N has a long tail, we can still
treat the noise as being bounded by some arbitrary
$n_0$ at the price of a false positive rate of

$$(1 - \alpha) \int_{n_0}^{1} N(x)\,dx$$

while retaining the property that we will always detect
objects with cross sections greater than $n_0$.
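
Equations (1) to (3) are straightforward to evaluate numerically once C and N are fixed. The sketch below uses a discrete analogue in which n and c are integer counts of matched pixels, C and N are probability mass functions given as lists indexed by count, and an object is reported when r exceeds T. The function name and the example distributions are illustrative assumptions, not measured data.

```python
def error_rates(C, N, T, alpha):
    """Return (P_FP, P_FN) for a detector that reports an object iff r > T.

    C[x] : probability that the object's cross section is x matched pixels
    N[y] : probability that the background produces y false matches
    alpha: a priori probability that an object is in the sensitive region
    """
    # Distribution of r = n + c when an object is present: the discrete
    # convolution of C and N (the analogue of equation (2)).
    R = [0.0] * (len(C) + len(N) - 1)
    for x, pc in enumerate(C):
        for y, pn in enumerate(N):
            R[x + y] += pc * pn
    # False positive: no object, but noise alone exceeds the threshold.
    p_fp = (1 - alpha) * sum(p for r, p in enumerate(N) if r > T)
    # False negative: object present, but the response stays at or below T.
    p_fn = alpha * sum(p for r, p in enumerate(R) if r <= T)
    return p_fp, p_fn


# Toy example: noise is 0 or 1 match with equal probability (so n_0 = 1),
# and the object always contributes exactly 3 matches.
noise = [0.5, 0.5]
cross_section = [0.0, 0.0, 0.0, 1.0]
rates_at_noise_bound = error_rates(cross_section, noise, T=1, alpha=0.5)
```

In this toy example, setting the threshold at the noise bound gives no errors of either kind, while lowering it to 0 admits false positives and raising it to 3 admits false negatives.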

2.3  Noise  and  sensitivity 

Not  surprisingly,  the  noise  level  limits  the  size  of 
the  objects  which  we  can  detect  in  the  sensitive  re¬ 
gion.  Although  the  precise  limit  is  in  terms  of  the 
cross  section  of  the  object,  with  a  few  assumptions 
we  can  translate  this  into  a  bound  on  object  vol¬ 
ume  which  will  give  us  the  order  of  growth  of  the 
worst  case  sensitivity  with  noise.  Let’s  define  the 
worst  case  threshold  volume  to  be  the  least  volume 
Vt  such  that  any  object  with  volume  at  least  Vt  will 
be  unambiguously  detectable. 

Since we are concerned with how small an object
we can detect, we’ll concern ourselves with objects
which fit entirely within the sensitive region*. It is
also necessary to distinguish between what I will call
well-textured objects, which have sufficient texture to
cause  the  edge  detector  to  fire  all  over  their  surface, 
and  untextured  objects  which  only  reliably  produce 
a  response  from  the  edge  detector  along  their  gener¬ 
ating  contours  (image-plane  boundaries). 

Let  P  be  the  projection  of  the  object.  The  volume 
V  of  the  object  is  bounded  above  by  the  product  of 
the thickness of R, the maximum depth of R,
the focal length of the lens, and the area of P. All
but the latter are constants, thus V = O(P). This
is  sufficient  to  give  us  the  order  of  growth  of  the 
threshold  volume  with  noise: 

Theorem 1 For small objects (objects which fit in
R), the order of growth of the worst case threshold
volume of the simple detector is $O(n_0)$ for well-textured
objects or $O(n_0^2)$ for untextured objects.

Proof: Recall that the detector can unambiguously
detect any object with cross section greater than $n_0$.
Since $P = \Omega(V)$, we need only show that the cross
section is $\Omega(P)$ for textured objects or $\Omega(P^{1/2})$ for


*The analysis of larger objects is highly shape-dependent.
There are a number of pathological cases
which  can  break  the  detector  such  as  very  long  cones 
with  little  texture  coming  head-on. 


untextured objects respectively*. In the textured
case, we have that the cross section is bounded below
by some constant times the area of P, thus $c = \Omega(P)$.
For untextured objects, we must at least be able
to sense the generating contour of the object. The
region with the least boundary per unit area is a
circle, for which the area is quadratic in the length
of the boundary. Thus $c = \Omega(P^{1/2})$.
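
The two cases can be made concrete with a short worked step. Here ρ is the radius of a disc-shaped projection, ℓ is the length of its image-plane boundary, and k is an assumed constant giving matched edge pixels per unit of boundary length (untextured case) or per unit of area (textured case). These symbols are introduced for illustration only and do not appear in the original argument.

```latex
\begin{align*}
P &= \pi \rho^2, \qquad \ell = 2\pi\rho = 2\sqrt{\pi P} \\
\text{untextured:}\quad c &\ge k\,\ell = 2k\sqrt{\pi P} > n_0
  \iff P > \frac{n_0^2}{4\pi k^2},
  \quad\text{so } V_T = O(n_0^2) \\
\text{textured:}\quad c &\ge k\,P > n_0
  \iff P > \frac{n_0}{k},
  \quad\text{so } V_T = O(n_0)
\end{align*}
```

Because the disc has the least boundary per unit area, any other shape of the same projected area has at least as long a contour, so the untextured bound holds in general.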

3  Strategies  for  coping  with  noise 

In  practice,  there  is  always  matching  noise.  I  will 
discuss  three  possible  modifications  to  the  algorithm 
to  reduce  the  noise  level. 

3.1  Ignoring  noise 

Before  complicating  the  algorithm,  it  is  worth  evalu¬ 
ating  just  what  the  impact  of  noise  is  on  the  system. 
As  was  mentioned  above,  the  simple  version  of  the 
algorithm  will  suffice  if  objects  tend  to  have  large 
cross  sections  (i.e.  are  large  and  textured). 

Some  environments  will  produce  very  little  noise. 
For  example,  if  there  is  very  little  texture  in  the 
environment  and  it  is  widely  spaced,  then  noise  is 
less  likely  to  be  a  problem.  Unfortunately,  this  is 
very  difficult  to  quantify.  Empirically,  we  can  always 
ignore  the  noise,  set  a  threshold,  and  see  what  the 
results  are.  Analytically  however,  the  best  we  can 
do  is  assume  an  unrealistic  noise  model  and  use  it 
to  characterize  the  response  of  the  algorithm. 

Let’s ignore depth effects and model scene texture
as a binary random field in which each pixel independently
assumes a value of 1 (edge pixel) with
probability β, or 0 (no edge) with probability 1 - β.
This is roughly like viewing a random-dot display at
infinity. Then in order for a pixel in the background
to be falsely matched, both it and the pixel to which
the matching is occurring must be marked as edge
points. Since they are independent, this occurs with
probability β², and so the expected fraction of pixels
which will be falsely matched will also be β². Thus
$n_0 = O(\beta^2)$.
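
The quadratic dependence is easy to check by simulation. The sketch below draws one binary random field, treats it as the view from both cameras (a display at infinity), and counts the pixels that match an edge pixel d columns away. The function name, field size, and seed are arbitrary choices made for illustration, and edge sign is ignored.

```python
import random


def false_match_fraction(beta, rows, cols, d, seed=0):
    """Fraction of compared pixel pairs (j, j + d) in a binary random field
    of edge density beta for which both pixels are edge pixels.
    Assumes 0 < d < cols."""
    rng = random.Random(seed)
    field = [[rng.random() < beta for _ in range(cols)] for _ in range(rows)]
    matches = 0
    compared = 0
    for row in field:
        # Only columns whose partner j + d lies inside the field are compared.
        for j in range(cols - d):
            compared += 1
            if row[j] and row[j + d]:
                matches += 1
    return matches / compared


sparse = false_match_fraction(beta=0.2, rows=200, cols=200, d=2)
dense = false_match_fraction(beta=0.4, rows=200, cols=200, d=2)
```

Doubling the edge density from 0.2 to 0.4 should roughly quadruple the fraction of false matches, from about 0.04 to about 0.16.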

Given  this,  we  can  consider  the  noise  level  to  be 
quadratic  in  the  density  of  texture  in  the  scene. 

Theorem 2 For small objects in front of a binary
random field, the order of growth of the worst-case
threshold volume of the simple detector is $O(\beta^2)$ for
well-textured objects, or $O(\beta^4)$ for untextured objects.

Proof:  Follows  from  theorem  1  and  the  above  anal¬ 
ysis. 

*Ω is roughly the opposite of O. To say that $f(n) = \Omega(g(n))$
means that for sufficiently large n, f is bounded
below by some constant times g.
Of course binary random fields at infinity are not
a  realistic  model  of  the  average  scene.  The  result 
is  useful  however,  both  because  it  is  reasonable  to 
expect  the  quadratic  dependence  of  the  noise  level 
on  the  level  of  texture  in  the  background  to  carry 
through  to  other  domains,  and  because  it  points  out 
how  unreasonable  it  is  to  expect  the  simple  detector 
to  find  untextured  objects  in  a  cluttered  background. 

3.2  Filtering 

Since disparity decreases with increasing distance*,
we need only ensure that any edges in the background
are sufficiently far apart that they cannot
be confused with each other at the disparity
we’re  concerned  with.  This  can  be  accomplished 
with  band-pass  filtering  as  with  the  Marr-Poggio- 
Grimson  stereo  algorithm[4]  or  simply  with  low-pass 
filtering  (smoothing).  When  an  image  is  band-pass 
filtered,  its  zero-crossings  occur  on  average  with  the 
frequency  of  the  filter’s  center-frequency.  Without 
actually  band-pass  filtering,  we  can  still  reduce  the 
noise  level  by  low-pass  filtering.  This  is  the  same  as 
moving  to  a  coarser  point  in  scale  space. 

Unfortunately  filtering  introduces  its  own  distor¬ 
tions.  The  filtering  can  disturb  the  locations  of  the 
edges,  thus  causing  false-negatives.  In  addition,  fil¬ 
tering  limits  the  size  of  the  texture,  thus  limiting  the 
sensitivity  of  the  detector. 

3.3  Improved  edge  matching 

It is also possible to reduce noise by making the similarity
measure used for local matching of edges more
selective.  For  example,  orientation,  slope  or  inten¬ 
sity  might  be  taken  into  account,  or  a  wider  area  of 
support  might  be  matched.  I  have  not  experimented 
with  this  approach. 

3.4  Accommodation 

Another  tactic  is  to  directly  modify  the  optical  prop¬ 
erties  of  the  imaging  system  itself.  Since  the  transfer 
function  of  a  lens  varies  with  depth,  adding  increas¬ 
ing  filtering  as  we  move  farther  from  the  plane  of 
fixation,  we  can  automatically  blur  texture  far  from 
the  sensitive  region  by  focusing  the  lens  on  the  sensi¬ 
tive  region.  Pentland  has  successfully  used  blurring 
to  estimate  depth  in  synthetic  images  [8].  We  can 
also converge the cameras on the sensitive region so
that it is at zero disparity. This technique has been
used both in the Marr-Poggio-Grimson stereo algorithm
[7] [4] and by the animate vision system of the
Rochester  robot  [2].  These  techniques  have  the  ad¬ 
vantage  that  they  are  depth  dependent — they  have 
their  maximum  effect  on  points  far  from  the  sensi- 

*For roughly parallel cameras.


tive  region.  Thus  even  false  matches  are  much  more 
likely  to  be  due  to  texture  near  the  sensitive  region. 

4  Experimentation 

I  have  implemented  a  version  of  the  simple  algo¬ 
rithm  and  performed  preliminary  experiments  with 
live  data  in  an  office  environment.  The  implemented 
system  samples  the  image  at  a  resolution  of  64  x  48 
pixels.  Two  cameras  with  3mm  fixed-focus  lenses 
(110  degree  field  of  view)  and  a  baseline  of  65mm 
are  used.  The  optic  axes  of  the  cameras  were  roughly 
parallel but were not precisely aligned. The implementation
first smooths the images with a 3 x 3
kernel  and  then  marks  pixels  which  are  strong  lo¬ 
cal  maxima  of  the  x  derivative  of  intensity  (vertical 
edges)  with  the  sign  of  the  derivative.  Finally,  edges 
are matched for sign. None of the improvements
listed in section 3 were used.
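
The edge-marking step described above can be sketched as follows. The text does not specify the smoothing kernel or what counts as a "strong" maximum, so this sketch assumes an unnormalized 3 x 3 box sum with replicated borders and a hypothetical `strength` threshold on the magnitude of the horizontal central difference. Ties between equal neighbouring maxima are resolved toward the right-hand pixel.

```python
def box_sum3x3(img):
    """Unnormalized 3 x 3 box filter; border pixels are replicated."""
    rows, cols = len(img), len(img[0])

    def px(i, j):
        return img[min(max(i, 0), rows - 1)][min(max(j, 0), cols - 1)]

    return [[sum(px(i + di, j + dj) for di in (-1, 0, 1) for dj in (-1, 0, 1))
             for j in range(cols)] for i in range(rows)]


def signed_vertical_edges(img, strength=100):
    """Mark pixels that are strong local maxima (along x) of the magnitude
    of the x derivative of the smoothed image with the derivative's sign:
    +1, -1, or 0 for no edge."""
    s = box_sum3x3(img)
    rows, cols = len(s), len(s[0])
    # Central difference along x; the first and last columns stay 0.
    dx = [[0] * cols for _ in range(rows)]
    for i in range(rows):
        for j in range(1, cols - 1):
            dx[i][j] = s[i][j + 1] - s[i][j - 1]
    edges = [[0] * cols for _ in range(rows)]
    for i in range(rows):
        for j in range(1, cols - 1):
            m = abs(dx[i][j])
            if (m >= strength and m >= abs(dx[i][j - 1])
                    and m > abs(dx[i][j + 1])):
                edges[i][j] = 1 if dx[i][j] > 0 else -1
    return edges


# A dark-to-light step yields one positive edge per row; the mirrored
# image yields one negative edge per row.
step = [[0] * 6 + [100] * 6 for _ in range(5)]
rising = signed_vertical_edges(step)
falling = signed_vertical_edges([row[::-1] for row in step])
```

Two edge maps of this kind, one per camera, are what the matching procedure of section 2.1 compares for sign.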

The system is implemented on a Macintosh IIx using
a home-brew Lisp compiler. The current implementation
is I/O bound, the time to process a frame
being only 142 ms, while the time to grab the two
frames and display debugging information is at least
200 ms, depending on the amount of debugging information.
A port to a high-performance digital signal
processor is underway. The source code used in the
tests is available from the author. Sample results
for scenes with distant and far objects are shown in
figure 2.

Test  results  for  six  objects  (three  people,  a  chair, 
a  fire  extinguisher,  and  a  toy  cow)  are  given  in  figure 
1.  The  figure  gives  depth-tuning  curves,  the  graphs 
of  response  vs.  depth,  for  each  object.  The  tuning 
curve  is  a  useful  measure  of  performance.  The  data 
show that the detector is tuned to a depth interval
of approximately two to four feet*. Another useful
performance measure is peak signal-to-noise ratio,
also given in the figure. The worst S/N ratio was
3dB,  meaning  that  the  peak  response  to  the  object 
was  still  twice  the  background  noise  level.  This  is  a 
fairly  comfortable  margin. 
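
The correspondence between 3 dB and a factor of two holds when the ratio is expressed as ten times its base-10 logarithm. That this is the convention behind the reported figures is an inference from the sentence above, not something the text states. A minimal check, with a hypothetical helper name:

```python
import math


def snr_db(peak_response, noise_level):
    """Signal-to-noise ratio in decibels, using the 10 * log10 convention."""
    return 10.0 * math.log10(peak_response / noise_level)


# A peak response of twice the 12-pixel noise level quoted in Figure 1.
worst_case = snr_db(24, 12)
```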

The  system  had  the  easiest  time  detecting  people 
since  people  are  relatively  large  and  well  textured. 
The  fire  extinguisher  and  toy  cow  were  more  difficult 
because  they  were  smaller  and  poorly  textured.  The 
toy  cow  was  an  extreme  case,  being  only  7  inches  tall 
and  occupying  only  60-70  pixels  when  at  the  center 
of  the  sensitive  region.  The  limited  dynamic  range 
of  the  cameras  was  a  problem.  Intensity  values  of 
lightly  colored  surfaces  were  generally  saturated,  sig¬ 
nificantly  perturbing  the  response  of  the  edge  detec¬ 
tor  near  saturated  regions  while  entirely  removing 

*The  author  apologizes  for  performing  the  experi¬ 
ments  in  feet,  rather  than  meters  as  is  done  in  real  sci¬ 
ence,  but  he  couldn’t  find  a  metric  tape  measure. 
Figure 1: Depth-tuning curves and peak S/N ratios
for  various  objects.  S/N  ratios  for  objects  are  mea¬ 
sured  as  the  ratio  of  peak  response  to  that  object 
over  the  noise  level  (12  pixels).  Images  are  64  x  48 
pixels,  cameras  have  110  degree  fields  of  view  with 
a  baseline  of  65mm.  Objects  IDH,  JLS,  and  TK 
are  people.  Each  object  was  tested  in  nine  positions 
of  varying  depth  for  20  trials  each.  The  largest  vari¬ 
ance  was  6.9  pixels.  Readings  past  7  feet  are  entirely 
matching  noise.  The  noise  level  was  measured  as  the 
largest  such  noise  value  observed  over  all  trials  for  all 
data  points.  The  background  was  a  cluttered  office. 


edges  within  those  regions.  Moreover,  this  satura¬ 
tion  varied  between  the  two  cameras  so  that  dispar¬ 
ities  near  saturated  areas  were  largely  meaningless. 
The  curved,  specular  surface  of  the  fire  extinguisher 
was also difficult for the detector.

These  experiments  were  limited  by  the  inability  to 
mount  the  stereo  pair  on  a  mobile  platform.  I  hope 
to  use  the  system  for  navigation  on  a  mobile  robot 
later  this  fall. 

5  Conclusion 

The  system  presented  here  is  simple,  robust,  fast, 
easily parallelizable, and can be implemented in
combination  with  more  complicated  stereo  systems 
for  little  or  no  computational  cost.  This  is  not  to 
imply  that  it  is  suitable  for  all  tasks.  Quite  the  con¬ 
trary.  The  system  is  an  example  of  how  knowledge 
of  a  task  can  simplify  the  information  required  from 
a  vision  module  and  thus  dramatically  simplify  the 
computational  machinery  needed  to  implement  it. 

This  raises  several  questions.  First,  how  broad 
a  range  of  tasks  can  be  supported  by  such  simple 
systems?  Second,  is  there  a  small  set  of  simple  sys¬ 
tems  which  together  can  perform  a  broad  range  of 
tasks?  Finally,  how  do  we  learn  from  and  general¬ 
ize  such  simple,  domain-specific  systems?  The  first 
two  are  empirical,  the  latter  more  philosophical.  A 
common  complaint  about  domain-specific  process¬ 
ing  is  that  it  is  “just  a  bunch  of  hacks”  which  don’t 


teach  us  anything  about  the  real  problems.  I  believe 
that  these  systems  can  be  analyzed  in  ways  which  al¬ 
low  us  both  to  make  predictions  about  performance 
and  to  apply  what  we’ve  learned  to  other  problems. 
With the analysis in this paper, I hope to convince the
reader  that  we  can  learn  more  from  these  systems 
than  just  the  fact  that  they  work. 
Figure 2: Performance of the system in an unmodified office environment. “Grey-scale” is the original
sampled  image  from  the  left  camera,  “left”  and  “right”  are  the  respective  images  overlaid  with  derived 
edges,  “matches”  is  the  set  of  edges  matched  between  images  at  disparity  2,  and  “stereo”  is  the  left  grey 
scale  image  overlaid  with  matched  edges. 
Abstract

In  this  paper,  we  extend  our  stealth  terrain 
navigation  approach  to  plan  paths  such  that 
two  groups  of  vehicles  move  in  a  bounding 
overwatch  manner.  Furthermore,  the  planned 
paths for the vehicles themselves are subject
to intervisibility constraints, configuration
constraints,  and  different  terrain  traversabili- 
ties  due  to  the  variations  in  terrain  type  and 
slope.  A  spatial-temporal  sampling  approach 
is  adopted  to  discretize  the  solution  space  and 
facilitate  fast  computation  on  a  massively  par¬ 
allel  machine.  One  of  the  key  computations  in 
the  planning  is  region-to-region  visibility  anal¬ 
ysis,  for  which  a  fast  parallel  algorithm  is  de¬ 
scribed.  The  algorithms  are  implemented  on 
a  Connection  Machine  CM-2,  and  the  experi¬ 
mental  results  show  that  the  planning  system 
effectively  generates  good  paths. 

1  Introduction 

This  paper  describes  a  path  planning  method  for  stealth 
terrain  navigation  with  bounding  overwatch  using  a  mas¬ 
sively  parallel  machine.  This  is  an  extension  of  the 
stealth  terrain  navigation  approach  used  in  [7].  We  con¬ 
sider  the  problem  of  planning  paths  for  two  groups  of  ve¬ 
hicles,  each  of  which  consists  of  two  vehicles,  from  their 
common  initial  location  to  their  common  final  goal.  The 
terrain  through  which  they  must  move  is  “hostile”  in 
the  sense  that  there  are  adversaries  moving  through  the 
terrain.  The  vehicles  should  remain  hidden  from  the  ad¬ 
versaries  to  the  greatest  extent  possible.  Initial  informa¬ 
tion  is  available  concerning  the  locations  and  movements 
of  these  adversaries,  but  it  is  expected  that  these  mod¬ 
els  will  degrade  over  time,  so  that  the  plan  developed 
must  support  reconnaissance  activities.  This  require¬ 
ment  should  be  fulfilled  in  a  bounding  overwatch  manner 
such  that  one  of  the  two  groups  serves  as  observer  while 
the  other  group  moves  along  a  safe  path  to  a  new  ob¬ 
servation  point  using  the  information  collected  from  the 
observer;  the  two  groups  then  switch  roles.  Furthermore, 

*The  support  of  the  Defense  Advanced  Research  Projects 
Agency  (ARPA  Order  No.  6350)  and  the  U.S.  Army  Engi¬ 
neer  Topographic  Laboratories  under  Contract  DACA76-88- 
C-0008  is  gratefully  acknowledged. 


the  planned  paths  for  the  vehicles  themselves  are  subject 
to  intervisibility  constraints  for  line  of  sight  communi¬ 
cation,  and  configuration  constraints  such  that  the  two 
vehicles  in  a  group  move  in  parallel,  and  the  progress  of 
the  vehicles  through  the  terrain  is  differentially  impeded 
by  terrain  type  and  slope.  Generally,  these  problems  are 
instances  of  path  planning  in  two  dimensional  space  with 
time  varying  constraints.  Such  problems  are  known  to  be 
computationally  hard  [2,  5].  This  will  lead  us  to  the  de¬ 
velopment  of  heuristic  and  approximate  algorithms  that 
can  avoid  a  direct  assault  on  these  combinatorial  prob¬ 
lems,  while  at  the  same  time  developing  demonstrably 
good  solutions  to  such  problems. 

Our  basic  approach  is  to  represent  the  problem  using 
discretizations of space and time, and to develop massively
parallel algorithms for the fundamental underlying
computations (e.g., visibility analysis, reachability on
terrain).  The  discretization  allows  many  basic  computa¬ 
tions  to  be  arranged  in  a  regular  pattern,  and  therefore 
solved  in  parallel  efficiently. 

The remainder of this paper is organized as follows: In
Section 2 we describe our planning scheme and its application
to the above problem. Section 3 describes the
visibility analysis algorithms which are essential to the
choice  of  path  points  for  maximal  safety  and  observation 
points  for  the  reconnaissance  activities.  The  algorithms 
are  implemented  on  a  Connection  Machine  CM-2  [4]  and 
experimental  results  are  given  in  Section  4. 

2  The  path  planning  scheme 

In  our  path  planning  scheme,  the  path  planning  process 
is  divided  into  stages,  and  the  two  groups  take  turns  serv¬ 
ing  as  an  observer  while  a  path  is  planned  for  the  other 
group  at  each  stage.  The  planning  process  in  one  stage 
is referred to as a subplanning process, and the length
of  a  stage,  referred  to  as  the  subplan  interval,  is  usu¬ 
ally  determined  before  the  subplanning  process.  Since 
the  entire  planning  process  is  done  by  repeating  the  sub¬ 
planning  process,  the  following  discussion  focuses  on  the 
subplanning  process.  Since  a  spatial  sampling  approach 
is  adopted,  the  terrain  is  represented  by  a  regular  grid 
with  elevation  data  at  each  grid  cell. 

We  first  describe  the  subplanning  process  for  a  single 
agent,  and  then  extend  the  approach  to  plan  for  a  group 
of  agents.  The  criteria  for  choosing  the  subgoal  in  the 
bounding  overwatch  problem  are  described  at  the  end  of 
this section.


2.1  Subplanning  for  a  Single  Agent 

In  the  subplan  for  a  single  agent,  a  subgoal  is  chosen  from 
the  area  that  is  reachable  within  the  subplan  interval  by 
a  good  path.  In  the  case  of  a  single  agent,  a  “good” 
path  usually  refers  to  a  path  that  remains  hidden  from 
the  adversaries  as  much  as  possible  by  taking  advantage 
of  the  terrain.  This  property  is  referred  to  as  safety. 

Since  it  is  very  difficult  to  represent  the  visibility  map 
of  a  moving  object  analytically,  a  temporal-sampling  ap¬ 
proach  is  adopted,  in  which  a  sampling  period  is  deter¬ 
mined  in  advance,  a  visibility  map  is  computed  for  each 
sampling  period,  and  the  safety  is  defined  by  these  sam¬ 
ples.  The  visibility  map  consists  of  a  binary  value  on 
each  grid  cell  and  represents  the  regions  visible  from  the 
predicted  adversary  locations.  It  is  computed  by  the 
point-to-region visibility analysis algorithm [1, 7], which
will  also  be  described  in  Section  3.2. 

These  visibility  maps  are  combined  with  terrain 
traversability  using  a  dynamic  programming  paradigm 
to  compute  the  reachable  region  and  evaluate  the  safety 
of  the  best  path  to  each  grid  cell  in  the  region.  This 
computation  is  performed  incrementally  for  each  sam¬ 
pling  period  throughout  the  subplan  interval.  The  data 
structure  can  be  described  as  a  path  credit  table 

$$\mathrm{Path\_credit}(P, t)$$

where P is the cell index and t is the time index. Each
element counts the number of safe points along the safest
path from the initial location to the cell P at time t.
Initially,

$$\mathrm{Path\_credit}(P, 0) = \begin{cases} 1 & \text{if cell } P \text{ is the initial location} \\ 0 & \text{otherwise} \end{cases}$$

The table is computed slice-wise for each sampling period
from t = 0 to the end of the subplan interval, and
the computation for each element depends on the terrain
traversability.

In  our  method,  terrain  traversability  is  modeled  by 
discretizing  the  directions  in  which  the  agent  can  move 
through  a  grid  cell  and  associating  a  traversal  cost  for 
each  of  them  as  the  amount  of  time  needed  to  pass 
through  the  cell  in  that  given  direction.  The  cost  can 
be  determined  by  the  slope,  the  type  of  terrain,  and 
other  factors  that  may  affect  the  mobility.  Using  this 
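modeling of terrain traversability, the Path_credit table
is computed according to the following recurrence relation:

$$\mathrm{Path\_credit}(P, t) = \max \left\{ \begin{array}{l} \mathrm{Path\_credit}(P, t - 1) + \mathrm{Safety}(P, t), \\ \mathrm{Path\_credit}(Q, t - \mathrm{Tcost}(Q, P)) + \mathrm{Safety\_count}(Q, t - \mathrm{Tcost}(Q, P) + 1, t) \end{array} \right\}$$

where

Q is any neighbor of P,
Tcost(Q, P) = the traversal cost through Q in the direction to P,

$$\mathrm{Safety}(P, t) = \begin{cases} 1 & \text{if cell } P \text{ is safe at time } t \\ 0 & \text{otherwise} \end{cases}$$

$$\mathrm{Safety\_count}(P, t_1, t_2) = \sum_{t = t_1}^{t_2} \mathrm{Safety}(P, t)$$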



Figure  1:  Construction  of  the  path  credit  table  with  con¬ 
sideration  of  terrain  traversability.  For  each  neighbor  of 
P,  the  propagated  value  is  the  sum  of  all  values  shown 
in  the  two-column  table  associated  with  each  cell. 


See  Figure  1  for  an  illustration  of  this  formula, 
in  which  four  directions  are  considered  for  traversing 
through  a  cell.  One  way  to  interpret  this  formula  is 
that  each  grid  cell  propagates  delayed  path  credits  to  its 
neighbors.  Each  path  credit  is  delayed  by  an  amount  of 
time  equal  to  the  traversal  cost  before  it  is  propagated  to 
the  corresponding  neighbor  and  it  is  updated  with  safety 
information  during  the  delayed  period.  After  receiving 
the  path  credits  from  its  neighbors,  each  cell  decides  its 
new  path  credit  by  choosing  the  maximum  among  the 
received  values  and  its  own  previous  path  credit  which 
is  also  updated  by  adding  the  current  safety  to  it. 

Clearly, Path_credit(P, t) > 0 if and only if cell P can
be reached from the initial location at time t. Thus, the
subgoal  is  chosen  only  from  cells  with  path  credit  greater 
than  zero  at  the  end  of  the  interval.  After  the  subgoal  is 
chosen,  the  path  can  be  extracted  from  the  path  credit 
table  by  a  gradient-following  method,  or  more  efficiently, 
by  storing  a  pointer  for  each  element  to  the  previous  cell 
along  the  best  path  during  the  construction  of  the  table 
and  tracing  these  pointers  back  to  the  initial  location. 

The  following  algorithm  summarizes  the  subplanning 
process  for  a  single  agent: 
Algorithm: subplanning process for a single agent
begin
  for every terrain grid cell P
  begin
    if P is the initial location of the agent then
      Path_credit(P, 0) = 1;
    else
      Path_credit(P, 0) = 0;
    end if
    for t = 1 to end_of_subplan_interval do
    begin
      update Path_credit(P, t) by the recurrence formula;
      Prev(P, t) = the pointer to the cell that
        contributes to the value of Path_credit(P, t);
    end for
  end for every
  choose a subgoal;
  Path(end_of_subplan_interval) = the subgoal position;
  for t = end_of_subplan_interval to 1 do
    Path(t - 1) = Prev(Path(t), t);
end Algorithm
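
As a rough illustration of the table construction and back-pointer tracing, the following Python sketch fills the path credit table on a small graph of cells. The names `build_path_credit`, `trace_path`, `tcost`, and `safety` are hypothetical. The sketch also assumes that credit is propagated only from cells whose credit is already positive, which is one way to keep the property that a positive credit means the cell is reachable.

```python
def build_path_credit(cells, tcost, safety, start, horizon):
    """Fill credit[t][p] and back-pointers prev[t][p] for t = 0..horizon.

    cells  : iterable of hashable cell ids
    tcost  : dict {(q, p): steps}, traversal cost through q toward p (>= 1)
    safety : function (cell, t) -> 0 or 1
    """
    cells = list(cells)
    incoming = {p: [] for p in cells}
    for (q, p), cost in tcost.items():
        incoming[p].append((q, cost))

    credit = [{p: 0 for p in cells} for _ in range(horizon + 1)]
    prev = [{p: None for p in cells} for _ in range(horizon + 1)]
    credit[0][start] = 1

    for t in range(1, horizon + 1):
        for p in cells:
            best, best_prev = 0, None
            # Stay in P: previous credit plus the current safety of P.
            if credit[t - 1][p] > 0:
                best = credit[t - 1][p] + safety(p, t)
                best_prev = (p, t - 1)
            # Arrive from a neighbor Q after the traversal delay.
            for q, cost in incoming[p]:
                t0 = t - cost
                if t0 < 0 or credit[t0][q] == 0:
                    continue  # assumed gating: Q was not reachable in time
                gained = sum(safety(q, s) for s in range(t0 + 1, t + 1))
                if credit[t0][q] + gained > best:
                    best = credit[t0][q] + gained
                    best_prev = (q, t0)
            credit[t][p], prev[t][p] = best, best_prev
    return credit, prev


def trace_path(prev, goal, horizon):
    """Follow back-pointers from (goal, horizon) to the start at time 0."""
    path, node = [(goal, horizon)], prev[horizon][goal]
    while node is not None:
        path.append(node)
        cell, t = node
        node = prev[t][cell]
    return path[::-1]
```

The returned path is a list of (cell, time) waypoints rather than one cell per sampling period, because a traversal may take more than one period.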

2.2  Subplanning  for  a  Group  of  Agents 

Now  we  extend  the  method  to  plan  a  path  for  a  group 
of  agents  so  that  they  can  maintain  a  given  configura¬ 
tion  and  optimize  their  safety  and  mutual  visibility  dur¬ 
ing  their  movement.  In  our  problem,  the  agents  in  a 
group are required to maintain their configuration as
a  line  segment  perpendicular  to  their  direction  of  mo¬ 
tion.  A  line  segment  can  be  represented  by  the  location 
of  its  center  point  and  its  orientation.  With  the  addi¬ 
tional  time-axis,  the  dimensionality  of  the  solution  space 
is  four.  Our  framework  can  be  applied  directly  in  this 
four-dimensional  search  space,  but  it  is  more  efficient  to 
make  a  further  reduction  of  the  dimensionality  by  apply¬ 
ing  a  decomposition. 

First  we  represent  the  locations  of  the  agents  by  the 
center  of  the  segment,  and  find  a  good  path  for  the  cen¬ 
ter,  where  “good”  means  a  high  likelihood  that  a  seg¬ 
ment  centered  at  this  cell  will  be  safe  and  intervisible, 
no  matter  what  orientation  the  segment  is  in.  This  path 
is  actually  a  corridor  for  the  segment,  and  the  agents  are 
allocated  within  the  corridor  using  some  simple  heuris¬ 
tics.  Since  the  corridor  has  a  statistically  good  evalua¬ 
tion  for  all  the  constraints,  we  should  be  able  to  find  an 
acceptable  allocation  easily.  Therefore,  we  can  apply  the 
subplanning  method  for  a  single  agent  to  the  subplan¬ 
ning  for  the  group  center  by  propagating  a  path  credit 
defined  on  both  safety  and  intervisibility. 

For  a  cell  to  be  a  potential  group  center,  safety  is  mod¬ 
eled  as  the  number  of  safe  cells  within  a  circle  centered 
at  the  cell  with  a  diameter  equal  to  the  segment  length. 
This information has to be computed for each temporal
sample. For efficiency on regular grids, the circle is
approximated  by  the  circumscribing  square. 
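
The square-window safety count can be sketched sequentially with a summed-area table. This is only an illustration for one temporal sample: the function name `group_safety` is hypothetical, the window is assumed to have an odd side length, and windows are assumed to be clipped at the terrain border.

```python
def group_safety(safe, seg_len):
    """Count safe cells in the square of side seg_len centered at each cell.

    safe    : 2-D list of 0/1 values for one temporal sample
    seg_len : side of the circumscribing square (assumed odd)
    """
    rows, cols = len(safe), len(safe[0])
    half = seg_len // 2
    # sat[r][c] holds the sum of safe[0..r-1][0..c-1] (zero border).
    sat = [[0] * (cols + 1) for _ in range(rows + 1)]
    for r in range(rows):
        for c in range(cols):
            sat[r + 1][c + 1] = (safe[r][c] + sat[r][c + 1]
                                 + sat[r + 1][c] - sat[r][c])
    out = [[0] * cols for _ in range(rows)]
    for r in range(rows):
        for c in range(cols):
            # Clip the window to the terrain (assumed border handling).
            r0, r1 = max(0, r - half), min(rows, r + half + 1)
            c0, c1 = max(0, c - half), min(cols, c + half + 1)
            out[r][c] = sat[r1][c1] - sat[r0][c1] - sat[r1][c0] + sat[r0][c0]
    return out
```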

Intervisibility  is  measured  by  sampling  several  line  seg¬ 
ments  centered  at  a  grid  cell,  and  counting  the  number 
of  pairs  of  cells  that  can  see  each  other  along  these  line 
segments.  It  needs  to  be  computed  only  once  due  to 
its  invariance  with  time.  Currently  we  sample  the  hor¬ 
izontal,  vertical,  and  the  two  diagonal  segments  to  take 


advantage  of  the  local  communication  links  of  the  reg¬ 
ular  grids  so  that  the  computation  can  be  done  on  the 
terrain  efficiently.  The  intervisibility  on  each  line  seg¬ 
ment  is  computed  by  pipelining  the  sequential  version  of 
the  line-visibility  algorithm  [7,  1]. 

A  weighted  sum  of  safety  and  intervisibility  determines 
the  point  credit  of  a  cell  to  be  the  group  center.  A  thresh¬ 
old  on  point  credit  determines  the  goodness  of  a  cell  as 
a  path  point,  and  the  path  credit  is  defined  as  the  num¬ 
ber  of  good  path  points  along  the  best  path.  The  path 
credit  table  is  constructed  the  same  way  as  in  the  case 
for  a  single  agent.  After  the  path  is  found,  the  orien¬ 
tations  of  the  segments  are  determined  by  first  setting 
them  to  the  ones  perpendicular  to  the  direction  of  move¬ 
ment,  and  then  smoothing  them  to  avoid  large  changes 
between  consecutive  samples. 

2.3  Choosing  the  Subgoal 

The  choice  of  a  subgoal  depends  upon  the  specific  mis¬ 
sion  in  each  problem  instance.  In  the  bounding  over¬ 
watch  case,  the  following  criteria  are  considered: 

1.  Reachability:  it  must  be  reachable  at  the  end  of  the 
subplan  interval; 

2.  Configuration:  its  location  should  be  ahead  of  the 
current  observer  by  an  adequate  distance  and  within 
a  corridor  predetermined  for  the  entire  movement  so 
that  the  bounding  overwatch  pattern  can  be  main¬ 
tained; 

3.  Path  quality:  it  should  be  reachable  by  a  path  that 
is good in terms of safety and intervisibility;

4.  Future  safety:  it  should  be  safe  for  the  next  subplan 
interval; 

5.  Observability:  it  should  have  good  observation 
points  in  its  vicinity  for  monitoring  the  movement 
of  the  adversaries  in  the  next  stage. 

The  reachability  and  configuration  criteria  are  satis¬ 
fied  by  considering  only  cells  with  path  credit  greater 
than  zero  and  ahead  of  the  observers  by  about  half  of 
the  maximum  distance  the  agents  can  travel  in  a  subplan 
interval.  Path  quality  is  evaluated  by  the  path  credit  ta¬ 
ble.  The  future  safety  and  observability  criteria  require 
computation  of  the  visibility  from  these  candidate  cells 
to  the  predicted  trajectories  of  the  adversaries  in  the 
next  stage.  As  the  model  of  the  adversary  movements 
degrades  with  time,  the  possible  trajectories  of  the  ad¬ 
versaries  usually  span  a  fairly  large  region.  Since  the 
visibility  computation  is  expensive  and  these  candidate 
cells  usually  cluster  in  regions,  a  region-to-region  vis¬ 
ibility  analysis  algorithm  was  developed  [6]  to  achieve 
much  faster  computation  than  applying  the  point-to- 
region  visibility  analysis  to  each  candidate  cell.  This 
algorithm  is  briefly  described  in  the  next  section.  After 
these  computations,  the  subgoal  is  chosen  from  the  can¬ 
didate  cells  by  combining  all  the  criteria  using  a  weighted 
sum  and/or  thresholds. 
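
One possible reading of this selection step is sketched below. The candidate fields, the weights, and the `min_ahead` threshold are all hypothetical placeholders, since the text does not give the actual weights or thresholds.

```python
def choose_subgoal(candidates, weights=(1.0, 1.0, 1.0), min_ahead=0.0):
    """Pick the candidate cell with the best weighted score, or None.

    candidates : list of dicts with keys 'cell', 'path_credit', 'ahead',
                 'future_safety', 'observability' (hypothetical fields)
    """
    w_path, w_safe, w_obs = weights
    best_cell, best_score = None, float("-inf")
    for c in candidates:
        # Thresholds: reachability (credit > 0) and configuration (ahead).
        if c["path_credit"] <= 0 or c["ahead"] < min_ahead:
            continue
        score = (w_path * c["path_credit"]
                 + w_safe * c["future_safety"]
                 + w_obs * c["observability"])
        if score > best_score:
            best_cell, best_score = c["cell"], score
    return best_cell
```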

3  Algorithms  for  Visibility  Analysis 

In  this  section,  we  describe  the  algorithms  we  used  for 
the  three  visibility  analyses  in  our  planning  system:  the 
line-visibility analysis, the point-to-region visibility analysis
(PRVA), and the region-to-region visibility analysis
(RRVA). All algorithms are based on digital terrain
models  and  are  designed  for  massively  parallel  hypercube 
machines.  Hypercube  machines  provide  the  flexibility  to 
embed  a  mesh-connected  array  of  any  dimension  and  the 
efficiency  to  perform  parallel  prefix  operations  along  any 
axis  of  the  array  in  logarithmic  time.  These  advantages 
are  used  extensively  in  these  algorithms;  these  kinds  of 
parallel  prefix  operations  will  be  referred  to  as  scan  op¬ 
erations. 


[Figure 2 panel: Visibility along the line; vertical axis from 0 to 1]


Figure  2:  Computing  visibility  along  a  line.  The  X-axis 
represents  the  horizontal  distance  from  the  viewpoint. 
The  effect  of  the  middle  hill  is  shown  by  the  dashed  line 
in  the  elevation  plot. 


3.1  Visibility  along  a  line 

Visibility  between  two  points  is  defined  by  drawing  a 
line  between  the  two  points.  The  two  points  are  visible 
to  each  other  if  and  only  if  the  line  lies  completely  above 
the  terrain.  The  basic  parallel  algorithm  for  visibility 
analysis  is  to  compute  the  visibility  w.r.t.  a  given  view¬ 
ing  point  for  every  grid  cell  along  a  line  on  the  plane 
projection  of  the  terrain.  Suppose  we  have  a  list  of  pro¬ 
cessing  elements  (PEs)  representing  this  line.  Each  of 
them  contains  the  elevation  of  the  corresponding  grid 


cell  on  the  line  with  the  first  cell  as  the  viewing  point. 
The  visibility  of  each  grid  cell  from  the  viewing  point 
can  be  determined  by  first  computing  the  elevation  angle 
from  the  viewing  point  to  each  cell  and  then  comparing 
the  angle  with  the  maximal  angle  among  all  cells  closer 
to  the  viewing  point.  If  its  angle  is  greater  than  the 
maximal  angle  among  the  cells  before  it,  then  this  cell  is 
visible.  The  steps  are  illustrated  in  Figure  2.  The  total 
complexity is dominated by the computation of the maximal
angle before each cell, which can be done by a scan
operation in O(log W) time on a hypercube machine using
W processors, where W is the number of grid cells
along the line.
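
A sequential sketch of this test is given below, with `itertools.accumulate` standing in for the parallel max-scan. The function name and the `spacing` and `eye_height` parameters are assumptions for illustration. Slopes are compared instead of angles, which is equivalent because the horizontal distance is positive.

```python
from itertools import accumulate


def line_visibility(elev, spacing=1.0, eye_height=0.0):
    """Visibility of each cell along a line from the first cell.

    elev[0] is the viewing point; returns one boolean per cell.
    """
    z0 = elev[0] + eye_height
    # Tangent of the elevation angle from the viewpoint to each cell.
    slopes = [(z - z0) / (i * spacing)
              for i, z in enumerate(elev[1:], start=1)]
    # Inclusive max-scan, shifted by one cell to obtain the maximum
    # over the cells strictly closer to the viewpoint.
    inclusive = list(accumulate(slopes, max))
    before = [float("-inf")] + inclusive[:-1]
    # A cell is visible when its slope exceeds every earlier slope.
    return [True] + [s > m for s, m in zip(slopes, before)]
```

With the strict comparison used in the text, a cell whose line of sight just grazes an earlier cell is reported as hidden.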

3.2  Point-to-region  visibility  analysis 

The  PRVA  algorithm  [1]  computes  the  visibility  from  a 
viewing  point  to  a  region.  Since  the  underlying  terrain  is 
represented  by  an  array  of  processors,  each  correspond¬ 
ing  to  a  grid  cell  of  the  terrain,  the  result  is  returned  as 
a  visibility  map  by  indicating  the  visibility  of  each  grid 
cell  with  a  flag  in  the  associated  PE.  For  computational 
efficiency,  the  region  is  assumed  to  be  an  upright  rect¬ 
angle.  For  a  region  of  arbitrary  shape,  we  may  obtain 
the  maximal  and  minimal  X  and  Y  coordinates  of  the 
region  and  use  the  circumscribing  upright  rectangle. 

We  use  the  term  ray  to  refer  to  a  line  segment  with  a 
direction  on  the  plane  projection  of  a  terrain  geographi¬ 
cally,  while  it  refers  to  a  list  of  PEs  computationally.  The 
processor  structure  representing  a  set  of  rays  is  called  a 
ray  structure.  We  define  a  far  side  as  a  side  of  the  rect¬ 
angle  such  that  when  drawing  a  line  from  the  viewing 
point  to  any  non-end  point  on  that  side,  the  line  will 
pass  through  the  interior  of  the  rectangle.  The  minimal 
set  of  lines  covering  all  grid  cells  in  a  rectangular  box 
will  be  the  lines  from  the  viewing  point  to  all  grid  points 
on  the  far  sides.  Therefore  the  PRVA  can  be  done  by 
constructing  a  ray  structure  for  each  far  side  of  the  rect¬ 
angle  and  running  the  line-visibility  algorithm  for  each 
ray  in  the  ray  structures. 

Let  L  be  the  length  of  a  side  of  the  rectangle,  and  W  be 
the  maximal  number  of  grid  cells  on  a  single  ray  to  this 
far  side.  An  LxW  2-D  array  of  processors  is  allocated  for 
the  ray  structure,  as  shown  in  Figure  3.  Each  row  in  the 
ray  structure  corresponds  to  a  ray.  After  broadcasting 
the coordinates of the viewing point and the end points
of  the  far  side,  each  processor  can  find  its  correspond¬ 
ing  grid  cell  by  the  digital  differential  analyzer  (DDA) 
technique  [3],  and  thus  obtain  the  elevation  data.  The 
line-visibility  algorithm  is  then  conducted  along  all  rays 
simultaneously.  The  result  will  be  sent  back  to  its  corre¬ 
sponding  grid  cell.  If  several  results  are  sent  back  to  the 
same  grid  cell,  they  are  combined  by  an  OR  operation. 
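
The procedure can be imitated sequentially as follows. This is a sketch under several assumptions: the helper names are hypothetical, the rectangle is given by inclusive bounds, the DDA uses simple rounding, and a Python set plays the role of the OR-combination of results arriving at the same cell.

```python
from math import hypot


def dda_ray(v, target):
    """Grid cells from v to target, one per step along the major axis."""
    (vx, vy), (tx, ty) = v, target
    n = max(abs(tx - vx), abs(ty - vy))
    if n == 0:
        return [v]
    return [(vx + round(i * (tx - vx) / n), vy + round(i * (ty - vy) / n))
            for i in range(n + 1)]


def far_side_cells(v, rect):
    """Grid cells on the far sides of rect = (x0, y0, x1, y1) as seen from v."""
    x0, y0, x1, y1 = rect
    vx, vy = v
    cells = set()
    if vx < x1:
        cells.update((x1, y) for y in range(y0, y1 + 1))
    if vx > x0:
        cells.update((x0, y) for y in range(y0, y1 + 1))
    if vy < y1:
        cells.update((x, y1) for x in range(x0, x1 + 1))
    if vy > y0:
        cells.update((x, y0) for x in range(x0, x1 + 1))
    return cells


def prva(elev, v, rect, eye_height=0.0):
    """Set of cells of rect visible from v; elev is indexed as elev[y][x]."""
    x0, y0, x1, y1 = rect
    z0 = elev[v[1]][v[0]] + eye_height
    visible = set()
    for target in far_side_cells(v, rect):
        ray = dda_ray(v, target)
        step = hypot(target[0] - v[0], target[1] - v[1]) / (len(ray) - 1)
        max_slope = float("-inf")
        for i, (x, y) in enumerate(ray[1:], start=1):
            slope = (elev[y][x] - z0) / (i * step)
            if slope > max_slope and x0 <= x <= x1 and y0 <= y <= y1:
                visible.add((x, y))  # set union acts as the OR combination
            max_slope = max(max_slope, slope)
    return visible
```

Cells of a ray that lie outside the rectangle still contribute to the running maximum, since terrain between the viewing point and the region can occlude it.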

The complexity of this parallel algorithm is

$$O(LW \log W + LW \times \mathit{Comm})$$ operations

with minimal time

$$O(\log W + \mathit{Comm})$$

using L x W processors, where Comm stands for the complexity
of a global communication.
Figure 3: Ray and ray structure. (a) The rays are shown
by line segments between the viewing point V and side
a of the rectangle. The ray structure consists of all rays
for this side and is mapped to a triangle on the terrain
map. Note that only sides a and b are far sides. (b) A
2D array is allocated as the ray structure.


As  we  can  see  from  Figure  3,  many  PEs  in  different 
rays  may  be  mapped  to  the  same  terrain  cell.  Thus  con¬ 
current  reads/writes  occur  in  the  global  communication 
between  the  two  processor  structures.  An  improvement 
can  be  made  to  eliminate  the  concurrent  reads/writes. 
It  can  be  shown  that  all  PEs  that  are  mapped  to  the 
same  terrain  cell  have  the  same  element  index  and  con¬ 
secutive  ray  indices.  Using  this  coherence  property,  the 
PEs  can  be  grouped  by  the  terrain  cell  they  are  mapped 
to,  and  only  one  from  each  group  will  participate  in  the 
global  communication.  In  each  group,  the  data  can  be 
distributed  to  or  combined  from  every  member  by  seg¬ 
mented  scan  operations.  As  this  operation  is  conducted 
along the dimension of size L, and we expect L to be
O(W), it will not increase the complexity.

3.3  Region-to-region  visibility  analysis 

The  region-to-region  visibility  analysis  (RRVA)  problem 
is defined as follows: given a source region S and a destination
region D on the terrain, compute a measure of visibility
from region S to region D. In terms of discretized
geometry,  this  problem  can  be  restated  as  computing 
for  every  point  in  S  the  number  of  visible  points  in  D. 
In  the  RRVA  algorithm,  both  the  source  and  the  des¬ 
tination  regions  are  assumed  to  be  upright  rectangles, 
but  they  can  be  in  any  relationship,  e.g.,  they  can  be 
overlapping or even identical. From the previous subsection,
there evidently exists a brute-force solution by
applying the PRVA to each grid cell in the source region.
The complexity of this brute-force algorithm is
$O(L_S^2 L_D W \log W + L_S^2 L_D W \times \mathit{Comm})$ operations, using
up to $L_S^2 L_D W$ processors, where $L_S$ and $L_D$ are the linear
sizes of the source and the destination respectively.
Judging  from  the  usual  problem  size,  no  existing  ma¬ 
chines  can  provide  this  number  of  processors.  Even  if 
such  a  machine  existed,  the  number  of  concurrent  reads 
would  make  this  algorithm  virtually  impractical.  There¬ 
fore,  we  have  to  stage  the  analysis  in  several  iterations, 
each  of  which  computes  a  subproblem  of  the  entire  anal¬ 
ysis.  Since  the  communication  patterns  among  iterations 
will  be  redundant,  we  identify  some  important  coherence 
properties  and  explain  how  they  can  be  used  to  improve 
the  efficiency  of  communication. 
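
As a sequential reference for the quantity the RRVA returns, the sketch below counts, for every source cell, the destination cells it can see. It checks each source and destination pair directly rather than running the PRVA per source cell, so it is a stand-in for the brute-force approach and not the algorithm of this paper. The function names, the inclusive rectangle bounds, and the `eye_height` parameter are assumptions.

```python
def line_of_sight(elev, a, b, eye_height=0.0):
    """True if cell b is visible from cell a; elev is indexed as elev[y][x]."""
    (ax, ay), (bx, by) = a, b
    n = max(abs(bx - ax), abs(by - ay))
    if n == 0:
        return True  # assumed convention: a cell sees itself
    z0 = elev[ay][ax] + eye_height
    # Slopes are measured per DDA step; the step length is constant
    # along one ray, so the comparison matches a comparison of angles.
    target_slope = (elev[by][bx] - z0) / n
    for i in range(1, n):
        x = ax + round(i * (bx - ax) / n)
        y = ay + round(i * (by - ay) / n)
        if (elev[y][x] - z0) / i >= target_slope:
            return False
    return True


def rrva_brute_force(elev, src, dst, eye_height=0.0):
    """Map each source cell to its number of visible destination cells.

    src, dst : upright rectangles (x0, y0, x1, y1) with inclusive bounds
    """
    sx0, sy0, sx1, sy1 = src
    dx0, dy0, dx1, dy1 = dst
    counts = {}
    for sy in range(sy0, sy1 + 1):
        for sx in range(sx0, sx1 + 1):
            counts[(sx, sy)] = sum(
                line_of_sight(elev, (sx, sy), (x, y), eye_height)
                for y in range(dy0, dy1 + 1)
                for x in range(dx0, dx1 + 1))
    return counts
```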


One important coherence property is depicted in Figure 4,
which shows that the ray from the source point $(x_s, y_s)$
to the destination point $(x_d, y_d)$ is enclosed by
the triangle formed by the rays from $(x_s - 1, y_s)$ and
$(x_s - 1, y_s + 1)$ to the same destination point.* The
two rays are referred to as the parent and the guardian
respectively, as shown in the figure. Furthermore, the
vertical width of the triangle formed by the parent and
the guardian is always less than 1 from $x_s$ to $x_d$. If
the  visibility  analysis  is  conducted  for  a  column  of  the 
source  at  a  time,  and  sweeps  from  left  to  right  using  the 
same  ray  structure,  then  the  PEs  in  this  ray  should  be 
able  to  obtain  the  elevation  data  of  the  corresponding 
terrain  cells  from  one  of  the  two  rays  for  the  previous 
column  using  only  local  communications  within  the  ray 
structure. 

* For a better mapping between the geometry in the figures
and the ordering in the data structure, a screen coordinate
system is adopted in all figures, i.e., the origin is in the upper
left corner and the Y-coordinates increase downward.

Figure 5: Extended source; the thick dashed line shows
the extended initial strip, while the thin dashed lines
show how the strip shrinks during the sweep.

This observation suggests the idea of a sweeping algorithm,
which sweeps across the source horizontally
and/or  vertically.  Since  the  above  property  holds  only 
when the emitting angle, θ in the figure, is no more than
45  degrees,  a  partition  is  defined  to  divide  the  rays  into 
four  sweeps,  namely  the  East,  West,  North,  and  South 
sweeps,  so  that  the  emitting  angle  is  always  no  more  than 
45  degrees.  In  each  sweep,  the  analysis  is  conducted  strip 
by strip, which is a row or a column of cells that is perpendicular
to the sweeping direction. The elevation data
are obtained by global communications only for the first
strip of each sweep and are passed within the ray structure
for the subsequent strips. It can be proved that
a  ray,  its  parent,  and  its  guardian  belong  to  the  same 
sweep  if  they  exist.  In  order  to  ensure  the  existence  of 
the  guardian  for  all  rays,  it  may  be  necessary  to  extend 
the  source  region  by  the  side  length  along  the  sweeping 
direction  minus  one,  and  to  start  the  sweep  with  the  ex¬ 
tended  initial  strip.  Figure  5  shows  one  such  situation. 
Using  these  properties,  the  algorithm  is  outlined  as  fol¬ 
lows: 

Algorithm:  RRVA 
begin 

for  all  grid  cells  in  the  source  region 
set  visible-count  to  0; 

for  each  sweep  if  it  is  necessary 
construct  the  3D  ray  structure; 
for  all  PEs  in  the  ray  structure 

compute  the  corresponding  grid  cell  on  the 
terrain  for  the  initial  strip; 
get  the  elevation  data  from  the  terrain  map 
for  the  initial  strip; 
for  each  strip 

compute  the  visibility  along  each  ray; 
combine  the  result  in  the  ray  structure; 
update  the  ray  structure  for  the  next 
strip; 

end  for  each  strip 

send  the  result  back  to  the  source  cells  on  the 
terrain; 

end  for  each  sweep 
end  Algorithm 
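
The partition of rays into the four sweeps can be illustrated with a small classifier. This is a sketch under assumptions: the function name is hypothetical, a sweep is taken to be named after the dominant direction of the ray in screen coordinates (y increasing downward), and rays at exactly 45 degrees are arbitrarily assigned to the horizontal sweeps.

```python
def sweep_of(source, dest):
    """Return the sweep ('East', 'West', 'North', 'South') for a ray.

    Points are (x, y) in screen coordinates, so larger y means further south.
    Returns None for a degenerate ray whose endpoints coincide.
    """
    dx, dy = dest[0] - source[0], dest[1] - source[1]
    if dx == 0 and dy == 0:
        return None
    # The emitting angle relative to the sweeping direction is at most
    # 45 degrees exactly when that axis carries the larger displacement.
    if abs(dx) >= abs(dy):
        return "East" if dx > 0 else "West"
    return "South" if dy > 0 else "North"
```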


Figure  6:  The  ray  structure  for  RRVA:  (a)  the  relevant 
elements  on  the  terrain  map;  (b)  a  3D  array  of  processors 
is  allocated  as  the  ray  structure. 


Visible-count  is  the  memory  location  in  each  terrain 
cell  that  returns  the  number  of  visible  destination  cells. 
The  algorithm  consists  of  three  major  parts: 

1.  Construction  of  the  ray  structure 

The  ray  structure  contains  all  rays  from  each  source 
cell  on  the  (possibly  extended)  initial  strip  to  every 
grid  cell  along  the  far  sides  of  the  destination  that 
belongs  to  the  sweep.  Since  a  ray  actually  contains 
a  list  of  processors,  the  entire  ray  structure  is  a  3D 
array of processors, which is indexed by three indices,
as shown in Figure 6: u is the index for the
grid cells along the strip in the source region, v is
the index for the grid cells along the far sides of the
destination region, and w is the index of grid cells
along the ray.

When  the  value  of  one  index  is  fixed,  the  other  two 
indices  specify  a  2D  slice  of  the  structure.  For  ex¬ 
ample,  a  v-w  slice  refers  to  the  2D  array  for  a  fixed 
value  of  u.  It  corresponds  to  all  rays  starting  at  the 
same  source  point  and  is  actually  a  part  of  the  2D 
ray  structure  for  the  PRVA  of  that  source  point.  We 
also  use  the  terms  u-neighbors,  v-neighbors,  and  w- 
neighbors  to  specify  the  neighboring  elements  along 
each  axis. 
Figure 8: The configuration of the group and the actual
location  of  each  agent  along  the  path  of  the  second  stage. 

2.  Obtaining  the  elevation  data  for  the  initial  strip 
For  each  PE  in  the  ray  structure,  its  correspond¬ 
ing  grid  cell  on  the  terrain  can  be  computed  in  a 
way  similar  to  that  in  the  PRVA,  and  the  elevation 
data  can  then  be  obtained  by  global  communication. 
The  grouping  technique  in  our  PRVA  algorithm  is 
applied  to  alleviate  the  congestion  from  concurrent 
reads.  However,  it  cannot  be  eliminated  completely: 
in  the  worst  case,  the  number  of  concurrent  reads  is 
reduced to the range of u.

3.  The  strip-wise  iterations 

The  visibility  analysis  is  conducted  along  all  rays  in 
parallel  in  the  beginning  of  the  loop.  The  results  are 
combined into a count of visible destination cells for
each  source  cell  on  the  strip.  Special  care  should  be 
taken  in  combining  these  results  to  avoid  duplicated 
counting.  This  count  is  stored  in  the  first  active  PE 
of  the  first  active  ray  for  each  source  cell.  The  ray 
structure  is  then  updated  for  the  analysis  of  the  next 
strip.  The  steps  are  shown  below: 

Updating  the  ray  structure  for  the  next  strip 
begin 

de-activate  the  first  active  u-v  slice; 
for  all  active  PEs  in  the  ray  structure 

change  the  starting  point  of  each  ray  to  the 

next  grid  point  along  the  sweeping  direction; 
compute  the  new  corresponding  terrain 
cell; 

if  it  is  different  from  the  previous  one 
if  there  is  a  u-neighbor  previously 
corresponding  to  this  terrain  cell 
then 

get  the  elevation  data  from  this 
u-neighbor; 

else 


de-activate  the  entire  ray  the  PE 
is  in; 

end  if 
end  for  all 

de-activate  rays  whose  emitting  angles  are 
greater  than  45  degrees; 

end 

After  the  iterations,  the  counts  of  visible  destination 
cells  stored  in  the  ray  structure  are  sent  back  to  the 
source  cell  on  the  terrain  and  are  added  to  the  results 
from  other  sweeps.  It  is  an  exclusive  write  operation 
since  only  one  PE  in  the  ray  structure  was  designated  to 
keep  the  count  for  each  source  cell  and  only  these  PEs 
will  participate  in  this  communication. 

The resulting complexity is

$$O(L_S^2 L_D W \log W + L_S L_D W \times \mathit{Comm}_{read} + L_S^2 \times \mathit{Comm}_{write})$$ operations

with a minimal

$$O(L_S \log W + \mathit{Comm}_{write} + \mathit{Comm}_{read})$$ time,

using up to $k L_S L_D W$ processors, where $k$ is a constant
determined by the number of far sides and the number
of extended sources.

The  bottleneck  of  the  computation  derives  from  global 
communication,  and  its  reduction  is  achieved  in  three 
ways:  the  number  of  occurrences  of  global  communi¬ 
cation  operations  is  reduced  to  only  two  (one  for  read 
and  one  for  write),  the  total  number  of  such  operations 
is  reduced  by  a  factor  of  Ls,  and  the  congestion  due 
to concurrent read/write operations is minimized. The
computation  part  is  the  same  as  the  PRVA  algorithm, 
since  the  same  number  of  rays  is  allocated  for  each  point 
and  the  computation  along  each  ray  remains  the  same. 
Experimental  results  showing  the  improvements  due  to 
the  reduction  in  communication  complexity  can  be  found 
in  [6]. 

4  Implementation  and  Experimental 
Results 

In  this  section,  we  present  the  results  of  our  implementa¬ 
tion  of  the  algorithms.  All  experiments  were  conducted 
on  a  Connection  Machine  CM-2  using  8K  processors. 
The  terrain  size  is  512x512  grid  cells,  and  the  terrain 
cells  are  mapped  to  a  two-dimensional  array  of  virtual 
processors. 

Figure 7 illustrates an example of the planning algorithm,
which consists of eight subplans. Each subplan is
illustrated  by  marks  on  the  terrain  map  indicating  the 
centers  of  the  groups,  and  the  grey  level  in  the  back¬ 
ground  shows  the  elevation  at  each  pixel:  brighter  pixels 
are  higher.  The  observer  is  indicated  by  an  “X”,  and  the 
initial  location  of  the  group  being  planned  is  marked  by 
a  square  with  a  dot  in  its  center.  The  line  connecting 
the  square  to  a  small  white  dot  is  the  path  planned  for 
the  center  of  the  group,  and  the  small  white  dot  at  the 
end  of  the  path  indicates  the  subgoal  location.  The  large 
solid  white  square  near  the  bottom  of  each  figure  shows 
Figure 9: The source and destination regions in the
region-to-region visibility analysis of the third stage.

Figure 10: The visibility from the vicinity of the subgoal
to the destination regions in the third stage.

the location of the final goal. The adversary locations in
the  beginning  of  each  subplan  are  indicated  by  crosses, 
and  the  emanating  lines  illustrate  their  predicted  trajec¬ 
tories.  In  the  figure  for  the  first  stage,  both  groups  are 
located  at  the  same  initial  position. 

After  finding  the  path  for  the  center,  the  locations  of 
the  two  agents  are  determined  by  the  direction  of  motion 
and  the  spacing  between  agents.  Figure  8  enlarges  the 
region  around  the  path  of  the  second  stage  and  shows  the 
configuration  of  the  group  as  line  segments.  The  agents 
are  located  at  the  end  points  of  each  line  segment.  The 
safety  and  intervisibility  of  each  agent  are  indicated  by 
the  brightness  and  the  shape  of  the  marks  at  the  agent 
locations.  A  white  dot  means  a  path  point  at  which 
the  agent  is  safe  and  visible  to  its  partner.  A  black  dot 
means  a  path  point  at  which  the  agent  is  neither  safe 
nor  visible  to  its  partner.  A  white  i.  indicates  an  agent 
location  that  is  safe  but  not  visible  to  its  partner,  while 
a  black  ±  indicates  an  agent  location  that  is  not  safe  but 
visible  to  its  partner. 

The  key  computation  in  the  selection  of  the  subgoal 
is  the  region-to-region  visibility  analysis.  Figure  9  shows 
the  regions  considered  in  the  region-to-region  visibility 
analysis  in  the  third  stage.  The  source  region  is  the 
small  set  of  white  points  to  the  right  of  the  observer  (the 
“X”  mark),  which  consists  of  the  candidate  cells  for  the 
subgoal.  The  destination  region  is  the  region  spanned 
by  the  possible  adversary  trajectories  of  the  next  stage: 
a  circumscribing  upright  rectangle  is  used.  The  subgoal 
is  chosen  using  the  visibility  information.  The  visibility 
from  the  vicinity  of  the  subgoal  to  the  destination  region 
is  shown  in  Figure  10  by  the  white  area.  The  experi¬ 
mental  result  shows  that  the  planning  system  effectively 
returns  a  fairly  good  path. 
Abstract

A  robot  that  functions  in  real  world  situations  must  be 
equipped  to  handle  randomness  in  its  environment.  The 
purpose  of  this  research  is  to  employ  multiple  control 
laws  to  enable  an  IBM  7565  robot  to  move  through  a 
constrained  environment  and  recognize  the  objects  it  en¬ 
counters.  The  robot  is  equipped  with  a  force/torque 
sensor  on  its  end  effector.  No  prior  information  is  used 
by  the  robot.  While  exploring  the  workspace,  when  the 
robot encounters an object, dual-drive force/velocity
control  and  force  feedback  enable  it  to  track  each  object’s 
surface  and  collect  position  information  that  is  used  for 
recognition.  Experimental  results  are  presented. 

1  Overview 

Tactile  sensing  and  force  control  have  been  employed  to 
enable  a  robotic  system  to  gather  data  about  its  envi¬ 
ronment  in  order  to  perform  tasks  and/or  to  identify 
objects. The research of Khatib [1] is concerned with
real-time  motion  and  force  control  of  robot  systems  to 
accomplish  finger  sensing  and  assembly  operations.  The 
research  of  Allen  [2]  employs  tactile  sensing  along  with 
computer  vision  to  recover  the  shape  of  an  object  by  us¬ 
ing  sparse  contact  data  points.  Tactile  sensing  in  [2]  is 
directed  by  prior  knowledge  of  the  position  of  an  object. 

This  paper  will  present  a  three  dimensional  object 
tracking  algorithm  that  has  been  designed  with  the 
intent  to  be  as  object  independent  as  possible.  Ex¬ 
periments  are  conducted  by  placing  different  three  di¬ 
mensional  objects  in  the  workspace.  The  robot  moves 
on  a  search  path  until  an  object  is  encountered.  A 
force/velocity dual-drive controller is then implemented
which moves the robot along the surface of the
object  without  prior  knowledge  of  its  shape  or  config¬ 
uration  within  the  workspace.  Planar  slices  of  tracked 
points  are  collected  in  order  to  recognize  an  object. 

The  dual-drive  control  algorithm  is  implemented  on  an 
IBM  7565  robot  equipped  with  a  six  axis  strain  gauge 
sensor  on  its  end  effector.  Force  sensor  data  has  an  ad¬ 
vantage  over  machine  vision  in  certain  environments;  i.e. 
when  the  lighting  is  poor  or  if  the  work  area  is  cluttered. 


*This work was partially funded by the NSF in conjunction
with the Advanced Research Projects Agency of the
Department of Defense under Contract No. IRI-8905436.


These  situations  make  it  difficult  to  extract  reliable  in¬ 
formation  from  vision  algorithms. 

The  tracking  algorithm  is  tested  using  simple  and  com¬ 
plex  real-world  objects.  The  tracking  method  facilitates 
the  collection  of  dense  data  sets  that  can  be  used  for 
object  recognition. 

2  Prior  Results 

SIERA (System for Implementing and Evaluating
Robotic Algorithms) provides the research environment
for the IBM 7565 [3]. The system allows the user to interactively
modify any robot related function and change
the  control  law  to  suit  the  current  task.  In  SIERA,  a 
separate  computer  system  is  used  for  each  task  to  best 
match  the  hardware  to  the  task  requirement.  Real-time 
processing  is  done  on  the  bus-based  RTSS  (Real  Time 
Servo  System).  All  controller  modifications  run  on  the 
RTSS.  The  fast  processing  time  enables  the  robot  to  re¬ 
spond  quickly  to  force  feedback  information.  When  more 
computational  power  is  required,  such  as  for  graphic 
displays  and/or  object  classification  programs,  the  link- 
based  Armstrong  Multiprocessor  is  used  [4]. 

Multiple  control  laws  are  employed  to  enable  the  robot 
to  perform  functions  in  its  workspace.  The  controllers 
are  supervised  by  an  observer  program.  The  observer  is 
in  charge  of  activating  real-time  controllers  according  to 
force  feedback  information. 

When the robot is tracking an object, a dual-drive
force/velocity controller is active [5]. Force feedback is
used  to  determine  the  surface  normal  and  velocity  feed¬ 
back  for  the  surface  tangent.  The  dual-drive  controller 
defines  the  signs  of  the  vectors  relative  to  a  point  in¬ 
side  the  tracking  path.  In  Figure  1,  motion  is  defined  to 
be positive when the vector V is counterclockwise with
respect  to  the  inner  point  P.  The  force  vector  F  is  posi¬ 
tive  when  it  is  in  the  same  direction  as  R.  The  relation 
between  the  velocity  and  the  force  is  orthogonal  if  the 
tracking  surface  is  sufficiently  stiff  and  the  friction  be¬ 
tween  the  surface  of  the  object  and  the  end  effector  is 
negligible.  If  the  surface  is  frictionless,  the  normal  will 
be  in  the  same  direction  as  the  force  vector  F.  Using 
force  sensor  readings,  the  outward  unit  normal  and  unit 
tangent  vectors  are  given  by, 

$$N = \frac{1}{|F|}\,(F_x, F_y)$$

and

$$T = \frac{1}{|F|}\,(-F_y, F_x).$$

Figure 1: Surface Tracking with the Dual-Drive Controller

In the case of a stiff surface, the velocity due to force
corrections is small and V is in the tangential direction.
The outward unit normal and unit tangent vectors are
given by,

$$N = \frac{1}{|V|}\,(V_y, -V_x)$$

and

$$T = \frac{1}{|V|}\,(V_x, V_y),$$

where $|F| = \sqrt{F_x^2 + F_y^2}$ and $|V| = \sqrt{V_x^2 + V_y^2}$. These
four equations for the unit normal and unit tangent vectors
give four different dual-drive implementations [5].
When the inner point P cannot be specified, the signs of
the velocity and force vectors are computed by noting
the force at the end effector and the actual trajectory
of the robot. This is necessary in applications where
the shape of the object, and thus the complete tracking
path, are unknown [6].
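
The two pairs of unit vectors can be computed directly from planar sensor readings. The following is a minimal Python sketch, not the controller code that runs on the RTSS. The function names are hypothetical, and the force-based tangent is taken to be the normal rotated by +90 degrees, an assumption chosen so that it agrees with the velocity-based pair for counterclockwise motion about the inner point P.

```python
import math


def unit_vectors_from_force(fx, fy):
    """Return (N, T) from planar force readings (Fx, Fy).

    N is the outward unit normal F / |F|. T is N rotated by +90 degrees,
    which is an assumption made here for counterclockwise tracking.
    """
    mag = math.hypot(fx, fy)
    if mag == 0.0:
        raise ValueError("zero force reading: surface normal is undefined")
    n = (fx / mag, fy / mag)
    t = (-n[1], n[0])
    return n, t


def unit_vectors_from_velocity(vx, vy):
    """Return (N, T) from planar velocity readings (Vx, Vy).

    T is the unit tangent V / |V| and N = (Vy, -Vx) / |V| is the
    outward unit normal for counterclockwise tracking.
    """
    mag = math.hypot(vx, vy)
    if mag == 0.0:
        raise ValueError("zero velocity reading: tangent is undefined")
    t = (vx / mag, vy / mag)
    n = (t[1], -t[0])
    return n, t
```

For a robot at the point (1, 0) moving counterclockwise about the origin, both functions yield the same normal (1, 0) and tangent (0, 1), whether the input is a force along +X or a velocity along +Y.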

When the robot is in the dual-drive mode, the controller
is complemented by a force-check controller
[6]. The force threshold check sets an interrupt when
the robot loses contact while tracking. This is likely
to happen when the robot encounters a sudden change
in surface orientation, i.e. an edge. When the interrupt
is set, the dual-drive controller is switched to a surface
searching routine. The robot moves in an outward
turning  square  spiral  until  contact  is  regained.  The  an¬ 
gle  between  the  surface  normal  and  the  end  effector  is 
checked  and  if  necessary,  the  end  effector  is  reoriented, 
at  which  time  surface  following  continues.  The  stored 
position  feedback  signals  that  are  collected  during  the 
tracking  can  be  used  by  a  classifier  program  to  identify 
the  shape  and  location  of  an  object. 
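
The surface searching routine can be pictured as follows. This sketch generates the waypoints of an outward-turning square spiral and stops at the first waypoint where contact is reported. The step size, the `in_contact` callback and the function names are illustrative assumptions; on the robot the contact test would come from the force threshold check.

```python
def square_spiral(start, step, max_points):
    """Return up to max_points waypoints of an outward-turning square spiral.

    The leg length grows by one step after every two legs, so the path
    winds outward around the start point in the X-Y plane.
    """
    x, y = start
    directions = [(1, 0), (0, 1), (-1, 0), (0, -1)]
    points = []
    leg_len = 1
    d = 0
    while len(points) < max_points:
        for _ in range(2):  # two legs share the same length
            dx, dy = directions[d % 4]
            for _ in range(leg_len):
                x += dx * step
                y += dy * step
                points.append((x, y))
                if len(points) == max_points:
                    return points
            d += 1
        leg_len += 1
    return points


def search_for_contact(start, step, in_contact, max_points=200):
    """Walk the spiral and return the first waypoint where in_contact(p) is true.

    Returns None if contact is not regained within max_points waypoints.
    """
    for p in square_spiral(start, step, max_points):
        if in_contact(p):
            return p
    return None
```

Bounding the search with `max_points` is a design choice of this sketch, so that a lost surface cannot drive the arm indefinitely far from the point where contact was lost.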

The  controller  Fcomply  is  active  when  the  robot  is 
moving  along  a  specified  “search”  trajectory,  and  it  is  not 
in contact with a surface [5]. Fcomply is a force compliant
damping controller. When an object is encountered,
the damping controller converts a desired force value,
Fd, to a velocity vector, V, by a damping constant, B.

Figure 2: Planar Tracking Paths


The  Fcomply  controller  also  has  velocity  and  position 
saturation  to  keep  the  robot  at  a  safe  velocity  level 
within  the  workspace.  When  Fcomply  is  active,  the 
robot moves along a prescribed trajectory
until a force threshold is exceeded, which results
when  the  end  effector  comes  in  contact  with  an  object. 
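
The behaviour of a damping controller of this kind can be sketched in a few lines. The exact control law is not reproduced in the text above, so the form V = (Fd - F) / B used below is an assumption (a common damping law), as are the function names and the scalar speed limit standing in for the velocity saturation.

```python
import math


def fcomply_velocity(f_desired, f_measured, damping, v_max):
    """Convert a planar force error into a saturated velocity command.

    Assumed law: V = (Fd - F) / B, with the speed clipped to v_max.
    """
    if damping <= 0.0:
        raise ValueError("damping constant B must be positive")
    vx = (f_desired[0] - f_measured[0]) / damping
    vy = (f_desired[1] - f_measured[1]) / damping
    speed = math.hypot(vx, vy)
    if speed > v_max:
        scale = v_max / speed  # keep the direction, limit the magnitude
        vx, vy = vx * scale, vy * scale
    return (vx, vy)


def contact_detected(f_measured, threshold):
    """Report contact when the measured force magnitude exceeds the threshold."""
    return math.hypot(f_measured[0], f_measured[1]) > threshold
```

In free space the measured force is zero, so the command reduces to Fd / B; once `contact_detected` returns true, a supervising program would hand control over to the tracking controller.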

These  controllers  have  been  successfully  employed  for 
a  two  dimensional  object  recognition  demonstration  [6]. 
In  such  experiments,  the  IBM  7565  moves  on  a  prespec¬ 
ified  rectangular  path  through  the  workspace  which  en¬ 
ables  the  robot  to  sufficiently  explore  the  entire  environ¬ 
ment.  Objects  such  as  a  triangle,  a  quadrilateral  and  an 
ellipse  are  placed  in  the  workspace.  No  prior  informa¬ 
tion  about  the  shape  or  the  location  is  given.  Dual-drive 
control  enables  the  robot  to  track  and  to  recognize  the 
objects  it  encounters. 

3  Current  Research 

The  current  focus  of  this  research  has  been  directed  at 
expanding  the  two  dimensional  algorithm  to  enable  the 
robot  to  track  and  to  recognize  three  dimensional  ob¬ 
jects.  Initially,  the  objects  considered  are  spheres,  cylin¬ 
ders,  cubes,  cones  and  variations  of  these.  These  objects 
are  chosen  because  complex,  real  world  objects  are  often 
composed  of  these  basic  shapes. 

The  three  dimensional  object  tracking  algorithm  is 
meant  to  be  as  general  as  possible.  It  does  not  de¬ 
pend  on  an  object’s  orientation,  shape  or  location  in  the 
workspace;  i.e.  there  is  no  prior  knowledge  of  the  object’s 
shape  or  surface  contour.  In  our  initial  experiments,  all 
objects  are  tracked  using  the  two  dimensional  tracking 
algorithm.  The  robot  moves  along  an  object’s  surface  in 
a  plane  parallel  to  the  earth’s  surface,  and  planar  slices 
are taken at increasing levels in the vertical +Z direction.
Some examples of these simple objects are shown
in  Figure  2.  The  location  of  the  object  and  the  change 
in  shape  of  each  planar  slice  is  identified  from  the  infor¬ 
mation  gathered  during  the  side  tracking.  Information 
about the change in the surface contour in the vertical
+Z, or any direction, can be extrapolated from explicit
knowledge of the X, Y and Z position for each planar
slice. Given this information, the recognition program
is able to distinguish between simple objects, such as
spheres, cylinders, cones, and boxes.

Figure 3: Configurations of Gripper/End Effector

Figure 4: Planar Slices with Concave and Convex Discontinuities (Top View)

Ideally  the  robot  should  move  along  the  surface  of  an 
object  with  the  end  effector  oriented  along  the  surface 
normal,  since  this  orientation  implies  minimal  interfer¬ 
ence  from  friction.  However,  it  is  difficult  to  implement 
such  a  controller.  The  present  dual-drive  controller  must 
be reconfigured each time the orientation of the end effector
changes, and in the case of a cylinder the end effector
would have to be constantly reoriented. A safe threshold
for the magnitude of the angle between the tool and the
surface normal, |θ|, was found to be |θ| < 45°. The
angle |θ| is inferred from the force sensor readings and
the trajectory of the robot. Figure 3 illustrates how the
robot can track around an object using four basic configurations.
Therefore, an object is tracked with the end
effector in one configuration until |θ| > 45°, at which
point the end effector is reconfigured for tracking to continue.
In the workspace, these configurations point the
end effector in either the +Y, -X, -Y or +X direction.
Each  of  these  basic  configurations  requires  a  different 
version  of  the  dual-drive  controller. 
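
The 45° rule for switching between the four basic configurations can be illustrated with a short sketch. The sign convention is an assumption: θ is measured here from the tool axis to the inward surface normal, so θ = 0 when the tool points straight into the surface. The dictionary of configurations and the function names are hypothetical.

```python
import math

# Tool pointing directions for the four basic configurations.
CONFIGURATIONS = {
    "+Y": (0.0, 1.0),
    "-X": (-1.0, 0.0),
    "-Y": (0.0, -1.0),
    "+X": (1.0, 0.0),
}
THETA_MAX_DEG = 45.0


def tool_angle_deg(tool_dir, outward_normal):
    """Signed angle in degrees from the tool axis to the inward normal (-N)."""
    ax, ay = tool_dir
    bx, by = -outward_normal[0], -outward_normal[1]
    cross = ax * by - ay * bx
    dot = ax * bx + ay * by
    return math.degrees(math.atan2(cross, dot))


def needs_reconfiguration(config, outward_normal):
    """True when |theta| exceeds the 45 degree threshold for this configuration."""
    theta = tool_angle_deg(CONFIGURATIONS[config], outward_normal)
    return abs(theta) > THETA_MAX_DEG


def best_configuration(outward_normal):
    """Pick the configuration whose tool axis is closest to the inward normal."""
    return min(
        CONFIGURATIONS,
        key=lambda c: abs(tool_angle_deg(CONFIGURATIONS[c], outward_normal)),
    )
```

With the tool in the +Y configuration, a surface whose outward normal is (0, -1) gives θ = 0 and tracking continues; a surface whose outward normal is (1, 0) gives θ = 90° and calls for the -X configuration.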

For  three  dimensional  object  tracking,  when  the  robot 
encounters  an  object  in  the  workspace,  it  follows  the  side 
in  the  dual-drive  mode  with  the  end  effector  in  one  of  the 
basic configurations. At an edge, the magnitude of θ is
such  that  the  end  effector  must  be  reoriented  for  tracking 
to  continue.  At  this  time  the  dual-drive  controller  is  de¬ 
activated,  and  the  robot  moves  away  from  the  side  and 
reconfigures  the  end  effector.  The  Fcomply  controller 
moves  the  robot  toward  the  new  side  until  contact  is 
made,  then  an  appropriate  dual-drive  controller  is  reac¬ 
tivated.  This  continues  until  the  robot  has  completely 
encircled  the  object  in  the  X-Y  plane,  i.e.  when  the 
robot  has  tracked  around  the  object  in  all  four  configu¬ 
rations.  The  robot  then  moves  a  prespecified  amount  in 
the  +Z  direction  and  repeats  the  process. 

It should be noted that in order to ensure that the
gripper does not collide with an object, there is a limit to
the  amount  the  robot  arm  may  travel  when  approaching 
a  side.  The  measure  is  equal  to  the  length  of  the  end 
effector  (see  Figure  3).  Position  information  is  available 
while  the  robot  is  in  contact  with  an  object,  and  the 
point  where  the  end  effector  begins  the  approach  to  a  side 
(x,y) is recorded for each configuration. If contact is not
made during an approach, the robot attempts to make
contact in the next configuration. The approach begins
at the (x,y) position for that configuration, which was
recorded during the previous tracking pass. When the
robot is unable to make contact with the object, because
the  location  of  the  end  effector  exceeds  the  height  of  the 
object  in  all  of  the  four  basic  configurations,  tracking  in 
the  X-Y  plane  is  completed. 

Once  the  robot  is  able  to  track  simple  three  dimen¬ 
sional  objects,  more  realistic,  complex  objects  will  be 
employed.  There  are  certain  problems  associated  with 
the identification of many of these real objects. In particular,
each of the four basic configurations of the end
effector represents a progressive 90° increase from the
previous configuration. For example, when moving counterclockwise
the object's surface may be followed from
+Y to -X ..., +X to +Y ..., etc. depending on the
starting  point.  Obviously  not  all  objects  have  surface 
normals  whose  angle  only  increases  in  this  way.  The 
algorithm  must  deal  with  concave  and  convex  disconti¬ 
nuities  on  the  object’s  surface.  A  concave  discontinuity 
is defined as a temporary increase in θ for a side. It
can be thought of as an indentation in the planar slice
(see Figure 4). A convex discontinuity is defined as a
temporary decrease in θ for a side, θ < -45° (see Figure
4). Due to size considerations, the end effector cannot
be  reoriented  to  explore  a  concave  or  convex  discontinu¬ 
ity  on  the  surface  normal.  Therefore,  when  the  sensors 
indicate  that  the  end  effector  has  come  to  one  of  these 
surface  discontinuities,  the  active  dual-drive  controller 
is  deactivated  and  the  robot  switches  to  an  appropriate 
alternate tracking scheme. In the case of a concave discontinuity,
the robot moves on a trajectory across the
area until contact is made with the other side. It then
switches to a routine that searches for the point where
tracking can continue in the dual-drive mode. Using
Fcomply, the robot moves in an outward turning spiral
until contact is made. This contact point is tested to find
the angle the surface makes with the end effector. If the
angle is acceptable, the dual-drive controller is reactivated
and surface tracking continues. If not, the action
is repeated until either the search is successful or the
robot has reached an edge. The robot handles a convex
discontinuity in an analogous fashion. At the beginning
and at the end of the convex area, the robot searches
for a point where it can safely resume tracking with the
dual-drive controller.

Figure 5: Concave Discontinuities vs. Edges

A concave discontinuity, and the point where the end
effector must be reconfigured, i.e. an edge, are both indicated
when θ > 45°. Figure 5 illustrates this point.
Obviously,  successful  tracking  depends  on  the  ability  of 
the  robot  to  distinguish  between  these  two  instances.  In 
order to ensure that a concave discontinuity is always
properly  distinguished  from  an  edge,  the  outer  limits  of 
the  object  are  assumed  to  be  known  before  the  object  is 
actually  tracked,  the  justification  being  that  in  the  fu¬ 
ture  this  information  will  be  obtained  from  a  machine 
vision  algorithm.  Stereo  machine  vision  cannot  provide 
the  dense  surface  descriptions  that  tracking  yields.  How¬ 
ever,  the  sparse  matches  provided  by  stereo  algorithms 
are  reliable  [?].  These  sparse  data  matches  give  an  out¬ 
line  of  the  object.  In  this  way,  stereo  machine  vision 
complements  the  tracking  algorithm  by  giving  a  first  es¬ 
timate  of  the  area  that  the  robot  must  track  in  each  basic 
configuration. 

4  Implementing  the  Algorithm 

Simple  three  dimensional  objects  have  been  used  to  test 
the  tracking  algorithm.  An  illustration  of  the  objects 
and  the  actual  data  points  collected  are  shown  in  Fig¬ 
ure  6.  The  arrows  on  the  illustration  indicate  the  levels 
at  which  the  planar  slices  were  taken. 

Figure 6: Simple Objects and Actual Data Points Collected

In order to show that the tracking scheme gathers ample
data points to identify a simple object, the data sets
shown in Figure 6 were used as input to a shape classifier
program. The planar slices from the box were identified
as quadrilaterals and the data points from the cylinder
were recognized as circles. Hence, simple objects like
those shown in Figure 2 can be recognized by combining
shape information about the planar data slices.
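
The shape classifier program itself is not described here. Purely as a stand-in, the following sketch separates a circular slice from a non-circular one by the relative spread of the point distances about the centroid. This is not the classifier used in the experiments; the tolerance of 0.02 and the function names are arbitrary assumptions, and the test presumes that the slice is closed and sampled fairly evenly.

```python
import math


def radial_spread(points):
    """Return std(r) / mean(r), where r is the distance to the centroid."""
    if len(points) < 3:
        raise ValueError("need at least three points in a slice")
    cx = sum(p[0] for p in points) / len(points)
    cy = sum(p[1] for p in points) / len(points)
    radii = [math.hypot(x - cx, y - cy) for x, y in points]
    mean_r = sum(radii) / len(radii)
    if mean_r == 0.0:
        raise ValueError("all points coincide")
    variance = sum((r - mean_r) ** 2 for r in radii) / len(radii)
    return math.sqrt(variance) / mean_r


def classify_slice(points, tol=0.02):
    """Label a planar slice 'circle' when its radial spread is below tol."""
    return "circle" if radial_spread(points) < tol else "non-circular"
```

A stack of slices all labelled "circle" with equal radii would then suggest a cylinder, while radii that shrink with height would suggest a cone or a sphere; telling a quadrilateral from other non-circular slices needs a further step, such as corner counting, that is not sketched here.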

Complex  objects  have  also  been  employed  in  the  ex¬ 
periments.  A  standard  telephone  receiver  was  placed  at 
different  orientations  in  the  workspace.  Figure  7  and 
Figure  8  show  the  actual  data  collected  during  surface 
tracking.  Again  note  the  arrows  on  the  illustration.  In 
the  future  the  data  collected  will  be  used  to  recognize 
the  object. 

5  Conclusions  and  Future  Plans 

This  paper  outlines  a  robotic  object  tracking  algorithm 
which employs multiple controllers. The dual-drive
force/velocity  controller  is  used  when  the  robot  is  in 
contact  with  an  object,  and  the  damping  controller, 
Fcomply,  is  employed  to  move  the  robot  on  a  specified 
trajectory.  The  combination  of  these  controllers  enables 
the  robot  to  explore  its  environment  and  gather  data 
about  objects  in  the  workspace. 

The  algorithm  has  been  successfully  implemented  on 
an  IBM  7565  robot,  equipped  with  strain  gauges  on  its 
end  effector.  Dense  data  sets  have  been  collected  during 
surface  tracking  with  the  intention  that  this  information 
will  be  used  to  recognize  various  objects.  These  data 
sets  have  been  used  to  recognize  simple  objects  such  as 
a box and a cylinder.

With the additional feature of recognizing concave and
convex discontinuities on an object's surface, the algorithm
can be applied to complex objects. The performance
of the algorithm is not dependent on the orientation
of an object. A telephone receiver was placed at
different orientations in the workspace, and tracking was
successfully demonstrated in all cases.

Figure 7: Tracking the Telephone Receiver: Actual Data Points

Figure 8: Tracking the Telephone Receiver: Actual Data Points

Future investigations in the area of dual-drive control
will  concentrate  on  developing  a  force/velocity  controller 
that  will  move  the  robot  along  a  trajectory  in  three  di¬ 
mensions  and  thus  enhance  the  data  collection  ability  of 
the  robot.  The  necessary  reconfigurations  of  the  end  ef¬ 
fector  should  be  made  while  the  tool  remains  in  contact 
with  the  surface.  This  will  eliminate  the  discontinuous 
motion  presently  associated  with  reorienting  the  end  ef¬ 
fector  and  result  in  faster  tracking.  These  enhancements 
will  enable  the  robot  to  follow  a  complex  path  along  the 
surface  of  an  object  and  limit  the  uncertainty  involved 
in  making  an  identification. 

Work  is  underway  to  enable  the  recognition  program 
to  identify  complex  shapes,  much  like  the  planar  slices 
taken from the telephone receiver. Here there is the possibility
of including information about the "features" recognized
by the algorithm, i.e. edges, concave and convex
discontinuities, in order to give an indication of the identity
and orientation of an object. Work is progressing
toward incorporating machine vision in an interactive
object  recognition  system  that  will  request  additional 
tracking  data  from  a  specified  area  of  the  object  in  order 
to  increase  the  probability  of  making  a  positive  identifi¬ 
cation.  When  these  enhancements  are  employed  as  part 
of  the  complete  tracking  system,  the  robot  will  be  able 
to  navigate  in  a  more  complex,  real-world  environment. 

Currently, most robot programming is done either by manual
programming or by the "teach-by-showing" method using
a  teach  pendant.  Both  of  these  methods  have  been  found  to 
have  several  drawbacks. 

We  propose  a  novel  method  to  program  a  robot,  the 
assembly-plan-from-observation  (APO)  method.  The  APO 
method  aims  to  build  a  system  that  has  the  capability  of  ob¬ 
serving  a  human  performing  an  assembly  task,  understanding 
the  task  based  on  the  observation,  and  generating  the  robot 
program  to  achieve  the  same  task. 

In  particular,  this  paper  defines  assembly  relations  which 
serve  as  the  basic  representation  of  each  assembly  task.  Then, 
we  verify  that  such  assembly  relations  can  be  recovered  from 
the  observation  of  human  assembly  tasks,  and  that  from  such 
assembly  relations,  it  is  possible  to  generate  robot  motion 
commands  to  repeat  the  same  assembly  task.  Finally,  we 
demonstrate  an  APO  system  based  on  the  assembly  relations. 

1  Introduction 

The  key  characteristic  of  robots  is  their  versatility.  They  can 
be  used  to  perform  a  large  variety  of  tasks  without  a  major 
re-design  of  the  robot.  This  versatility  is  due  to  the  generality 
of  the  robot’s  physical  structure,  but  a  robot’s  generality  can 
be  exploited  only  if  the  robot  can  be  easily  programmed. 

Several  methods  to  program  a  robot  have  been  proposed. 
Such methods include: teach-by-showing, teleoperation [17,
12, 3], textual programming [2], and automatic programming
[6, 9, 7]. In teach-by-showing methods, an engineer
stores,  using  a  teach  pendant  in  teaching  mode,  a  path  along 
which  a  robot  should  move  repeatedly.  In  run  mode,  the 
robot  follows  the  path  it  was  previously  taught.  This  is  the 
most common method to program a robot in industrial applications.
This method is suitable for programming a robot to
repeat simple movements. Moreover, this method is excellent
because a robot can learn complicated paths from a trained
engineer.  However,  this  method  requires  that  an  engineer  is  in 
the  same  environment  as  the  robot.  Thus,  we  cannot  use  this 
method  in  hazardous  environments  such  as  in  nuclear  plants, 
underwater,  or  in  outer  space. 

To  remedy  this  problem,  teleoperation  methods  have  been 
proposed.  This  method  uses  a  master  manipulator  for  teaching 
and  a  slave  manipulator  for  execution.  An  engineer  controls 
the master manipulator in a safe environment while monitoring
the hazardous environment with a remote TV camera
and display. The slave manipulator in the hazardous environment
executes real operations based on control signals from
its master manipulator. Since this method does not require an
operator in the execution environment, it is suitable for
operation in hazardous environments. However, by using this
method, we can only teach a robot trajectory information. It
is difficult to build a flexible robot system able to use force
control with error recovery capabilities. It is also true that we
have  to  reconstruct  entire  programs,  even  when  a  very  minor 
change  in  the  program  is  desired. 

Textual  programming  is  often  used  in  academic  environ¬ 
ments.  A  programmer  stores  a  robot  command  sequence  in 
a  computer  as  a  textual  program.  By  using  a  compiler  or  an 
interpreter, a command sequence in a textual program is converted
into a form that the robot can execute. This method is
quite flexible because we can store any kind of control program.
However, it requires a long development period and
expert  programmers. 

In  order  to  speed  up  the  programming  process,  automatic 
programming  has  been  proposed.  The  method  tries  to  develop 
geometric  reasoning  systems  which  can  generate  textual  pro¬ 
grams  to  control  a  robot  from  geometric  information  given 
by geometric models and task specifications. This direction
is quite promising; however, there are many issues to be addressed
before we have a complete automatic programming
system. Such issues include: how to generate a sequence of
operations, how to determine a grasp point for each operation,
and how to determine a global path to move an object while avoiding
collisions with other objects. It is quite difficult to build
a  complete  automatic  programming  system,  though  perhaps 
not  impossible. 

We  propose  a  novel  method  that  combines  automatic  pro¬ 
gramming  and  teleoperation.  We  propose  to  add  a  vision 
capability  that  will  observe  human  operations  to  an  automatic 
programming  system  (a  geometric  reasoner).  In  particular, 
we propose a system that observes a human performing
assembly tasks while a geometric reasoner analyzes and recognizes
such tasks from observation, and generates the same
assembly sequence for a robot. We will refer to this paradigm
as  Assembly  Plan  from  Observation  (APO). 

Due  to  the  geometric  reasoning  capability,  the  APO  system 
understands the operations that the operator is performing.
Thus, the system can, for example, discard unnecessary motions
which are often introduced by a human teleoperator. The
system can also insert error recovery routines into the generated
assembly plans. In this regard, APO is superior to the
teleoperation  method. 

Due  to  the  vision  capability,  the  system  can  solve  several 
otherwise  extremely  difficult  problems,  such  as  path  planning 
and  determining  the  optimal  assembly  sequence,  by  simply 
observing  a  human  performing  the  operation.  In  this  regard, 
APO  is  superior  to  the  automatic  programming  method. 

2  Assembly  plan  from  observation 

In  an  APO  system,  a  human  operator  performs  assembly  tasks 
in  front  of  a  video  camera.  From  the  camera,  the  system  ob¬ 
tains  a  continuous  sequence  of  images  recording  the  assembly 
tasks.  In  order  for  the  system  to  recognize  assembly  tasks 
from  the  sequence  of  images,  the  system  has  to  perform  the 
following  six  operations  (See  Figure  1.): 

•  Temporal Segmentation - dividing the continuous sequence
of images into meaningful segments which correspond
to separate human assembly tasks.

•  Object  Recognition  -  recognizing  objects  and  determin¬ 
ing  object  configurations  in  a  given  image  segment. 

•  Task  Recognition  -  recognizing  assembly  tasks  based  on 
the  results  of  an  object  recognition  system. 

•  Grasp Recognition - recognizing where and how the human
operator grasps an object for achieving the assembly
task.

•  Global  Path  Recognition  -  recognizing  the  path  along 
which  the  human  operator  moves  an  object  while  avoid¬ 
ing  collision. 

•  Task  Instantiation  -  collecting  necessary  parameters  from 
object  recognition,  grasp  recognition,  and  global  path 
recognition  results  for  performing  the  recognized  assem¬ 
bly  tasks,  and  setting  up  assembly  plans  to  perform  the 
same  task  using  a  robot  manipulator. 


Figure 1: Assembly plan from observation (panels: human assembly task before, during, and after; robot assembly task).

In this paper, we will concentrate on the task recognition
and task instantiation modules, because these two parts form
the main loop for the assembly plan from observation.

The outline of the modules is as follows:

Our object recognition module identifies each object using
the object models from a given image segment. The module
represents the recognition results in a world model, as shown
in Figure 1, by using the geometric modeler, Vantage.

Our task recognition module recognizes object relations in
two image segments and extracts the transition between two
object relations from the two segments. The task recognition
system has abstract task models in a data base. Each abstract
task in the data base describes a transition between two different
object relations. From the task models in the data base,
the system identifies a task model that describes the transition
needed to achieve the observed object relations, as shown in
Figure 1.

Our task instantiation module represents the recognition result
as an instantiated task model. An instantiated task model
associates a transition with an action capable of causing the
transition. It also includes appropriate parameters to achieve
the action based on the given scenes. Such parameters include
object locations and the grasping locations for the action. The
instantiated task model also includes the global path along
which to move an object. The system then inserts the obtained
grasp and stack locations into the command sequence.
Finally, the command sequence is sent to the robot.

3 Defining Task Models

In  order  to  develop  task  models  for  an  APO  system,  we  have 
to  define  representations  to  describe  assembly  tasks.  In  this 
section, we will define assembly relations for such representations.
Then, we will verify that such assembly relations
satisfy the following two requirements:

•  recoverability  -  assembly  relations  can  be  extracted  from 
observation, 

•  inferability  -  a  human  assembly  task  can  be  inferred  from 
an  assembly  relation,  and  it  is  possible  to  generate  as¬ 
sembly  operations  for  a  manipulator  from  the  assembly 
relation. 

Finally,  we  will  consider  how  to  define  assembly  task  models 
using  the  assembly  relations. 

3.1  Assembly  relation 

In each assembly task, at least one object is manipulated. We
will refer to the object as the manipulated object. The manipulated
object is attached to other stationary objects, which we
refer  to  as  environmental  objects,  so  that  the  manipulated  ob¬ 
ject  achieves  a  particular  relation  with  environmental  objects. 

We will define assembly relations with respect to face contacts
between a manipulated object and its stationary environmental
objects. The essential goal of an assembly task is
to establish a new face contact between a manipulated object
and  environmental  objects.  For  example,  the  goal  of  a  peg- 
insertion  is  to  achieve  face  contacts  at  the  side  and  bottom 
faces  of  the  peg  against  the  side  and  bottom  faces  of  the  hole. 
Thus,  it  is  effective  to  use  face  contact  relations  as  the  central 
representation  for  defining  assembly  task  models. 

To  make  the  overall  problem  manageable,  we  concentrate 
on  a  world  of  polyhedral  objects  in  which  only  one  polyhedron 
may  be  moved  by  one  assembly  task.  An  assembly  relation 
will  be  defined  between  a  manipulated  polyhedron  and  sev¬ 
eral  stationary  environmental  polyhedra.  This  restriction  still 
leaves  a  diverse  range  of  interesting  relationships,  actions, 
and  resulting  assemblies. 

Such  face  contact  relations  satisfy  the  recoverability  re¬ 
quirement. 

•  Face  contact  relations  can  be  obtained  by  analyzing  ge¬ 
ometric  models.  An  object  recognition  program,  such 
as  in  [4]  can  recognize  a  manipulated  object,  determine 
its  configuration,  and  represent  the  recognition  result  in 
a  geometric  modeler,  as  well  as  the  geometric  represen¬ 
tations  of  other  stationary  environmental  objects.  By 
examining  each  face  pair  between  the  manipulated  and 
environmental  objects,  new  face  contact  relations  can  be 
determined  as  they  occur. 
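
The face-pair examination can be sketched as follows. Two planar faces are taken to be in contact when their outward unit normals are opposed and their supporting planes coincide; the check that the two face polygons actually overlap is omitted for brevity. The face representation (a unit normal plus one point on the face) and the function names are assumptions made for illustration, not the interface of the geometric modeler.

```python
def dot(a, b):
    """Dot product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))


def faces_in_contact(face_a, face_b, tol=1e-6):
    """Test two planar faces, each given as (unit_normal, point_on_face).

    Contact requires opposed normals and coincident supporting planes.
    Polygon overlap is not checked in this sketch.
    """
    normal_a, point_a = face_a
    normal_b, point_b = face_b
    if abs(dot(normal_a, normal_b) + 1.0) > tol:
        return False  # normals are not anti-parallel
    offset = dot(normal_a, [q - p for p, q in zip(point_a, point_b)])
    return abs(offset) <= tol  # planes coincide when the offset vanishes


def new_contacts(manipulated_faces, environment_faces, tol=1e-6):
    """List index pairs (i, j) of manipulated/environmental faces in contact."""
    return [
        (i, j)
        for i, fa in enumerate(manipulated_faces)
        for j, fb in enumerate(environment_faces)
        if faces_in_contact(fa, fb, tol)
    ]
```

For a box resting on a table, the bottom face of the box (normal pointing down) and the top face of the table (normal pointing up) form the only contacting pair.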

Face  contact  relations  also  satisfy  the  inferability  require¬ 
ment. 

•  Each  face  contact  relation  constrains  possible  motions. 
At  contacting  faces,  the  orientations  of  surface  normals 
are sufficient for characterizing relative object movement
constraints.  For  example,  consider  a  box  resting  on  a 
table.  At  the  contact  faces  surface  normals  are  parallel 
and  opposing.  In  this  position  the  box  can  only  move 
up  or  parallel  to  the  table.  A  more  constraining  case  is  a 
square  bar  inserted  in  a  matching  shaped  hole.  The  bar’s 
four  faces  contact  their  hole  counterparts  with  opposing 
normals  and  the  only  possible  motion  lies  along  the  hole’s 
axis.  Thus,  from  a  face  contact  relation,  it  is  possible  to 
infer  the  assembly  actions  that  cause  such  face  contact 
relations. 

•  Face  contact  relations  characterize  a  control  strategy 
necessary  to  maintain  such  relations.  Each  face  contact 
relation  provides  a  constraint  to  motion.  As  long  as  the 
motion  constraint  is  constant,  the  same  mode  of  control 
is  applicable.  When  the  motion  constraint  changes,  a 
different mode of control is required. For example, let
us consider a box to be placed on a table and then slid
on the table. Position control can be used to lower the
box towards the table while the box is in the air (the box
does not have any face contact). When the box is about
to make contact with the table (about to have one-face
contact), force control is necessary to detect the collision,
which ensures that the box is on the table. Combined
force and position control is necessary to slide the box on
the table (maintaining one-face contact). Face contact
relations have been found to characterize required control
strategies [14, 13]. Thus, such face contact relations
can be used to determine the control strategy necessary to
achieve such face contact relations in assembly actions.
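
The motion constraint described in the bullets above can be written compactly: a translation direction is admissible when it has no component pointing against any contact direction. The following is a minimal sketch, assuming each contact direction is a unit vector pointing from the environment face toward the manipulated object; the function name and the tolerance are illustrative and do not come from the original system.

```python
def is_motion_allowed(direction, contact_directions, tol=1e-9):
    """Return True if `direction` does not push into any contacting face."""
    for normal in contact_directions:
        # A negative dot product means the motion has a component that
        # points into the environment face, which the contact forbids.
        dot = sum(d * n for d, n in zip(direction, normal))
        if dot < -tol:
            return False
    return True


# Box resting on a table: one contact direction, pointing up (+z).
table_contact = [(0, 0, 1)]
print(is_motion_allowed((0, 0, 1), table_contact))    # lift the box
print(is_motion_allowed((1, 0, 0), table_contact))    # slide along the table
print(is_motion_allowed((0, 0, -1), table_contact))   # push into the table

# Square bar in a matching hole whose axis is z: four side contacts.
hole_contacts = [(1, 0, 0), (-1, 0, 0), (0, 1, 0), (0, -1, 0)]
print(is_motion_allowed((0, 0, 1), hole_contacts))    # along the hole axis
print(is_motion_allowed((1, 0, 0), hole_contacts))    # sideways
```

With these inputs, the box may move up or parallel to the table, and the bar may move only along the hole axis, which matches the two examples in the text.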

Using such face contact relations as the basic representations,
we will describe an assembly task as a transition
between pre-assembly relations and post-assembly relations.
Based on the description, we will build an APO system in the
following steps:

•  classifying  all  possible  face  contact  relations  (assembly 
relations)  between  manipulated  and  environmental  ob¬ 
jects, 

•  considering  what  kinds  of  transitions  in  assembly  re¬ 
lations  occur  and  building  a  tree  in  which  each  branch 
corresponds to one possible transition and each leaf node
corresponds to an assembly relation, and

•  assigning  manipulator  motions  to  achieve  such  assembly 
relation  transitions  (the  completed  tree  is  referred  to  as  a 
procedure  tree). 

3.2 Taxonomy for Assembly Relation

For  geometric  objects  in  a  polyhedral  world,  our  taxonomy 
identifies  all  possible  assembly  relations  based  on  the  direc¬ 
tions  of  contact  surface  normals.  First,  we  will  analyze  a  two- 
dimensional  polygonal  world  and  then  a  three-dimensional 
polyhedral world. Some related issues are found in [10].

3.2.1  Two-dimensional  cases. 

Assembly  relations  will  be  considered  between  polygons 
(not  polyhedra)  by  using  normal  directions  at  contact  edges. 
Figure 2 shows an assembly relation having unidirectional
contact.  Even  if  polygons  have  several  contact  edges  with 
the  same  normal  direction,  they  are  considered  as  having 
unidirectional  contact. 

The normal direction of an edge can be represented as a point
on the Gaussian circle by translating the unit normal so that
its starting point sits at the origin of the coordinate system. Its
end point then lies on the unit circle whose center is the origin. This
mapping is referred to as a Gauss mapping. Equivalently,
possible  movement  directions  of  the  object  polygon  can  be 
represented  on  the  Gaussian  circle.  Points  on  the  semicircle 
around  the  contact  direction  correspond  to  the  possible  move¬ 
ment  directions  of  the  object  polygon.  Points  on  the  other 
semicircle  correspond  to  the  prohibited  motion  directions. 

Bidirectional  contact  has  two  possible  assembly  relations 
as  shown  in  Figure  3.  Assembly  relation  2d-b  in  Figure  3  has 
two  maps  located  opposite  one  another  on  the  circle.  This  case 
has  two  possible  movement  directions  on  the  Gaussian  circle. 
Relation  2d-c  in  Figure  3  has  two  oblique  contact  directions  on 
the  Gaussian  circle.  This  case  has  several  possible  movement 
directions  corresponding  to  a  small  arc. 

Tridirectional contact at first seems to have three cases:
relation 2d-d, relation 2d-e, and relation 2d-f, as shown in
Figure 4. Relation 2d-d has two opposite points and one
intermediate point, resulting in only one possible movement
direction. Relation 2d-e has three points whose maximum arc
is larger than π, allowing no movement of the object at all.
Relation 2d-f has three arbitrary points whose maximum arc
is less than π. As Figure 4(c) illustrates, the middle contact
direction does not affect the possible movement directions on
the  Gaussian  circle.  Thus,  relation  2d-f  is  considered  equiv¬ 
alent  to  relation  2d-c  and  is  not  considered  an  independent 
relation.  When  a  relation  has  more  than  three  directions  of 
contact,  it  can  be  mapped  to  one  of  the  relations  mentioned 
above. 
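
The two-dimensional taxonomy above can be sketched as a small classifier on the Gaussian circle. This is an illustrative reading of the text, not the authors' implementation: contact directions are assumed to be given as angles in radians, and the decision is based on the largest empty gap between neighbouring contact directions (a gap larger than π leaves an arc of possible motions, a gap of exactly π leaves one or two directions, and a smaller gap leaves none). The function name and tolerance are assumptions.

```python
import math


def classify_2d_contact(contact_angles, tol=1e-9):
    """Map a set of 2-D contact directions (radians) to a relation label."""
    two_pi = 2 * math.pi
    angles = sorted(a % two_pi for a in contact_angles)
    if not angles:
        raise ValueError("at least one contact direction is required")

    # Merge directions that coincide, including across the 0 / 2*pi seam.
    unique = []
    for a in angles:
        if not unique or a - unique[-1] > tol:
            unique.append(a)
    if len(unique) > 1 and unique[0] + two_pi - unique[-1] <= tol:
        unique.pop()

    # Angular gaps between neighbouring directions around the circle.
    count = len(unique)
    gaps = [unique[(i + 1) % count] - unique[i] for i in range(count)]
    gaps[-1] += two_pi  # the last gap wraps around the seam
    largest = max(gaps)

    if largest > math.pi + tol:
        # All contacts fit in less than a half circle: motions form an arc.
        return "2d-a" if count == 1 else "2d-c"
    if abs(largest - math.pi) <= tol:
        # Two contacts are exactly opposite; others lie on one side only.
        return "2d-b" if count == 2 else "2d-d"
    return "2d-e"  # contacts surround the object: no motion possible


print(classify_2d_contact([math.pi / 2]))
print(classify_2d_contact([0.0, math.pi]))
print(classify_2d_contact([0.1, 0.5, 1.0]))
```

The last call uses three directions within a small arc (the 2d-f configuration) and returns the 2d-c label, reflecting the statement that 2d-f is not an independent relation.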

Figure 3: Bidirectional contact: (a) relation 2d-b; (b) relation 2d-c.


Figure 4: Tridirectional contact: (a) relation 2d-d; (b) relation
2d-e; (c) relation 2d-f. This is equivalent to relation 2d-c
in terms of possible movement directions, and thus, is not
considered an independent relation.


Table 1: 3D assembly relations

Relation            Explanation

3d-a                3d-a in Figure 5 shows unidirectional contact. The
                    contact direction is represented as a point on the
                    Gaussian sphere. Let us suppose the contact direction
                    is mapped to the north pole of the Gaussian sphere.
                    The possible directions of object motion can be
                    represented as the northern hemisphere of the
                    Gaussian sphere; the prohibited directions can be
                    represented as the southern hemisphere.

3d-b, 3d-c          Bidirectional contacts have two different relations:
                    3d-b and 3d-c, depending on whether contact directions
                    are in opposite directions or not. See Figure 5.
                    Relation 3d-b has possible movements represented as a
                    great circle on the Gaussian sphere. Relation 3d-c, on
                    the other hand, has possible movements represented as
                    an area bounded by two great circles on the Gaussian
                    sphere.

3d-d, 3d-e, 3d-f    The tridirectional contacts have three different
                    relations: 3d-d, 3d-e, and 3d-f. Relation 3d-d and
                    relation 3d-e have three coplanar contact directions.
                    Relation 3d-d has possible movements corresponding to
                    a great semicircle. Relation 3d-e has possible
                    movements corresponding to two points. Between
                    relation 3d-e and relation 3d-d, there exists a
                    relation which has identical possible movement
                    directions as relation 3d-c. Relation 3d-f has possible
                    movements corresponding to a spherical area bounded by
                    three great circles.

3d-g, 3d-h          Relation 3d-g has possible movements corresponding to
                    an arc, while relation 3d-h has one possible movement.
                    Adding one more contact direction to relation 3d-e
                    gives relation 3d-h.

3d-i                3d-i in Figure 5 has no possible movement directions.

3.2.2  Three-dimensional  cases. 

The  same  analysis  can  be  applied  to  3-D  cases.  3-D  cases 
consider  the  relationship  among  polyhedra.  Table  1  summa¬ 
rizes  the  analysis  of  face  contacts  among  polyhedra.  The 
taxonomy has classes of uni-, bi-, tri-, tetra-, and hexadirectional
contacts. Nine different contact patterns are extracted
from this analysis.

We will represent the contact directions and possible movement
directions on the Gaussian sphere as shown in Figure 5.
The  shaded  areas  indicate  the  prohibited  movement  directions 
of  the  object  with  respect  to  the  environment.  The  non-shaded 
areas  indicate  the  possible  movement  directions. 

3.3 Assembly relation transitions

We  will  consider  a  sequence  of  manipulator  operations  to 
achieve  each  assembly  relation  from  assembly  relation  3d-s. 
Such  a  sequence  of  manipulator  operations  is  grouped  into 
a  motion  macro,  i.e.,  a  template  of  manipulator  operations, 
which,  when  applied  to  an  object,  yields  the  desired  assembly 
relation.  This  is  possible  because  each  assembly  relation  is 
defined  so  that  we  can  apply  the  same  manipulator  control 
strategy to achieve the relation by changing only controller
parameters, not the strategy.

In  order  to  reduce  the  number  of  necessary  templates,  we 
will  analyze  each  assembly  relation  in  an  iterative  manner.  We 
will  analyze  simpler  relations  earlier  and  more  complicated 
relations  later.  Also,  instead  of  considering  a  template  to 
directly achieve a complicated relation from 3d-s, we will
consider an intermediate relation, and then try to achieve the
complicated relation. First, we try to achieve an intermediate
relation from 3d-s by using the templates already considered.
Then we try to achieve the final relation from the intermediate
relation using a newly considered template.

Figure 5: 3-D assembly relation taxonomy.

In  order  to  find  an  appropriate  intermediate  relation,  for 
each  assembly  relation,  we  consider  disassembly  actions  from 
the  assembly  relation,  and  extract  all  possible  immediate  inter¬ 
mediate  assembly  relations  just  prior  to  the  assembly  relation. 
We  do  this  because  considering  disassembly  actions  is  easier 
than  considering  assembly  actions. 

Several  intermediate  relations  sometimes  occur  from  the 
same  assembly  relation  due  to  1)  the  variation  in  shapes  of 
contact  faces,  and  2)  the  variety  of  possible  disassembly  op¬ 
erations. 

When the variation is due to the shapes of contact faces,
we have to analyze all intermediate relations and assign appropriate
motion templates to all transitions from the intermediate
relations to the desired relation.

When the variation is due to the variety of possible disassembly
operations, we can choose one appropriate intermediate relation
among the several intermediate relations. We choose the one
which is achieved by the simplest and most robust operation
under uncertainty in positional information. In order to select
such an intermediate relation, we use the following criteria:

1. In the case that a direct detach motion (a motion which
immediately breaks a face-contact) exists, choose it.

2. In the case that a lateral motion (a motion maintaining the
same contact relation) that would break face-contacts by
crossing a certain boundary exists, choose it.

3. In the case in which several candidate motions satisfy
criterion 1 or criterion 2, choose the motion which least
reduces the number of face contacts.

Figure 6: Examples of assembly relation transitions

By  using  these  criteria,  we  will  analyze  each  assembly 
relation,  extract  all  possible  assembly  relation  transitions,  and 
prune  unnecessary  relation  transitions. 
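
The three selection criteria can be sketched as a short selection routine. This is a hedged illustration: the `Candidate` record and its field names are hypothetical, and treating criterion 1 as taking priority over criterion 2 is an assumption based on the order in which the criteria are listed.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Candidate:
    target: str            # relation reached by the disassembly motion
    kind: str              # "detach" (criterion 1) or "lateral" (criterion 2)
    contacts_removed: int  # number of face contacts the motion breaks


def choose_transition(candidates):
    """Pick one disassembly transition following criteria 1 to 3."""
    for kind in ("detach", "lateral"):
        pool = [c for c in candidates if c.kind == kind]
        if pool:
            # Criterion 3: among eligible motions, keep the one that
            # reduces the number of face contacts the least.
            return min(pool, key=lambda c: c.contacts_removed)
    return None


# Detach motions from 3d-c: to 3d-s (two contacts lost) or 3d-a (one lost).
from_3d_c = [Candidate("3d-s", "detach", 2), Candidate("3d-a", "detach", 1)]
print(choose_transition(from_3d_c).target)
```

For the 3d-c case the routine selects the transition to 3d-a, the one that removes a single face contact.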

We can represent relation transitions as a tree structure, as
shown in Figure 7. Each node in the tree represents one particular
assembly relation, and each arc represents the corresponding
assembly relation transition.

3.4  Procedure  tree 

A procedure tree (Figure 8) is created by placing a template of
manipulator operations (motion macro) at each arc separating
the assembly relation nodes of Figure 7. See Table 3. The
manipulator operations chosen are those which can correctly
achieve an assembly relation on one node from the assembly
relation on the other node.

From this analysis in Table 3, the following four motion
macros are extracted:

Analysis of disassembly motions for each assembly relation
([...] marks text lost in extraction)

3d-a    [...] motion gives the transition from the assembly
        relation 3d-a to 3d-s. See Figure 6(a) for an example of
        the direct detach motion which causes an assembly relation
        transition from 3d-a to 3d-s.

3d-b    No direct detach motion can be applied to the assembly
        relation 3d-b. Lateral motions parallel to the contact
        faces can be applied. Depending on the shape of contact
        faces, it reaches either 3d-s or 3d-a. Since this variation
        is due to the shape of the contact face, we have to
        consider both cases. Figure 6(b) shows two possible
        relation transitions.

3d-c    By applying direct detach motions, the 3d-c relation
        becomes either 3d-s or 3d-a. The two possibilities are not
        due to the shape of the contact faces; they are due to
        motion directions. The relation transition from 3d-c to
        3d-s reduces the number of face-contacts by two, while the
        relation transition from 3d-c to 3d-a reduces the number
        by one. The latter relation transition is chosen as the
        desirable one by criterion 3. Figure 6(c) shows two
        possible transitions due to motion directions.

3d-e    [...] Lateral motions along the axis parallel to the
        surrounding contact faces cause several relations, 3d-s,
        3d-a, 3d-b, 3d-c and 3d-d, depending on the shape of
        contact faces. We have to consider these five possible
        relation transitions. We will refer to the axis as the
        insertion axis. See Figure 6(d) for the example.

3d-f    [...] 3d-a and 3d-c by a detach motion depending on motion
        directions. The relation transitions, 3d-f to 3d-s, 3d-f to
        3d-a, and 3d-f to 3d-c, reduce the number of face-contacts
        by three, two, and one, respectively. Thus, following
        criterion 3, the relation transition 3d-f to 3d-c is
        chosen as the desirable one.

Motion macro assignments ([...] marks text lost in extraction)

3d-s to 3d-a    The relation transition from 3d-s to 3d-a is
                realized by an attach motion which contains a
                motion component toward the contact direction.
                Among several attach motions, pure motion towards
                the contact direction until face contact is the
                easiest. Thus, we assign the corresponding template
                of motions to the relation transition from 3d-s to
                3d-a and refer to this template as the
                move-to-contact motion macro.

[...]           [...] since the configuration of the object is [...]
                it is only necessary to translate the object
                parallel to the two contact faces. We assign the
                corresponding template of motions to the relation
                transition and refer to this template as the move
                motion macro.

3d-a to 3d-c    At 3d-a, one face contact is already achieved. The
                relation transition from 3d-a to 3d-c is realized by
                an attach motion along the contact face of 3d-a
                toward another contact face of 3d-c until two-face
                contact occurs. Among several such motions, pure
                motion perpendicular to the intersection lines
                between two contact faces is selected. We achieve
                this relation transition by using the same template
                of operations as for the relation transition from
                3d-s to 3d-a, move-to-contact. Thus, we assign the
                move-to-contact motion macro to the relation
                transition from 3d-a to 3d-c.

[...]           The motion of the [...] object is only movable along
                the insert axis. We use the move macro to make the
                relation transitions.

[...]           At 3d-d, the object can only trans[...] lines among
                contact faces. By using the move-to-contact macro,
                we achieve the tetradirectional contact [...]

• move - a motion sequence for this macro is realized by
translating  a  manipulated  object  from  the  starting  con¬ 
figuration  to  the  ending  configuration. 

• move-to-contact - a motion sequence for this motion
macro is realized by translating a manipulated object
until it contacts a face of an environmental object, then
fitting a manipulated object face to the contact environmental
face.

If  we  have  precise  configurations,  we  can  achieve  the 
contact  and  fitting  operations  by  using  such  configura¬ 
tions.  Otherwise,  these  operations  require  some  sensory 
feedback  to  detect  the  occurrence  of  contact  and  fitting. 
See  [14]  for  a  detailed  implementation  of  the  macro  as 
a  skill  in  a  force  feedback  type  manipulator. 

•  insert-between  -  a  motion  sequence  for  this  motion  macro 
is  realized  by  first  aligning  a  manipulated  object  between 
a  pair  of  contact  environmental  faces,  and  then  translat¬ 
ing  it  between  the  pair  of  contact  faces  to  the  ending 
configuration. 

If  we  have  precise  configurations,  we  can  achieve  align 
motion  and  translation  motion  using  the  configurations. 
Otherwise,  the  align  motion  requires  some  sensory  feed¬ 
back.  See  [14]. 

•  insert-into  -  a  motion  sequence  for  this  motion  macro 
is  realized  by  aligning  a  manipulated  object  along  the 
insert  axis,  and  then  translating  along  the  axis  to  the 
ending  configuration. 

If  we  have  precise  configurations,  we  can  achieve  align 
motion  and  translation  motion  using  the  configurations. 
Otherwise,  the  align  motion  requires  some  sensory  feed¬ 
back.  See  [14]. 

Figure  8  represents  a  completed  procedure  tree. 

3.5  Task  models 

A  task  model  consists  of  an  assembly  relation  transition,  a  mo¬ 
tion  macro,  and  the  necessary  parameters  required  to  expand 
the  motion  macro  into  a  sequence  of  manipulator  commands. 
For example, Figure 9 shows the task model corresponding to
the  transition  from  3d-s  to  3d-a.  The  starting  and  end  relation 
slots  contain  3d-s  and  3d-a,  respectively.  The  action  slot  con¬ 
tains  the  move-to-contact  motion  macro.  In  order  to  achieve 
the  motion,  it  is  necessary  to  know  the  previous  configuration 
and  end  configuration  of  the  manipulated  object.  The  cor¬ 
responding  parameters  are  prepared  as  task  parameters.  The 
values  corresponding  to  these  parameters  are  obtained  by  the 
task  instantiation  module  at  run  time. 

Thirteen  task  models  corresponding  to  all  arcs  in  the  tree 
are  prepared.  They  are  attached  to  the  procedure  tree. 
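
The slot structure of a task model can be sketched as a small record type. This is only an illustration of the description above: the class, its field names, and the representation of a configuration as an (x, y, z) position are assumptions, and only the one task model described in the text (3d-s to 3d-a) is filled in.

```python
from dataclasses import dataclass, field, replace


@dataclass(frozen=True)
class TaskModel:
    start_relation: str      # starting relation slot
    end_relation: str        # end relation slot
    action: str              # action slot: name of the motion macro
    parameter_names: tuple   # task parameters the macro needs
    parameters: dict = field(default_factory=dict)  # filled at run time

    def instantiate(self, **values):
        """Return a copy whose task parameters are filled in."""
        missing = [n for n in self.parameter_names if n not in values]
        if missing:
            raise ValueError(f"missing task parameters: {missing}")
        return replace(self, parameters=dict(values))


# Abstract task models are attached to arcs, keyed here by (start, end).
TASK_MODELS = {
    ("3d-s", "3d-a"): TaskModel(
        "3d-s", "3d-a", "move-to-contact",
        ("previous_configuration", "end_configuration"),
    ),
}

# Hypothetical values; a real configuration would also include orientation.
model = TASK_MODELS[("3d-s", "3d-a")].instantiate(
    previous_configuration=(0.0, 0.0, 0.3),
    end_configuration=(0.0, 0.0, 0.0),
)
print(model.action, model.parameters)
```

Keeping the abstract model immutable and returning a filled copy mirrors the separation between the prepared task models and their run-time instantiation.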

4  Implementation  of  APO  system 

How are task models used to recover human assembly tasks
in the APO system? The task recognition mechanism will be
explained in the following examples. The example system
consists of three classes of objects (any of which can appear
in the scene): castle, block, and stick (Figure 10).

4.1  Temporal  Segmentation 

The system assumes that at the beginning of each assembly
task human intervention occurs in the scene and at the end of
the assembly task the human disappears from the scene. By
using this assumption, the APO system segments a continuous
image sequence given by a TV camera from the scene into a
finite number of meaningful chunks.

Figure 8: Procedure tree.

Figure 10: Castle, block and stick.

By using the level change in the brightness difference, the
system can detect human intervention. Figure 11 shows a
continuous image sequence of a scene given by a TV camera,
while the human operator is putting a castle on the table.
Before human intervention, the scene consists of only still
objects, thus the difference between two consecutive images
is at the quiet level. When human intervention occurs, the
brightness difference is large due to the motion of the human and
the manipulated object in the scene. This disturbance continues
until the end of the assembly operation. After the human hand
disappears, the scene consists of only still objects. Thus, the
brightness difference returns to the quiet level.

We  have  been  using  this  method  for  detection  for  several 
live  demos  repeated  continuously  for  several  days,  and  the 
method  never  failed. 
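
The segmentation idea can be sketched with a simple frame-difference threshold. This is a minimal sketch under stated assumptions, not the system's code: frames are flat lists of brightness values, the disturbance measure is the mean absolute difference between consecutive frames, and the threshold separating the quiet level from the disturbed level is a hypothetical parameter.

```python
def mean_abs_difference(frame_a, frame_b):
    """Mean absolute brightness difference between two frames."""
    return sum(abs(a - b) for a, b in zip(frame_a, frame_b)) / len(frame_a)


def segment_assembly_tasks(frames, threshold):
    """Return (first_disturbed_index, first_quiet_index) for each task."""
    segments = []
    start = None
    for i in range(1, len(frames)):
        disturbed = mean_abs_difference(frames[i - 1], frames[i]) > threshold
        if disturbed and start is None:
            start = i                    # human intervention begins
        elif not disturbed and start is not None:
            segments.append((start, i))  # scene is still again
            start = None
    # A disturbance still in progress at the end is not reported.
    return segments


# Synthetic two-pixel sequence: still, disturbed for three frames, still.
frames = [[0, 0]] * 3 + [[10, 0], [0, 10], [5, 5]] + [[5, 5]] * 3
print(segment_assembly_tasks(frames, threshold=1.0))
```

A practical version would likely require the quiet level to persist for several frames before closing a segment, so that a brief pause of the hand does not split one task in two.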

4.2 Object Recognition

Objects  in  the  scene  are  recognized  from  range  data.  In  our 
current  implementation,  b/w  images  are  used  only  for  detect¬ 
ing  the  completion  of  one  assembly  task.  More  reliable  range 
data are used for analyzing the scene. A certain period
after the detection of the completion of one assembly task,
the  APO  system  invokes  the  range  finder  and  measures  range 
information  in  the  scene.  The  APO  system  then  generates  a 
difference  image  between  the  range  image  from  the  previous 
step  (before  the  assembly  task)  and  the  range  image  from  the 
current  step  (after  the  assembly  task). 

The system applies a segmentation program to the difference
image and obtains any newly appearing regions. These
new regions correspond to the faces of the object manipulated
by the assembly task. See Figure 12.
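
The range-image differencing step can be sketched as a per-pixel comparison. This is an assumption-laden illustration: range images are taken to be equally sized nested lists of range values, a pixel is marked as changed when its absolute range difference exceeds a hypothetical `min_change`, and the segmentation of the marked pixels into face regions is not shown.

```python
def new_region_mask(range_before, range_after, min_change):
    """Mark pixels whose range value changed between two range images."""
    return [
        [abs(after - before) > min_change
         for before, after in zip(row_before, row_after)]
        for row_before, row_after in zip(range_before, range_after)
    ]


# Synthetic 3x3 range images: an object appears in the lower-right corner.
before = [[9.0, 9.0, 9.0],
          [9.0, 9.0, 9.0],
          [9.0, 9.0, 9.0]]
after = [[9.0, 9.0, 9.0],
         [9.0, 7.5, 7.5],
         [9.0, 7.5, 7.5]]
for row in new_region_mask(before, after, min_change=0.5):
    print(row)
```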


Figure 12: The difference in range data (range images N and N+1, and their difference).

Figure 13: Object recognition: a recognition program is applied
only to any newly appearing regions, and recognizes
only the manipulated object. The recognition results are represented
by Vantage.

Figure 14: Extracted contact faces and assembly relation.

4.3 Task Identification

By  using  the  transformation  from  body  coordinate  systems  to 
face  coordinate  systems,  (available  from  the  Vantage  geomet¬ 
ric  modeler),  the  configurations  of  the  faces  of  the  manipulated 
and  environmental  objects  are  obtained. 

The system extracts contacting face pairs from the face
configurations. Here, a contacting face pair is a face from
the manipulated object and a face from an environmental
object, which have the same face equations and whose surface
normals are opposite to each other.

The system determines the assembly relation based on the
contacting face pairs by analyzing the contact directions of
the pairs. Here, the contact direction is defined as the normal
direction from the environment faces to the manipulated object
faces, as previously defined. The contact pairs are grouped
into a set of contact directional groups so that each group has
face pairs with the same contact direction. By examining the
occurrence of directions, we can determine which assembly
relation results from the assembly task.
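
The pairing and grouping steps can be sketched as follows. This is a simplified illustration rather than the Vantage-based implementation: each face is assumed to be a plane given by an outward unit normal n and an offset d with n · x = d, so two faces lie on the same plane with opposite normals when both n and d change sign. Checking that the two faces actually overlap is omitted, and all names are hypothetical.

```python
def find_contact_pairs(object_faces, environment_faces, tol=1e-6):
    """Pair faces that share a plane and have opposite normals."""
    pairs = []
    for obj_normal, obj_offset in object_faces:
        for env_normal, env_offset in environment_faces:
            opposite = all(abs(a + b) <= tol
                           for a, b in zip(obj_normal, env_normal))
            same_plane = abs(obj_offset + env_offset) <= tol
            if opposite and same_plane:
                pairs.append(((obj_normal, obj_offset),
                              (env_normal, env_offset)))
    return pairs


def group_by_contact_direction(pairs, digits=6):
    """Group contact pairs by the environment face normal."""
    groups = {}
    for obj_face, env_face in pairs:
        # The contact direction points from the environment face
        # toward the manipulated object.
        key = tuple(round(c, digits) for c in env_face[0])
        groups.setdefault(key, []).append((obj_face, env_face))
    return groups


# Table top at z = 0.5; an object with its bottom face resting on it.
table_faces = [((0.0, 0.0, 1.0), 0.5)]
castle_faces = [((0.0, 0.0, -1.0), -0.5),  # bottom
                ((0.0, 0.0, 1.0), 1.0),    # top
                ((1.0, 0.0, 0.0), 0.2)]    # one side
groups = group_by_contact_direction(
    find_contact_pairs(castle_faces, table_faces))
print(list(groups))
```

Here a single contact direction is found, which corresponds to the unidirectional relation 3d-a.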

The  system  recognizes  the  contact  faces  and  contact  di¬ 
rections  as  shown  in  Figure  14.  From  the  contact  faces  in 
Figure  14,  the  system  determines  that  the  current  assembly 
relation  is  3d-a. 

Before  the  assembly  task,  the  castle  does  not  exist  in  the 
scene.  Thus,  before  the  assembly  task,  the  assembly  relation 
between  the  castle  and  the  table  was  3d-s.  After  performance 
of  the  assembly  task,  the  manipulated  castle  established  a  3d-a 
assembly  relation  with  the  environmental  object,  the  table. 

From  this  observation,  the  system  recognizes  that  the  as¬ 
sembly  relation  transition,  3d-s  to  3d-a,  occurs  due  to  the 
assembly task. The corresponding task model, 3d-s to 3d-a,
is  extracted  from  the  corresponding  arc  along  the  procedure 
tree. 

4.4  Task  Instantiation 

In  this  example,  at  the  previous  step,  the  castle  was  stored  on 
the  warehouse  table.  Thus,  the  assembly  relation  transitions 
during  the  entire  assembly  task  are 

• 3d-a to 3d-s: detach the castle from the warehouse table
to the departure configuration.

• 3d-s to 3d-s: bring the castle from the departure configuration
to the approach configuration in free space.

• 3d-s to 3d-a: move-to-contact the castle to the working
table from the approach configuration.

Thus,  the  corresponding  three  task  models  are  instantiated: 
a-to-s,  s-to-s,  and  s-to-a. 

The  following  procedure  is  executed  to  instantiate  a  task 
model: 

•  obtain  an  abstract  task  model  from  the  data  base, 

•  obtain  necessary  parameters  for  the  motion-macro  (i.e. 
motion  direction  and  translation  distance)  derived  from 
the  object  recognition  results. 

•  obtain  the  necessary  motion  macro  (a  sequence  of  ma¬ 
nipulator  motions)  by  consulting  the  action  slot  of  the 
task  model. 

The  instantiation  of  task  models  occurs  in  the  reverse  order, 
s-to-a,  s-to-s,  and  a-to-s. 

The  s-to-a  task  model  has  a  move-to-contact  motion  macro 
in  the  action  slot.  The  task  model  examines  each  object 
model  and  determines  grasp  configurations,  how  to  grasp  the 
object  with  respect  to  the  body  coordinate  system,  and  the 
specified  grasping  method.  In  the  current  implementation, 
each  object  model  has  predetermined  grasping  configurations. 
The  task  model  chooses  an  appropriate  grasping  configuration 
and  recalculates  it  based  on  the  current  body  configurations. 
The  task  model  determines  the  grasping  configuration  of  the 
castle  based  on  the  observed  castle  configuration.  The  task 
model  also  determines  the  stack  configuration  of  the  castle  on 
the  table  in  a  similar  manner.  The  system  then  inserts  these 
parameters  to  the  corresponding  slots  in  the  instantiated  task 
model. 

The  global  motion  is  also  implemented  as  a  task  model,  s- 
to-s.  This  task  model  has  a  motion  macro,  move.  The  current 
implementation  does  not  consider  collision  between  the  ma¬ 
nipulated  object  and  environmental  objects.  It  assumes  that 
space  above  a  certain  level  of  height  is  free  space.  The  task 
model  incorporates  the  path  from  the  departure  configuration 
to  the  high  position,  the  high  position  to  another  high  position 
above  the  approach  configuration,  and  the  second  high  posi¬ 
tion  to  the  approach  configuration.  These  configurations  are 
obtained  from  the  old  and  new  configurations  of  the  manipu¬ 
lated  objects.  These  values  are  inserted  into  their  slots  in  the 
instantiated  task  model. 
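
The path used by the s-to-s task model can be sketched as a list of waypoints. This is a minimal illustration under the stated free-space assumption; positions are taken to be (x, y, z) tuples with z as height, and the function name and the `free_height` parameter are hypothetical.

```python
def global_motion_waypoints(departure, approach, free_height):
    """Waypoints: up to free space, across, then down to the approach."""
    dep_x, dep_y, dep_z = departure
    app_x, app_y, app_z = approach
    if free_height < max(dep_z, app_z):
        raise ValueError("free_height must not be below either endpoint")
    return [
        departure,
        (dep_x, dep_y, free_height),  # high position above the departure
        (app_x, app_y, free_height),  # high position above the approach
        approach,
    ]


for point in global_motion_waypoints((0.1, 0.4, 0.05), (0.6, 0.2, 0.10), 0.5):
    print(point)
```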

The  disassembly  task  is  also  implemented  as  a  task  model. 
The  current  implementation  does  not  observe  the  warehouse 
table  due  to  the  field  of  view  of  the  range  finder.  Thus,  the 
assembly relation transition, 3d-a to 3d-s, which occurs at the
warehouse  table,  is  given  to  the  system  as  a  priori  knowledge. 
The  system  instantiates  a  disassembly  task  model,  a-to-s.  This 
task  model  has  a  motion  macro,  move  in  the  action  slot.  The 
grasp  configuration  for  the  disassembly  task  is  obtained  from 
the  geometric  model  in  a  similar  manner  to  the  assembly  task 
model.  This  value  is  stored  in  the  corresponding  slot  in  the 
instantiated  task  model. 

The  system  finally  performs  the  operations  given  by  the 
three  task  models  sequentially:  a-to-s,  s-to-s,  and  s-to-a.  Fig¬ 
ure  15  shows  the  final  move-to-contact  operation  by  a  manip¬ 
ulator. 

4.5  Additional  examples 

Figure  16(a)  shows  a  human  operation  for  inserting  a  stick  in 
a  hole  of  the  block.  The  system  recognizes  the  contact  faces 
(Figure  16(b)).  From  the  normal  direction  of  contact  faces, 
the system generates tetradirectional contact. By examining
the directions of the contacts, the system determines that the
observed assembly relation is 3d-e.

Currently, the vision system cannot detect intermediate relation
transitions such as from 3d-b to 3d-e due to our temporal
segmentation method. It can only detect the relation transition
from  3d-s  to  3d-e.  Thus,  the  system  explores  all  the  possible 
paths  in  the  procedure  tree  between  3d-s  and  3d-e.  Then,  by 
examining  the  shape  of  contact  pairs,  the  system  infers  which 
path  occurs. 

More precisely, the relation transition from 3d-s to 3d-e
corresponds to five paths: the direct path, via 3d-b, via 3d-a
and 3d-b, via 3d-a, and via 3d-a and 3d-c. All the arcs
to 3d-e, however, have the same assembly action (and
disassembly action), translation along the axis. In Vantage,
the  disassembly  action  is  applied  to  the  current  geometric 
representation  of  the  manipulated  and  the  environment  objects 
to  find  the  previous  assembly  relation.  The  system  examines 
the  vertex  coordinates  of  all  the  contact  faces,  projects  them 
to  a  plane  parallel  to  the  translation  directions,  and  determines 
which  assembly  relation  occurs  due  to  this  translation  action. 
In this example, the system finds that all the boundary edge
vertices on the contact faces have the same coordinate
along the translation direction. From this, it concludes that
the 3d-s to 3d-e relation transition occurs.
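
The exploration of candidate paths can be sketched as a depth-first enumeration. This is a hedged illustration: the arc table below contains only the arcs implied by the five paths listed above, not the full thirteen-arc procedure tree, and the function and variable names are hypothetical.

```python
# Arcs implied by the five paths from 3d-s to 3d-e named in the text.
PROCEDURE_ARCS = {
    "3d-s": ["3d-e", "3d-b", "3d-a"],
    "3d-a": ["3d-b", "3d-e", "3d-c"],
    "3d-b": ["3d-e"],
    "3d-c": ["3d-e"],
}


def all_paths(arcs, start, goal, prefix=None):
    """Enumerate every path from `start` to `goal` by depth-first search."""
    path = (prefix or []) + [start]
    if start == goal:
        return [path]
    paths = []
    for next_relation in arcs.get(start, []):
        paths.extend(all_paths(arcs, next_relation, goal, path))
    return paths


for path in all_paths(PROCEDURE_ARCS, "3d-s", "3d-e"):
    print(" -> ".join(path))
```

The enumeration assumes the arcs contain no cycles, which holds for the table above. Selecting among the enumerated paths by examining the shape of the contact pairs is not shown.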

The  s-to-e  task  model  has  a  motion  macro,  insert-into  in 
the  action  slot.  Using  the  predetermined  grasp  configuration 
and  the  observed  stick  position,  the  system  performs  the  insert 
operation  as  shown  in  Figure  16(c). 

Figure  17  shows  other  examples  constructed  successfully 
by  the  system. 

5  Conclusion 

We  have  described  an  Assembly-Plan-from-Observation 
(APO)  system  that  can  observe  an  assembly  task  performed 
by  a  human,  recognize  scene  objects,  relations  among  those 
objects,  and  actions  on  them,  and  produce  corresponding  op¬ 
erational  plans  for  a  robot.  Our  work  will  open  a  new  domain 
of  object  recognition  applications  and  provide  a  revolutionary 
way  of  programming  robots. 

The  current  system  analyzes  human  operation  and  gener¬ 
ates  the  fine  motion  plan  from  observation  among  polyhedral 
objects.  Future  directions  include  how  to  generate  grasp  plans 
and  global  motion  plans  from  observation. 


Figure  16:  Insert  a  stick  to  the  block;  (a)  input  scene,  (b)  face 
contact,  (c)  system  performance. 


Figure 15: Put a block on the table with a manipulator.


Figure  17:  Additional  examples. 

Abstract

In  many  man-made  environments,  obstacles  in 
the  path  of  a  mobile  robot  can  be  characterized  as 
shallow,  i.e.  they  have  relatively  small  extent  in 
depth compared to the distance from the camera.
We  present  a  framework  for  tracking  and  reconstruc¬ 
tion  of  shallow  structures.  An  affine  transformation 
is  used  as  a  dynamic  model  for  tracking  potential 
obstacles,  and  for  their  3D  reconstruction  as  shal¬ 
low  structures.  Results  of  applying  the  constraint  of 
affine trackability for automatic identification and
3D  reconstruction  of  shallow  structures  in  realistic 
scenes  are  presented.  It  is  also  shown  how  this  ap¬ 
proach  can  handle  independent  object  motion,  oc¬ 
clusions  and  motion  discontinuity. 

1  Introduction 

Identification  and  3D  representation  of  potential 
obstacles  is  crucial  for  the  success  of  autonomous 
visual  navigation.  Much  of  the  work  in  recovering 
scene  structure  from  monocular  vision  has  concen¬ 
trated  on  deriving  depths  of  points  or  lines  but  has 
achieved  only  limited  success.  Both  the  motion  and 
structure  computations  suffer  from  inherent  ambigu¬ 
ities  [2]  in  many  realistic  scenarios  and  also  are  very 
sensitive  to  noise  in  correspondences  or  flow  extrac¬ 
tion  [16].  The  recovery  of  aggregate  3D  structures  is 
generally  left  to  some  later  stage  in  which  features 
are  grouped  into  objects  or  surfaces  that  could  be 
potential  obstacles. 

In  the  approach  presented  here,  the  goal  is  to  dis¬ 
cover  aggregate  structures  in  the  imaged  scene  which 
can  be  characterised  as  shallow  structures.  Shallow 
structures are 3D structures with the property that
the  difference  in  depth  within  the  whole  structure 
is  small  compared  to  its  distance  from  the  camera. 
Figure  1  shows  an  image  of  a  hallway.  This  scene 
consists  of  compact  structures  like  the  cones  and  the 
trash  can,  and  extended  structures  like  the  walls,  the 
floor  and  the  ceiling.  When  viewed  from  distances 
at  which  it  might  be  desirable  for  a  mobile  robot 
to  represent  these  internally,  the  variation  in  depth 
within  the  compact  structures  is  small  compared  to 
their  average  distances  from  the  camera.  That  is, 
the compact structures can be characterized as shallow
at distances where the robot might need an
internal  representation  of  the  structures. 


'This  work  was  supported  in  part  by  DARPA  (via 
TACOM)  under  contract  number  DAAE07-91-C-R035. 


Figure  1:  A  Hallway  scene  with  shallow  and 
non-shallow structures.

Given that many potential obstacles can be characterized
as shallow, the computational goal is to
automatically  identify  shallow  structures  in  the  envi¬ 
ronment,  maintain  their  identity  or  correspondence 
over  time  (tracking)  and  reliably  reconstruct  their 
3D  position  with  respect  to  the  camera.  In  [14], 
we  demonstrated  how  the  3D  motion  and  structure 
of  a  shallow  object  in  motion,  relative  to  the  cam¬ 
era,  can  be  well  approximated  by  a  four-parameter 
affine  transformation.  A  framework  was  presented 
for  tracking  shallow  objects  over  time  under  the 
affine  constraints. 

Tracking under the affine constraints can also be
utilized  for  the  automatic  identification  of  shallow 
structures  as  distinct  from  their  background,  as  we 
demonstrate  here.  Furthermore,  reliable  reconstruc¬ 
tion  of  the  segmented  shallow  objects  is  also  demon¬ 
strated.  It  is  also  shown  how  this  approach  can  han¬ 
dle  independent  object  motion,  occlusions  and  mo¬ 
tion  discontinuity. 

An  advantage  of  the  approach  described  here  is 
that  3D  structure  information  is  derived  reliably 
without  the  intermediate  step  of  explicit  computa¬ 
tion  of  the  3D  motion  parameters.  Recall  that  the 
well-known  inherent  ambiguities  ([2, 17])  in  the  pro¬ 
cess  of  decomposing  the  image  motion  into  a  3D  ro¬ 
tation  and  a  translation  can  lead  to  large  errors  in 
the  3D  structure  estimation. 

Furthermore,  the  3D  location  and  the  dynamics 
of  the  entire  aggregate  structure  are  directly  repre¬ 
sented  instead  of  the  depth  of  more  primitive  to¬ 
kens  like  points  and  individual  lines.  The  derived 
description of the scene can be viewed as a set of
fronto-parallel planes (cardboard-cut-out surfaces)
of constant depth, one for each shallow object in
the scene.

2  Relationship  to  Previous  Work 

In [7] and [8], a locally-constant acceleration
model  is  used  for  tracking  of  individual  2D  line  seg¬ 
ments  over  a  sequence  of  frames.  Williams  and  Han¬ 
son  [18],  in  their  work  on  flow-predicted  line  corre¬ 
spondences,  have  demonstrated  that  for  translations 
in  depth,  reliable  depth  can  be  computed  by  measur¬ 
ing  the  temporal  magnification  (looming)  of  lengths 
and  regions  at  approximately  constant  depth.  Their 
method  was  demonstrated  on  manually  selected  vir¬ 
tual  line  segments  and  regions  in  the  image,  each 
of  whose  vertices  is  defined  as  the  intersection  of 
two  lines.  Automatic  segmentation  and  temporal- 
persistence  in  tracking  was  not  addressed.  Nelson 
and  Aloimonos  [13]  use  the  divergence  of  the  flow 
field between a pair of frames to divide various regions
in the image into surfaces at different depths
with  respect  to  the  camera. 

One  of  the  earliest  attempts  at  describing  the 
scene  as  planar  patches  and  its  subsequent  segmen¬ 
tation  into  multiple  object  motions  was  that  of  Adiv 
[1].  His  approach  employed  the  constraints  on  im¬ 
age  flow  from  the  rigid  motion  of  a  planar  patch  to 
group  image  regions,  each  region  corresponding  to 
such  a  motion.  The  input  used  was  sparse  or  dense 
image  flow  and  the  associated  confidence  measures 
between  a  pair  of  images  [4].  Since  the  method  is 
based  on  image  flow,  it  is  not  very  reliable  when 
the  scene  is  composed  primarily  of  textureless  sur¬ 
faces.  Furthermore,  Adiv’s  approach  was  limited  to 
descriptions  based  on  only  two  image  frames  and  ex¬ 
tensions  to  multiple  frames  have  not  been  proposed. 
Faugeras and Lustman [9] also suggest an approach
for  reconstructing  the  scene  as  planar  patches  based 
on  line  tokens. 

3  Affine  Describability  and  Tracka- 
bility 

This section presents a brief review of the derivation
of the affine constraint for shallow structures,
and of its use in tracking. (Please refer to [14] for
a detailed discussion of the affine describability and
trackability constraints.)

3.1 Affine Describability

It  was  shown  in  [14]  that  the  image  projections  of 
a shallow structure can be approximated by a four-parameter
affine transformation. That is, given a
3D  structure  which  can  be  well  approximated  by  a 
fronto-parallel  plane  (shallow  structure),  its  image 
projections  at  two  closely  spaced  time  instants  are 
related  through: 

$$\mathbf{p}' \approx s\,R_z\,\mathbf{p} + \mathbf{t}, \qquad \mathbf{t} = s\,\Omega_{xy} + \frac{f}{Z_0}\,T_{xy} \qquad (1)$$

where $\mathbf{p}$ and $\mathbf{p}'$ are the corresponding imaged points
of a shallow structure at times $t$ and $t+1$ respectively,
$s$ is the scale defined as the ratio of average
depths at the two time instants, $R_z$ is the 2x2
rotation matrix for the rotation around the optical
axis (z-axis), $\mathbf{t}$ is the translation in the image plane,
$\Omega_{xy}$ and $T_{xy}$ are the vectors representing the x and
y components of the 3D rotational and translational
vectors respectively, $Z_0$ is the average depth at the
second time instant, and $f$ is the focal length of the
camera.
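
As a minimal sketch of the point mapping p' = s R_z p + t described above, the snippet below applies a four-parameter affine transformation to a few image points. The function name `affine4` and the sample values are hypothetical and chosen only for illustration; the translation t is simply passed in rather than derived from the 3D motion.

```python
import numpy as np

def affine4(points, s, omega_z, t):
    """Map Nx2 image points p to p' = s * R_z(omega_z) p + t."""
    c, sn = np.cos(omega_z), np.sin(omega_z)
    R = np.array([[c, -sn], [sn, c]])      # rotation about the optical axis
    return s * points @ R.T + np.asarray(t, dtype=float)

# Illustrative values (assumed): 10% looming, no image rotation.
pts = np.array([[10.0, 0.0], [0.0, 5.0]])
moved = affine4(pts, s=1.1, omega_z=0.0, t=[2.0, -1.0])
```

With zero rotation the structure is only magnified and shifted, which is the image-plane signature of a shallow structure approached head-on.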


Figure  2:  Parallel  and  perpendicular  endpoint 
uncertainties. 


Figure  3:  The  Parallel  and  perpendicular  error 
components. 

3.2  Affine  Parameters  and  their  Covariances 
A set of noisy line correspondences is used to
compute the best affine motion parameters in the
image plane. The model for noise in the extracted
line segments is shown in Figure 2 [8, 14]. The uncertainties
in the endpoints of a line can be modeled
as variances, $\sigma_\parallel^2$ and $\sigma_\perp^2$, which are the parallel and
perpendicular uncertainties, respectively, in a coordinate
system aligned with the line as shown in the
figure. Based on this noise model, a weighted error
measure (Figure 3) is formulated to relate the image
lines of a structure with the unknown affine parameters.
The error measure is a sum of the parallel and
perpendicular components of the vectors joining the
corresponding endpoints of the line in frame t+1 and
the affine transformed line in frame t [3]. Each of the
components is weighted according to the parallel and
perpendicular variances of the corresponding lines.
The error measure can be written as:

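$$E_i = \sum_{j=1}^{2} \Big\{ w_{\perp i}\,\big[(D_{ij}\,\mathbf{r} + \mathbf{t} - \mathbf{p}'_{ij}) \cdot \hat{\mathbf{n}}'_i\big]^2 + w_{\parallel i}\,\big[(D_{ij}\,\mathbf{r} + \mathbf{t} - \mathbf{p}'_{ij}) \cdot \hat{\mathbf{l}}'_i\big]^2 \Big\} \qquad (2)$$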

where $i$ is the $i$th corresponding pair, $j$ refers to
endpoint 1 or 2, $w_{\perp i}$ and $w_{\parallel i}$ are the weights for
the perpendicular and parallel error components,
$D = \begin{bmatrix} x & -y \\ y & x \end{bmatrix}$ is the data matrix which is constructed
using the endpoint $\mathbf{p} = [x \;\; y]^T$ in frame
t, vector $\mathbf{r} = [s\cos\omega_z \;\; s\sin\omega_z]^T$ is the product
of scale $s$ and rotation, $\omega_z$, around the optical axis,
and $\hat{\mathbf{n}}'$ and $\hat{\mathbf{l}}'$ are the unit normal and direction, respectively,
of the line in frame t+1. It is clear from
Figure 3 that the first term in the above equation
is the weighted perpendicular distance from the
affine transformed endpoint of a line at t to the corresponding
line in the next frame. The second term
is the weighted longitudinal distance. The weights
associated with each of the error components can be
chosen appropriately for both points and lines extracted
from the image data. For example, for lines
typically $w_\perp$ is much larger than $w_\parallel$, reflecting the
known noise characteristics of most line extraction
algorithms.

For a set of line correspondences, the unknown
parameters $\mathbf{r}$ and $\mathbf{t}$ can be found by minimizing
the sum of the $E_i$. Through a series of simple algebraic
manipulations it can be shown that the following linear
system gives the solution:

$$M_{tot}\,\mathbf{v}_{aff} = \mathbf{v}_{tot} \qquad (3)$$

where $M_{tot}$ and $\mathbf{v}_{tot}$ are the data matrix and vector,
respectively, and $\mathbf{v}_{aff}$ is the vector of the unknown
affine parameters (for full details, see [14]).
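
The following is a sketch of how such a linear system might be assembled, under the assumption that each endpoint contributes a weighted perpendicular and a weighted parallel residual that is linear in the four unknowns (r1, r2, tx, ty). The function name `solve_affine`, the array layout, and the default weights are illustrative assumptions, not the formulation of [14].

```python
import numpy as np

def solve_affine(lines_t, lines_t1, w_perp=1.0, w_par=0.01):
    """Estimate (r1, r2, tx, ty) from matched lines.

    lines_t, lines_t1: arrays of shape (N, 2, 2) holding N lines,
    each with two endpoints (x, y), in frames t and t+1.
    """
    M = np.zeros((4, 4))   # accumulated normal matrix (plays the role of M_tot)
    v = np.zeros(4)        # accumulated right-hand side (plays the role of v_tot)
    for a, b in zip(lines_t, lines_t1):
        d = b[1] - b[0]
        l_hat = d / np.linalg.norm(d)              # unit direction in frame t+1
        n_hat = np.array([-l_hat[1], l_hat[0]])    # unit normal in frame t+1
        for (x, y), p_new in zip(a, b):
            # D r + t written as A @ [r1, r2, tx, ty]
            A = np.array([[x, -y, 1.0, 0.0], [y, x, 0.0, 1.0]])
            for w, u in ((w_perp, n_hat), (w_par, l_hat)):
                g = A.T @ u
                M += w * np.outer(g, g)
                v += w * g * (u @ p_new)
    params = np.linalg.solve(M, v)
    return params, M
```

The perpendicular weight is set much larger than the parallel one, reflecting the remark that line extraction locates a line's position better than its endpoints along the line.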

Given the model of uncertainty of the constituent
lines in a structure, the covariances of the output
affine parameters can be expressed as follows [15]:

$$\Lambda_{r,t} = M_{tot}^{-1} \qquad (4)$$

where $\Lambda_{r,t}$ is the 4x4 covariance matrix of the
affine parameters $\mathbf{r}$ and $\mathbf{t}$.

3.3 Tracking Shallow Structures

The  affine  motion  constraint  developed  in  the  pre¬ 
vious  section  can  be  used  in  a  dynamic  model  to 
predict  and  track  shallow  structures  over  time. 

Tracking requires the following three components:

1.  A  dynamic  model  of  the  motion  (or  change 
of  state  in  Kalman  filtering  terminology)  of  a 
structure. 

2.  A  match  measure  to  choose  good  matches  for 
a  structure  in  every  newly  acquired  frame. 
The  predictions  for  searching  for  the  potential 
matches  are  provided  by  the  dynamic  model. 

3.  A  fusion  of  the  current  estimate  of  the  affine 
motion  and  the  3D  location  parameters  of  a 
structure with those obtained from the newly
acquired  data. 

All these components have to account for errors
in extracting tokens (measurement noise) in images
and also errors in modeling the dynamics of a structure
(plant noise). Three sources of error have to be
accounted for:

1.  Measurement  uncertainty  in  the  image  data  on 
which  the  prediction  is  based. 

2.  Departures  from  modeled  predictions  of  motion, 
(e.g.  non-uniform  motion). 

3.  Errors in the affine description due to departures
from a fronto-parallel plane for the real
shallow structure.

3.3.1 Dynamic Models

Kalman filtering provides a natural framework for
expressing the dynamic models, which can account
for these various sources of error in a single unified
framework. There are two dynamic models for
the problem at hand. The first model is used to
predict the affine parameters between frames t+1
and t given the parameters at time t. This is based
on a model of uniform 3D motion. Modeling noise
(plant noise) is added to account for departures from
uniformity and also to account for any motion
discontinuity. This model can be expressed in its
general form as:

$$A_{t+1,t} = f_1(A_t) + \eta \qquad (5)$$

where $\eta$ is the plant noise term. The exact function
$f_1$ was presented in [14]. It was also shown that
in most practical scenarios, the contribution of the
plant noise term can be limited to the translational
part of the affine parameters, thus considerably simplifying
the computation. However, the model will
also handle modeling noise in the other parameters.
Note that the covariances of the predicted
affine parameters are a combination of the covariances
of the current parameters and the plant noise.
Recall that the covariances of the current parameters
already include the effects of measurement noise
in the image data (Equation 4).
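
To illustrate how the predicted covariance combines the current parameter covariance with plant noise, here is a hedged sketch of a standard linearized Kalman prediction step. The exact function f_1 is given in [14] and is not reproduced here; the Jacobian `F`, the function name `predict_affine`, and the numeric values are assumptions for illustration only.

```python
import numpy as np

def predict_affine(a, cov_a, F, Q):
    """Linearized prediction: a_pred = F a, cov_pred = F cov_a F^T + Q."""
    a_pred = F @ a
    cov_pred = F @ cov_a @ F.T + Q
    return a_pred, cov_pred

# State ordered as (r1, r2, tx, ty); all values below are assumed.
a = np.array([1.0, 0.0, 2.0, 3.0])
cov_a = 0.01 * np.eye(4)
F = np.eye(4)                          # stand-in for the Jacobian of f_1
Q = np.diag([0.0, 0.0, 0.5, 0.5])      # plant noise on the translation only
a_pred, cov_pred = predict_affine(a, cov_a, F, Q)
```

Restricting `Q` to the translational entries mirrors the simplification mentioned in the text, where plant noise is limited to the translational part of the affine parameters.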

The second dynamic model predicts the image location
of a shallow structure at time t+1 given the
predicted affine motion parameters and the current
image location at t.

$$L_{t+1} = f_2(A_{t+1,t}, L_t) + \xi \qquad (6)$$

The modeling noise term $\xi$ accounts for the third
source of error mentioned above (departures from
a fronto-parallel plane for the real shallow structure),
and also for the measurement noise in the image
measurements at t+1. Again, the covariances of
the predicted location are a fused estimate of the
covariances of the predicted affine parameters and
the measurement noise in the current location estimate.
Note that the function $f_2$ is the one shown in
Equation 1.

3.3.2  Model  Matching 

Kalman filtering provides a basis for predictions
as well as fusion of uncertain information over time
through recursive estimation. However, the correspondence
problem also has to be addressed for
tracking. Given the noisy predictions and a number
of potential matches with the associated modeling
and measurement covariances, we use the Mahalanobis
distance [12, 14] for matching the predictions
with potential matches in a newly acquired frame.
Note that the match measure is computed for the
shallow structure as a whole and not for its constituent
lines individually. The covariances of the
state vector associated with the model couple the
various parameters of the model as a whole. This
provides an implicit figural context for unambiguous
matching while accounting for modeling and measurement
uncertainties.
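
A minimal sketch of a Mahalanobis match measure computed on a whole state vector is shown below. It assumes that the prediction and the candidate measurement are independent, so their covariances add; the function name `mahalanobis_sq` and the layout of the state vector are illustrative assumptions.

```python
import numpy as np

def mahalanobis_sq(pred, cov_pred, meas, cov_meas):
    """Squared Mahalanobis distance between a predicted and a measured state."""
    d = np.asarray(meas, dtype=float) - np.asarray(pred, dtype=float)
    S = np.asarray(cov_pred, dtype=float) + np.asarray(cov_meas, dtype=float)
    return float(d @ np.linalg.solve(S, d))

# Toy two-dimensional state with unit covariances (assumed values).
d2 = mahalanobis_sq([0.0, 0.0], np.eye(2), [2.0, 0.0], np.eye(2))
```

Because the full covariance matrix is used, correlated parameters are judged jointly, which is what lets the structure be matched as a whole rather than line by line.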

4  Does  Affine  Describability  imply 
Shallowness  ? 

Before  presenting  an  algorithm  for  shallow  struc¬ 
ture  identification  based  on  the  formulation  devel¬ 
oped  above,  it  is  instructive  to  compare  the  four  and 
the  six  parameter  affine  transformations. 

4.1 General 2D Affine Transformation

In  the  above  formulation,  we  have  chosen  to 
approximate  a  3D  shallow  structure  by  a  fronto- 
parallel  plane.  What  is  the  resulting  description  if 
a  plane  of  arbitrary  orientation  is  chosen  to  approx¬ 
imate  a  shallow  structure  ?  It  can  be  shown  that 
the  four-parameter  2D  affine  description  generalises 
to  a  six-parameter  2D  affine  description  where  the 
extra  parameters  account  for  the  plane  skew. 

In the context of shape from textons, and object
recognition, Kanade and Kender [11] and Huttenlocher
[10], respectively, have shown that given a
2D affine transformation between two planes, a 3D
similarity transformation (up to a reflection) which
relates the two planes can be recovered. In other
words,  the  relative  orientation  (up  to  a  reflection), 
a  2D  translation  (parallel  to  one  plane)  and  relative 
distance  (inverse  scale)  between  the  planes  can  be 
recovered. 

In  the  context  of  motion,  the  affine  transformation 
describes  a  transformation  between  image  projec¬ 
tions  of  the  same  3D  plane  at  two  time  instants  un¬ 
der  the  weak  perspective  projection.  Thus,  given  a 
general  2D  affine  transformation  describing  the  mo¬ 
tion  of  an  image  patch,  the  3D  rotation,  translation 
parallel  to  the  image  plane,  and  the  depth  (up  to  an 
arbitrary  scale  factor)  can  be  recovered.  Note  that 
the  reconstruction  of  the  plane  can  be  done  only  as  a 
fronto-parallel  plane  at  the  computed  depth.  Its  ab¬ 
solute  orientation,  that  is  its  orientation  in  the  cam¬ 
era  coordinate  system,  is  not  recoverable;  only  the 
relative  orientation  that  relates  the  orientation  of  the 
plane  at  the  two  time  instants  can  be  recovered.  This 
is  also  the  rotation  between  the  two  time  instants. 
Thus,  the  description  of  structure  achieved  by  both 
the  four-parameter  transformation  described  earlier, 
and  a  six-parameter  transformation  is  as  a  fronto- 
parallel  plane. 

4.2  Depth  reconstruction  from  the  4-  and 
6-parameter  transformations 

Either  of  the  two  transformations  could  be  used 
for  reconstructing  shallow  structures.  Given  that  the 
3D  features  (points  and  lines)  in  a  shallow  structure 
are  distributed  in  depth  around  a  nominal  depth. 


the  skew  parameters  in  the  six-parameter  transfor¬ 
mation  account  for  the  foreshortening  effects  due  to 
the  variation  in  depth.  The  four-parameter  trans¬ 
formation  accounts  for  the  nominal  depth  only  and 
not  the  distribution  around  it.  Thus,  in  principle, 
the  general  affine  transformation  is  a  more  accurate 
description  of  the  image  motion  of  a  shallow  struc¬ 
ture  than  the  four-parameter  approximation  used 
here. However, we have found experimentally that
if Huttenlocher's (or Kanade and Kender's) solution
is used to reconstruct the shallow structure, it performs
systematically worse than the four-parameter
approximation when rotations between frames are
small and when the 3D plane has a large slant. This
will  be  discussed  in  more  detail  in  the  section  on  ex¬ 
perimental  results.  Note  that  in  the  case  of  motion 
(unlike  that  of  object  recognition),  where  images  are 
acquired  dynamically  with  close  spacing  in  time,  ro¬ 
tations  between  frames  are  usually  small. 

5  Shallow  Structure  Identification 
and  Tracking  Algorithm 

The  formulation  of  Section  3  on  tracking  within 
the  affine  constraints  is  embedded  in  an  algorithm  to 
automatically  identify  shallow  structures  in  a  scene. 
The essential idea is that if a hypothesised structure
can  be  consistently  tracked  and  its  3D  depth  over 
time  is  consistent  with  a  shallow  structure  model, 
then  the  structure  is  identified  as  shallow,  otherwise 
it  is  labeled  non-shallow.  A  minimal  set  of  three 
lines  (a  triple)  is  used  to  define  a  hypothesised  struc¬ 
ture  as  a  potential  shallow  structure. 

At  startup,  no  information  is  available  about  the 
motion  of  a  hypothesised  structure.  The  line  track¬ 
ing  algorithm  of  Williams  and  Hanson  [18],  which 
matches  lines  to  their  flow-based  predictions,  is  used 
to  generate  the  initial  correspondences.  Using  the 
correspondences  thus  derived,  the  initial  affine  mo¬ 
tion  parameters  and  their  covariances  are  computed 
(Equations  3  and  4). 

After  the  startup  phase,  for  every  newly  acquired 
frame,  the  location  of  the  hypothesised  structure  is 
predicted  using  the  dynamic  model  presented  in  Sec¬ 
tion  3.  Around  each  predicted  line,  a  window  query 
is performed to obtain potential matches for each
line  of  the  structure.  Then  all  possible  potential 
match structures are compared against the predicted
structure. 

In  the  matching  phase,  the  Mahalanobis  distance 
is  computed  for  each  potential  data  triple  against 
the  prediction,  and  the  best  triple  below  a  thresh¬ 
old  is  chosen  as  the  match.  This  threshold  depends 
on  the  model  of  measurement  errors  in  lines  and  al¬ 
lowable  non-uniformity  in  motion.  If  all  the  errors 
are  assumed  to  be  Gaussian,  then  the  Mahalanobis 
distance  has  a  chi-squared  distribution  with  the  ap¬ 
propriate  degrees  of  freedom  [5].  A  threshold  on 
this  distance  can  be  chosen  by  using  the  chi-squared 
value  corresponding  to  a  desired  level  of  confidence 
in  accepting  a  match. 
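
The gating step can be sketched as follows, assuming a 95% confidence level and a small table of standard chi-squared critical values. The number of degrees of freedom depends on the state vector used and is left as a parameter; the names `CHI2_95` and `best_match` are hypothetical.

```python
# Standard 95% chi-squared critical values for 1 to 6 degrees of freedom.
CHI2_95 = {1: 3.841, 2: 5.991, 3: 7.815, 4: 9.488, 5: 11.070, 6: 12.592}

def best_match(distances_sq, dof):
    """Return the index of the smallest squared Mahalanobis distance
    that falls below the chi-squared gate, or None if none qualifies."""
    gate = CHI2_95[dof]
    candidates = [(d, k) for k, d in enumerate(distances_sq) if d < gate]
    return min(candidates)[1] if candidates else None

choice = best_match([15.0, 4.2, 7.9], dof=4)
```

Returning `None` when every candidate exceeds the gate corresponds to the case where no acceptable match is found in the current frame.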

Once  an  acceptable  match  is  found,  in  the  up¬ 
date  phase,  the  model’s  new  motion  parameters  are 
computed.  The  covariances  of  the  current  location 
vector and the computed affine parameters are also
recomputed.  Since  depth  is  a  part  of  the  loca¬ 
tion  vector,  it  is  also  updated.  Additionally,  the 
variance-weighted  sample  mean  and  sample  disper¬ 
sion  of  depth  are  updated  by  incorporating  the  new 
measurement. 
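
One way to maintain a variance-weighted mean and dispersion of depth incrementally is sketched below. It assumes that each measurement is weighted by the inverse of its variance and that dispersion means the weighted variance about the weighted mean; the paper's exact definitions may differ, and the class name `DepthStats` is hypothetical.

```python
class DepthStats:
    """Running inverse-variance-weighted mean and dispersion of depth."""

    def __init__(self):
        self.w_sum = 0.0   # sum of weights seen so far
        self.mean = 0.0    # weighted mean depth
        self.m2 = 0.0      # weighted sum of squared deviations

    def update(self, z, var):
        w = 1.0 / var                      # assumed weighting: inverse variance
        self.w_sum += w
        delta = z - self.mean
        self.mean += (w / self.w_sum) * delta
        self.m2 += w * delta * (z - self.mean)

    @property
    def dispersion(self):
        return self.m2 / self.w_sum if self.w_sum > 0 else 0.0

stats = DepthStats()
stats.update(10.0, 1.0)
stats.update(13.0, 0.5)
```

The incremental form avoids storing past depth estimates, which suits a tracker that processes one frame at a time.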

If an acceptable match is not found in the current
frame, due to inadequacies of line grouping or
because of occlusion, the current prediction is upgraded
to model status and its number-of-frames-missed
count is incremented. The variances of its
motion  parameters  and  those  of  the  line  segments 
for  its  potential  matches  in  the  next  frame  are  in¬ 
creased  and  consequently,  the  search  window  for  the 
next  prediction/matching  phase  is  expanded.  This 
permits  graceful  handling  of  many  of  the  primary 
reasons  for  a  match  not  being  found,  such  as  un¬ 
dergrouping  or  overgrouping,  disappearance  of  lines, 
and motion discontinuities. If a match is reacquired
after a lapse in the previous frame, the motion variances
and the window size are reduced, but not below
the levels set at the start of tracking.

The  tracking  phase  discussed  above  is  repeated  for 
every  newly  hypothesised  shallow  aggregate  struc¬ 
ture  for  a  window  of  frames.  If  1)  it  has  been  tracked 
for  more  than  half  the  frames  in  the  window,  and  2) 
its  depth  dispersion  is  within  an  allowed  limit,  and 
3)  its  residual  affine  description  error  for  all  matched 
frames  is  less  than  a  threshold,  then  it  is  declared  as 
a  shallow  structure,  else  it  is  not  and  is  dropped  from 
further  consideration. 
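
The three-part labeling test described above can be sketched as a single predicate. The thresholds are not given in the text, so `max_dispersion` and `max_residual` are hypothetical parameters, as is the function name `is_shallow`.

```python
def is_shallow(frames_tracked, window, depth_dispersion,
               max_dispersion, residuals, max_residual):
    """Apply the three criteria for labeling a hypothesis as shallow."""
    tracked_enough = frames_tracked > window / 2          # criterion 1
    compact = depth_dispersion <= max_dispersion          # criterion 2
    well_fit = all(r < max_residual for r in residuals)   # criterion 3
    return tracked_enough and compact and well_fit

label = is_shallow(6, 10, 0.2, 0.5, [0.1, 0.3], 1.0)
```

A hypothesis tracked in exactly half of the frames fails the first criterion, since the text requires more than half.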

5.1 Application of the Algorithm

The  algorithm  outlined  above  can  be  applied  to 
image  data  in  either  a  query  mode  or  in  an  auto¬ 
matic  mode.  In  the  query  mode,  a  set  of  lines  is 
presented  to  the  algorithm  as  a  hypothesised  shal¬ 
low  structure.  The  algorithm  tracks  the  structure  as 
described  above  and  declares  it  shallow  or  otherwise. 

In  the  automatic  mode,  triples  of  lines  all  over 
the  image  are  instantiated  as  hypothesised  aggre¬ 
gate  structures  and  the  algorithm  automatically  cy¬ 
cles  through  them  and  labels  any  given  structure  as 
shallow  or  non-shallow.  We  employ  proximity  and 
convexity  as  generic  heuristics  to  create  triples  of 
line  tokens  as  aggregate  hypotheses. 

Given  a  set  of  lines  constituting  a  hypothesised 
shallow  structure  in  frame  1,  the  tracking  algorithm 
is  applied.  The  tracking  is  done  for  a  few  frames 
before  the  structure  is  labeled. 

6  Experimental  Results 

6.1 Tracking Results

We present the tracking results on two image sequences,
cones-seq and room-seq, both of which were
captured with a SONY B/W AVC-D1 camera, with
an effective FOV of 24 by 23 degrees, mounted on a
Denning robot, and digitized to 256-by-242 pixels.
The camera moved into the scene with a translation
magnitude measured to be 1.95 feet for the cones-seq
and 0.39 feet for the room-seq. Two image frames
each of the cones-seq and the room-seq are shown in
Figures 4 and 5, respectively. It is emphasized that
the effective motion is neither purely translational
nor uniform. In each frame lines are extracted using
Boldt's [6] line grouping system.

Figure 4: Frames 1 & 6 of the cones-seq.

Figure 5: Frames 1 & 8 of the room-seq.

For both sequences, Figures 6-8 are to be read
left-to-right and top-to-bottom. In each figure,
panel a) shows the structure highlighted in bold and
overlaid on lines in frame 1. Panel b) highlights
the corresponding structure in frame 2; the correspondence
was derived in the bootstrap phase using
flow-based line tracking [18]. Subsequently, each panel,
starting with panel c), depicts matching for
each successive frame in the sequence. Only the region
around the structure of interest is expanded and shown in detail.
The prediction windows for each line are shown as
shaded areas. The central spine of these windows is
the actual prediction. Thin lines show all the lines in
and around the region of interest. Lines of medium
thickness show the union of sets of potential matches
for each line. If a match is found in a frame, it is
drawn using bold lines.

Figure  6  depicts  tracking  of  three  lines  on  a  cone. 
In  frame  4,  the  left  line  of  the  cone  is  merged  with 
a  door  line  in  the  background  by  the  line  group¬ 
ing  system.  No  match  is  found  for  the  structure 
in  this  frame  but  its  prediction  persists.  In  the  next 
frame, the lines split again and the match is successfully
found. This shows that the system is resilient
to  overgrouping  errors  caused  by  accident  of  view¬ 
points.  The  mechanism  of  model  persistence  allows 
the  internal  model  of  a  shallow  structure  to  persist 
even  when  a  good  match  for  the  data  is  not  found  in 
a  given  frame.  This  could  be  due  to  various  reasons 
—  errors  in  token  extraction,  motion  discontinuity 
and  occlusions. 

[Figure 6 panels: a) Frame 1, b) Frame 2, e) Frame 5, f) Frame 6]

Figure 6: Tracking a cone in the cones sequence.

To highlight the model persistence aspect, results
of tracking an independently moving object are presented
in Figure 7. In the room-seq, along with the
motion of the camera into the scene, an object created
out of Lego blocks was moved independently.
As  the  camera  moves  into  the  scene,  this  object 
moves  from  the  left  to  the  right  (Figure  5).  Further¬ 
more,  during  the  course  of  the  motion,  the  Lego  ob¬ 
ject  gets  completely  occluded  for  three  frames  in  the 
sequence and then re-emerges. Also, between frames
6 and 7, while the object is completely occluded,
there is a discontinuity in the motion of the camera.
The vehicle on which the camera is mounted
encounters an upward sloping ramp. As a result, the
predictions for the object after frame 6 go in a direction
opposite to the direction of the actual motion in
the image. In spite of this compounded effect of motion
discontinuity and occlusion, the mechanism of
model persistence and plant noise in the models enable
the algorithm to maintain the object's identity
and track it after it reappears.

The  process  is  depicted  in  Figure  7.  Note  that  in 
frame  7,  when  the  object  is  still  occluded,  there  are 
many potential matches within the search window
but  no  good  match  is  found  because  none  of  them 
satisfies  the  match  criterion  for  the  model  as  a  whole. 


[Figure 7 panels: a) Frame 1, b) Frame 2, e) Frame 5, f) Frame 6, g) Frame 7, h) Frame 8, i) Frame 9, j) Frame 10]

Figure 7: Tracking of a shallow independently
moving object in the room-seq.


In contrast, if the matching was done on a per-line
basis, false matches could have easily been found.
In frame 8, when the object reappears from behind
the occluding surface, there is a lot of clutter in the
search windows along with the good match. Also,
notice  that  due  to  the  motion  discontinuity  while 
the  object  was  occluded,  the  prediction  (center  of 
the  search  window)  has  moved  considerably  away 
from  the  actual  data.  However,  the  model  matching 
technique  that  accounts  for  modeling  and  measure¬ 
ment  noise  is  able  to  find  the  object. 


In order to use the tracking for automatic segmentation
of shallow structures, it is to be shown
that the tracking algorithm is unable to track non-shallow
structures. With this, the affine trackability
constraint can be applied to the discrimination task
(shallow vs. non-shallow).

Figure 8 shows the attempt at tracking a non-shallow
structure in the cones-seq. Two lines on
a cone and one on a structure in the background
have been chosen for this illustration. The figure
illustrates the fact that an affine description
is inadequate for describing the motion of a non-shallow
structure. This is made explicit by the non-trackability
of the structure over time. The structure
is tracked up to frame 3 but beyond the fourth
frame, the predictions deviate from the data and
hence the model is lost. The deviation from shallowness
is clearly expressed in the computed affine
parameters in that the computed depth of the structure
indicates that it is receding from rather than approaching
the camera (i.e. depth is negative). The
predictions in frames 5 and 6 are explicitly shown
using dotted lines to contrast them with the data.

[Figure 8 panels: e) Frame 5, f) Frame 6]

Figure 8: Non-trackability of a non-shallow
triple. Two lines on a cone and one of the
doorway lines in the background.


6.2 Segmentation and Reconstruction Results

The tracking algorithm was applied to the room-seq
and the cones-seq to identify the shallow structures
in the scene. Line triples were automatically
selected  to  hypothesise  aggregate  structures.  Each 
of  these  was  tested  for  affine  trackability,  resulting  in 
its  labeling  as  a  shallow  or  a  non-shallow  structure. 
Figures  9  and  10  show  the  structures  identified  as 
shallow  by  the  algorithm  in  the  two  sequences.  In 
the  room-seq  and  the  conea-geq,  79  and  121  triples 
were  found  out  of  a  total  number  of  180  and  167 
lines,  respectively. 

In the cones-seq, of the two cones (cones
5 and 6) in the center of the image, two lines on
cone 5 and a line on cone 6 are merged together as


Table 2: Computed vs. Measured Depths of
some Objects in the room-seq (in feet).

Object   Meas. Z   Comp. Z   Err. (%)
1        8.3       8.02      -3.4
2        13.4      12.48     6.0
3        14.57     14.6      [illegible]
4        18.98     18.78     -1.1
5        11.57     11.78     1.8
6        19.04     18.01     -5.4
7        20.35     19.16     -5.8
8        20.35     19.84     -2.5

a single shallow structure. This is because they are
close to the FOE and are far enough away that their
image motion is small. However, cone 6 is correctly
labelled as a shallow structure.

The depth of some salient structures was measured
with a tape measure. Tables 1 and 2 show
a comparison of this ground truth with the computed
depths for the cones-seq and the room-seq,
respectively. The objects referred to in the tables
are labelled in Figures 9 and 10.


Figure 10: Shallow structures identified in the
room-seq.

6.2.1 Comparison between the 4- and
6-parameter affine reconstruction

In  Section  4,  the  relative  merits  of  the  3D  re¬ 
construction  of  a  shallow  structure  using  the  four 
and  six-parameter  affine  transformations  were  dis¬ 
cussed.  Now  we  present  some  results  of  this  com¬ 
parison.  The  results  are  illustrated  on  an  image  se¬ 
quence  of  a  scene  in  which  shallow  planar  structures 
of a number of orientations are present. This sequence
is called the comp-seq; two frames are shown
in Figure 11. The approximate translation-in-depth
between consecutive frames is 1.4 feet.

The  depths  of  some  salient  structures  in  the  scene 
were  measured  from  the  camera  in  its  position  in 
frame  1.  Recall  that  both  the  transformations  re¬ 
construct  a  shallow  structure  as  a  fronto-parallel 
plane.  So,  for  structures  which  have  a  large  slant, 
the  ground  truth  depths  are  the  average  depths. 
Figure  12  shows  some  labelled  objects  chosen  for 
the  comparison  and  Table  3  shows  the  measured 


depths and the depths computed by the four- and
six-parameter affine transformations. The depths
of  the  highly  slanted  structures  in  the  left  half  of  the 
image  (objects  1-3  in  the  table)  show  a  larger  error 
for  the  six-parameter  transformation  and  a  consid¬ 
erably  smaller  one  for  the  four-parameter  transfor¬ 
mation.  We  have  confirmed  this  with  simulated  data 
as  well. 


Table 3: Depth comparisons for the four- and
six-parameter affine transformations in the
comp-seq (in feet).

Object   Meas. Z      Comp. Z (4 Affn.)   Err. (%)      Comp. Z (6 Affn.)   Err. (%)
1        29.3         30.0                2.32          24.2                -17.42
2        31.2         31.0                [illegible]   25.2                -19.31
3        33.2         34.4                3.13          25.9                -22.18
4        [illegible]  26.0                1.01          [illegible]         [illegible]
5        35.8         34.2                [illegible]   32.4                -9.60
6        28.2         28.3                0.39          25.9                [illegible]

Figure 12: Labelled objects in the comp-seq
shown in frame 1 lines.

There are two reasons for the above behavior.
First, the computation of the six-parameter transformation
is sensitive to the orientation of the lines
chosen in the image plane. In order to understand
the second reason, the equations for the 3D parameters
developed in [11] can be used. The relationship
between the relevant 3D parameters and the affine
transformation can be written as:

$$ s^2 = \det(A)\,\frac{\cos\sigma_2}{\cos\sigma_1} \qquad (7) $$

where $s$ is the scale parameter, which equals the ratio
of the average depths of the shallow structure at instants
1 and 2, $A$ is the $2 \times 2$ matrix of the affine transformation,
and $\sigma_1$ and $\sigma_2$ are the slants of the planar
approximation of the shallow structure at the two
time instants. For the case of motion with small rotations,
the difference in the slants, say $\delta\sigma$, is small,
that is, $\sigma_2 \approx \sigma_1 + \delta\sigma$. With this approximation,
the above equation can be written as:

$$ s^2 = (1 - \delta\sigma\,\tan\sigma_1)\,\det(A) \qquad (8) $$

$\delta\sigma$ is related to the relative orientation or rotation
between the planar positions at the two time instants.
The error in $s$ for a small error in $\delta\sigma$ can
be expressed as:

$$ d(s^2) = -\tan\sigma_1\,\det(A)\,d(\delta\sigma) \qquad (9) $$

Recall that for the six-parameter affine transformation,
only $\delta\sigma$ and not $\sigma_1$ (or $\sigma_2$) can be computed.
Equation 9 shows that a small error in computing
$\delta\sigma$ is magnified by the factor $\tan\sigma_1$. This factor
varies from 0 to $\infty$ as $\sigma_1$ goes from 0 to 90 degrees.
Thus, for higher slants the error in scale and
in the corresponding depth is higher.
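
The linearisation behind this argument can be made explicit. The short derivation below assumes only that the second slant differs from the first by a small angle, $\sigma_2 = \sigma_1 + \delta\sigma$ with $|\delta\sigma| \ll 1$; it concerns the slant ratio alone and is offered as a worked step, not as a reproduction of the derivation in [11].

```latex
\begin{align*}
\frac{\cos\sigma_2}{\cos\sigma_1}
  &= \frac{\cos(\sigma_1 + \delta\sigma)}{\cos\sigma_1}
   = \cos\delta\sigma - \tan\sigma_1\,\sin\delta\sigma \\
  &\approx 1 - \delta\sigma\,\tan\sigma_1 , \\[4pt]
\frac{\partial}{\partial(\delta\sigma)}
  \bigl(1 - \delta\sigma\,\tan\sigma_1\bigr) &= -\tan\sigma_1 .
\end{align*}
```

For a sense of scale, $\tan 30^\circ \approx 0.58$, $\tan 60^\circ \approx 1.73$ and $\tan 80^\circ \approx 5.67$, so the same error in $\delta\sigma$ has roughly ten times the effect on a structure slanted at 80 degrees as on one slanted at 30 degrees.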

7  Conclusions 

In  this  paper,  we  have  presented  a  framework  for 
the  integration  of  spatial  constraints  on  generic  ob¬ 
ject  structure  and  temporal  constraints  on  smooth 
motion  to  achieve  a  semantically  useful  description 
of  a  scene  from  a  sequence  of  images.  The  mo¬ 
tion  of  shallow  structures  in  the  image  plane  can 
be  described  by  an  affine  transformation.  Instead  of 
clustering image features, observed over two frames,
into  an  object  hypothesis  that  is  consistent  with  a 
shallow  structure  interpretation,  the  temporal  evo¬ 
lution  of  a  hypothesised  structure  is  used  to  verify 
its  consistency  within  the  constraints  of  a  shallow 
structure.  Temporal  evolution  is  characterised  by 
the  trackability  of  a  structure  under  the  affine  con¬ 
straint.  Thus,  a  scene  can  be  divided  into  shallow 
structures  and  the  rest  by  the  use  of  tracking  as  a 
verification  process.  Further,  3D  reconstruction  for 
the  identified  shallow  structures  can  be  done  without 
an  explicit  computation  of  the  3D  motion  parame¬ 
ters. 
Abstract

Most  robotic  grasping  tasks  assume  a  station¬ 
ary  or  fixed  object.  In  this  paper,  we  ex¬ 
plore  the  requirements  for  grasping  a  moving  ob¬ 
ject.  This  task  requires  proper  coordination  be¬ 
tween  at  least  3  separate  subsystems:  real-time 
vision  sensing,  trajectory-planning/arm-control, 
and  grasp  planning.  As  with  humans,  our  sys¬ 
tem  first  visually  tracks  the  object’s  3-D  posi¬ 
tion.  Because  the  object  is  in  motion,  this  must 
be  done  in  real-time  to  coordinate  the  motion 
of  the  robotic  arm  as  it  tracks  the  object.  The 
vision system is used to feed an arm control
algorithm that plans a trajectory. The arm control
algorithm is implemented in two steps: 1)
filtering  and  prediction,  and  2)  kinematic  trans¬ 
formation  computation.  Once  the  trajectory  of 
the  object  is  tracked,  the  hand  must  intercept 
the  object  to  actually  grasp  it.  We  present  ex¬ 
perimental  results  in  which  a  moving  model  train 
is  tracked,  stably  grasped,  and  picked  up  by  the 
system. 

1  INTRODUCTION 

The  focus  of  our  work  is  to  achieve  a  high  level 
of  interaction  between  a  real-time  vision  system 

*This work was supported in part by DARPA contract
N00039-84-C-0165, NSF grants DMC-86-05065, DCI-86-08845,
CCR-86-12709, IRI-86-57151, IRI-88-1319, North
American Philips Laboratories, Siemens Corporation and
Rockwell Inc.


that  is  capable  of  tracking  moving  objects  in  3- 
D  and  a  robot  arm  that  contains  a  dexterous 
hand  that  can  be  used  to  intercept,  grasp  and 
pick  up  a  moving  object.  We  are  interested  in 
exploring  the  interplay  of  hand-eye  coordination 
for  dynamic  grasping  tasks  such  as  grasping  of 
parts  on  a  moving  conveyor  system,  assembly  of 
articulated  parts  or  for  grasping  from  a  mobile 
robotic  system.  Coordination  between  an  organ¬ 
ism’s  sensing  modalities  and  motor  control  sys¬ 
tem  is  a  hallmark  of  intelligent  behavior,  and  we 
are  pursuing  the  goal  of  building  an  integrated 
sensing  and  actuation  system  that  can  operate  in 
dynamic  as  opposed  to  static  environments.  The 
algorithms  we  have  developed  that  relate  sensing 
to  actuation  are  quite  general  and  applicable  to 
a  variety  of  complex  robotic  tasks  that  require 
visual  feedback  for  arm  and  hand  control. 

The  system  we  have  built  addresses  three 
distinct  problems  in  robotic  hand-eye  coordina¬ 
tion  for  grasping  moving  objects:  fast  computa¬ 
tion  of  3-D  motion  parameters  from  vision,  pre¬ 
dictive  control  of  a  moving  robotic  arm  to  track  a 
moving  object,  and  grasp  planning.  The  system 
is  able  to  operate  at  approximately  human  arm 
movement  rates,  using  visual  feedback  to  track, 
stably grasp, and pick up a moving object.

The  system  consists  of  two  fixed  cameras 
that can image a scene containing a moving object
(see Figure 1). A PUMA-560 with a parallel-jaw
gripper attached is used to track the object
with the goal of stably grasping and picking up the
object as it moves. The system operates as follows:

1.  The  imaging  system  performs  a  stereoscopic 
optic-flow  calculation  at  each  pixel  in  the  im¬ 
age.  From  these  optic-flow  fields,  a  motion 
energy  profile  is  obtained  that  forms  the  ba¬ 
sis  for  a  triangulation  that  can  recover  the  3- 
D  position  of  a  moving  object  at  video  rates. 

2.  The  3-D  position  of  the  moving  object  com¬ 
puted  by  step  1  is  initially  smoothed  to  re¬ 
move  sensor  noise,  and  a  non-linear  filter  is 
used  to  recover  the  correct  trajectory  param¬ 
eters  which  can  be  used  for  forward  predic¬ 
tion,  and  the  updated  position  is  sent  to  the 
trajectory-planner/arm-control system.

3.  The  trajectory  planner  updates  the  joint 
level  servos  of  the  arm  via  kinematic  trans¬ 
form  equations.  An  additional  fixed  gain 
filter  is  used  to  provide  servo-level  control 
in  case  of  missed  or  delayed  communication 
from  the  vision  and  filtering  system. 

4.  Once  tracking  is  stable,  the  system  com¬ 
mands  the  arm  to  intercept  the  moving  ob¬ 
ject  and  the  hand  is  used  to  stably  grasp  the 
object  and  pick  it  up. 

The  following  sections  of  the  paper  describe 
each  of  these  subsystems  in  detail  along  with  ex¬ 
perimental  results. 
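
The fixed-gain filter mentioned in step 3 can be illustrated with a small sketch. The code below is a generic α-β-γ tracker for a single coordinate, written in Python purely for illustration. The class name, the gain values used in any example, and the gain convention (β/T and 2γ/T² scaling of the residual) are assumptions and are not taken from the system described here. When no measurement arrives in a servo period, the filter coasts on its constant-acceleration prediction, which is the behaviour needed when a host update is missed or late.

```python
from dataclasses import dataclass


@dataclass
class AlphaBetaGamma:
    """Fixed-gain alpha-beta-gamma tracker for one coordinate (illustrative)."""
    alpha: float
    beta: float
    gamma: float
    dt: float        # servo period in seconds
    x: float = 0.0   # position estimate
    v: float = 0.0   # velocity estimate
    a: float = 0.0   # acceleration estimate

    def step(self, z=None):
        """Advance one period; z is the new measurement, or None if it is late."""
        # Constant-acceleration prediction over one period.
        x_pred = self.x + self.v * self.dt + 0.5 * self.a * self.dt ** 2
        v_pred = self.v + self.a * self.dt
        if z is None:
            # No fresh input: coast on the prediction, keep the acceleration.
            self.x, self.v = x_pred, v_pred
            return self.x
        # Correct the prediction with fixed gains applied to the residual.
        r = z - x_pred
        self.x = x_pred + self.alpha * r
        self.v = v_pred + (self.beta / self.dt) * r
        self.a = self.a + (2.0 * self.gamma / self.dt ** 2) * r
        return self.x
```

One such filter per coordinate would be stepped once per servo period, with `step(None)` called whenever the period elapses without a new input.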

2  PREVIOUS  WORK 

Previous  efforts  in  the  areas  of  motion  tracking 
and real-time control are too numerous to list
exhaustively here. We instead list some notable
efforts that have inspired us or use similar
approaches. Burt et al. [1] have focused
on  high  speed  feature  detection  and  hierarchi¬ 
cal  scaling  of  images  in  order  to  meet  the  real¬ 
time  demands  of  surveillance  and  other  robotic 
applications.  Related  work  has  been  reported  by 
Lee  and  Wohn  [2]  and  Wiklund  and  Granlund 
[3]  who  use  image  differencing  methods  to  track 
motion.  Corke,  Paul  and  Wohn  [4]  report  a 
feature  based  tracking  method  that  uses  special 
purpose  hardware  to  drive  a  servo-controller  of 
an  arm-mounted  camera.  Goldenberg  et  al.[5] 
have  developed  a  method  that  uses  temporal  fil¬ 
tering  with  similar  hardware  to  our  own.  Luo, 
Mullen and Wessel [6] report a real-time implementation
of motion tracking in 1-D based on
Horn and Schunck's method. Verghese et al. [7]
report  real-time,  short-range  visual  tracking  of 
objects  using  a  pipelined  system  similar  to  our 
own.  Safadi  [8]  uses  a  tracking  filter  similar  to 
our  own  and  a  pyramid  based  vision  system, 
but  few  results  are  reported  with  this  system. 
Rao  and  Durrant-Whyte  [9]  have  implemented  a 
Kalman  filter  based  de-centralized  tracking  sys¬ 
tem  that  tracks  moving  objects  with  multiple 
cameras.  Miller  [10]  has  integrated  a  camera  and 
arm  for  a  tracking  task  where  the  emphasis  is  on 
learning  kinematic  and  control  parameters  of  the 
system.  Weiss  et  al.  [11]  also  use  visual  feedback 
to  develop  control  laws  for  manipulation.  Brown 
[12]  has  implemented  a  gaze  control  system  that 
links  a  robotic  “head”  containing  binocular  cam¬ 
eras  with  a  servo  controller  that  allows  one  to 
maintain a fixed gaze on a moving object. Clark
and Ferrier [13] also have implemented a gaze
control system for a mobile robot. A variation
of the tracking problem is the case of moving
cameras. Some of the papers addressing this
interesting problem are [14, 15, 16].

The  majority  of  literature  on  the  control 
problems  encountered  in  motion  tracking  exper¬ 
iments  is  concerned  with  the  problem  of  gener¬ 
ating  smooth,  up-to-date  trajectories  from  noisy 
and delayed outputs from different vision algorithms.
Our previous work [17] coped with that
problem in a way similar to [18], using an
α-β-γ filter, which is a form of steady-state
Kalman filter. A similar approach can be
found  in  papers  by  [19,  20,  21].  In  [19]  a  sophis¬ 
ticated  control  scheme  is  described  which  com¬ 
bines  a  Kalman  filter’s  estimation  and  filtering 
power  with  an  optimal  (LQG)  controller  which 
computes  the  robot’s  motion.  The  authors  have 
presented  good  tracking  results,  as  well  as  stated 
that  the  controller  is  robust  enough  so  the  use 
of  more  complex  (time-varying  LQG)  methods  is 
not  justified.  The  choice  of  gain  matrices  in  the 
cost  function  and  the  best  set  of  noise  variances  is 
done empirically. Paper [20] addresses the problem
of uncertainty in the position of the cameras in the
robot's coordinate frame. The requirement that the
cameras be strictly fixed in the robot's frame can be
quite inconvenient, since each time they are displaced
(most often accidentally) they must be recalibrated,
which is a tedious job. Again, the estimation of the
moving object's position and orientation
is done in Cartesian space and a simple
error model is assumed. Paper [21] adopts a
3rd-order Kalman filter in order to allow a robotic
system (with two degrees of freedom) to
play the labyrinth game.

A somewhat different approach has been explored
in papers [22, 23, 24]. The auto-regressive
(AR) and auto-regressive moving-average with
exogenous input (ARMAX) models are investigated.
It is worth pointing out, as stated
in [22], that this is more of an implementation
than a conceptual difference from the classical
Kalman-filter approach, since the coefficients of the
polynomials in the ARMAX model depend on the
Kalman gains.


3  VISION  SYSTEM 

The  vision  system  used  in  this  research  is  de¬ 
scribed  in  detail  in  [17]  and  we  briefly  review  the 
method  here.  In  a  visual  tracking  problem,  mo¬ 
tion  in  the  imaging  system  has  to  be  translated 
into  3-D  scene  motion.  Our  approach  is  to  ini¬ 
tially  compute  local  optic-flow  fields  that  mea¬ 
sure  image  velocity  at  each  pixel  in  the  image. 
A variety of techniques for computing optic-flow
fields have been used with varying results, including
matching-based techniques [25, 26, 27],
gradient-based techniques [28, 29, 30], and
spatio-temporal energy methods [31, 32]. Optic-flow
was  chosen  as  the  primitive  upon  which  to  base 
the  tracking  algorithm  since  it  can  be  extracted 
quickly  and  reliably  from  our  images,  and  it 
quantifies  actual  motion  in  the  scene  which  we 
need to detect. We are using two fixed cameras
that are calibrated with the 3-D scene, but there
is no explicit need to use registered (i.e., scan-line
coherent) cameras. The identical algorithm for
extracting optic-flow is run on each camera's image
in parallel using the PIPE parallel image processor
[33]. Once the motion centroids are known
for each camera, they are back-projected into the
scene using the camera calibration matrices and
triangulated to find the actual 3-D location of
the movement. This 3-D position is computed
every 1/60th of a second, but with a processing delay
of roughly 10 msec.
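
The triangulation step can be sketched as follows. Assuming each camera's calibration has already been used to turn its motion centroid into a ray (a camera centre plus a direction), one simple estimate of the 3-D position is the midpoint of the shortest segment joining the two rays. The function names and the ray representation are illustrative assumptions; the text does not say which triangulation method the system uses.

```python
def _sub(a, b):
    """Component-wise difference of two 3-vectors."""
    return [x - y for x, y in zip(a, b)]


def _dot(a, b):
    """Dot product of two 3-vectors."""
    return sum(x * y for x, y in zip(a, b))


def triangulate_midpoint(o1, d1, o2, d2):
    """Midpoint of the closest points on two back-projected rays.

    o1, o2: camera centres; d1, d2: ray directions through each camera's
    motion centroid (they need not be unit length).
    """
    w = _sub(o1, o2)
    a, b, c = _dot(d1, d1), _dot(d1, d2), _dot(d2, d2)
    d, e = _dot(d1, w), _dot(d2, w)
    denom = a * c - b * b
    if abs(denom) < 1e-12:
        raise ValueError("rays are (nearly) parallel; cannot triangulate")
    # Ray parameters that minimise the distance between the two rays.
    t1 = (b * e - c * d) / denom
    t2 = (a * e - b * d) / denom
    p1 = [o + t1 * k for o, k in zip(o1, d1)]
    p2 = [o + t2 * k for o, k in zip(o2, d2)]
    return [(u + v) / 2.0 for u, v in zip(p1, p2)]
```

With noisy centroids the two rays rarely intersect exactly, which is why a midpoint (or a least-squares point) is used rather than an exact intersection.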


4  ARM  CONTROL 

The  second  part  of  the  system  is  the  arm  control. 
The  robotic  arm  has  to  be  controlled  in  real-time 
to follow the motion of the object, using the output
of the vision system. The raw vision system
output is not sufficient as a control parameter
since it is both noisy and delayed
in time. The control system needs to do the following:

•  Filter  out  the  noise  with  a  digital  filter 

•  Predict  the  position  to  cope  with  delays  in¬ 
troduced  by  both  vision  subsystem  and  the 
digital  filter 

•  Perform  the  kinematic  transformations 
which  will  map  the  desired  manipulator’s  tip 
position  from  a  Cartesian  coordinate  frame 
into  joint  coordinates,  and  actually  perform 
the  movement 

At each sampling instant our vision algorithm
provides a position in 3D space as a triplet of
Cartesian coordinates (x, y, z). The task of the
control algorithm is to smooth the trajectory and
predict ahead along it, thus positioning the robot where
the object is during its motion.

A well-known and useful solution is the
Kalman filter approach, because it performs
both smoothing and prediction. However,
the Kalman filter assumes that the noise
applied to the system is white.
Whether that holds depends directly on the
parametrization of the trajectory and, unfortunately in our
case, the simplest possible parametrization,
Cartesian, does not support this noise model.
Our previous work [17] used a variant of this approach
and obtained tracking that was smooth
but not accurate enough to allow actual grasping
of the moving object. Our solution to this problem
was to appeal to a local coordinate system
that was able to model the motion and system
noise characteristics more accurately, thus
producing a more accurate control algorithm.

4.1  The  Model  of  the  3D  Motion 

The main idea in the trajectory parametrization
used in this paper is to describe a point in a local
coordinate frame, relative to the point from
the previous sampling instant, by the triplet of
coordinates $(s, \phi, z)$ where

• $s$ is the length of an arc between two points

• $\phi$ is the "bending" of the trajectory (see figure 2)

• $z$ is the altitude difference between two consecutive
points

Due to the existence of noise, all three coordinates
are random variables with certain distributions.
We have made the following assumptions,
as a result of both reasoning about the
vision algorithm and certain necessary simplifications:

• At sampling instant $k$ our object is at point $P_k$

• At the next sampling instant $k+1$ the object
is at $P_{k+1}$ and the point returned by the
vision algorithm is $Q_{k+1}$

• $Q_{k+1}$ is normally distributed around $P_{k+1}$.
The noise can be expressed by its two components,
tangential $n_t$ and normal $n_n$

• $n_t$ and $n_n$ are both zero-mean, with the same
dispersion, and are taken to be mutually uncorrelated.
Experimentally, it has been determined that
their coefficient of correlation is between 0.1
and 0.2.

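Under these assumptions it can be shown
that the velocity $v$ and curvature $\kappa$ are:

$$ v = \lim_{T \to 0} \frac{s}{T} \qquad (1) $$

$$ \kappa = \lim_{T \to 0} \frac{\tan\phi_0}{s_0} \qquad (2) $$

where $s_0 = \|P_{k+1} - P_k\|$ and
$\phi_0 = \pi - \angle P_{k-1} P_k P_{k+1}$.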

What are the advantages of such a parametrization?
The most obvious one is the simplicity of
the prediction task in this framework: all we need
is to multiply the velocity $v = s/T$ by the time
$T' > T$ we want to predict ahead, and likewise the
"bending" $\phi$. The next advantage is that, in order to
achieve an accurate prediction, we do not need a
high-order model with mostly heuristic tuning
of numerous parameters. The price we have
to pay is that filtering is not straightforward. It
turns out that we cannot just apply a low-pass
filter in order to recover a DC component from
$s$; rather, we need a more elaborate approach
which takes into account the probability distribution
of $s$.

While this model introduces more complexity
than a standard Cartesian model, we will see
below that it is more effective in allowing us to
accurately predict and smooth our trajectory. The
initial experiments with this model separate 3-D
space into an XY plane and the Z axis, and
address these two components of motion separately.
However, the method for the XY plane
can be extended to include another parameter
which will create a full Frenet frame at each instant
of time in the trajectory. Our initial experiments
(described below) tracked a planar curve,
allowing us to use this simplification. Motion in
the Z direction is tracked with a Cartesian displacement
as outlined in [17].

Our model assumes the following coordinate
transformation that relates the moving object's
coordinate frame at one instant with the next
instant in time:

$$ \mathrm{Rot}(z, \phi_0) \circ \mathrm{Trans}(x, s) \circ \mathrm{Trans}(z, \Delta z) \qquad (3) $$

where Rot and Trans are rotation around and translation
along the given axis. Presented as a $4 \times 4$
matrix, transformation (3) is

$$ T_{\mathrm{delta}} =
\begin{bmatrix}
\cos\phi_0 & -\sin\phi_0 & 0 & s\cos\phi_0 \\
\sin\phi_0 & \cos\phi_0 & 0 & s\sin\phi_0 \\
0 & 0 & 1 & \Delta z \\
0 & 0 & 0 & 1
\end{bmatrix} $$
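
To make the role of this matrix concrete, the sketch below builds it and applies it repeatedly to an object pose, which is one way of predicting several sampling instants ahead once the arc length, bending and altitude step have been estimated. It is an illustration only: the function names are hypothetical, and composing the step by post-multiplication (motion expressed in the object's own frame) is an assumption about how the transform is chained.

```python
import math


def delta_transform(phi0, s, dz):
    """4x4 homogeneous matrix for Rot(z, phi0) * Trans(x, s) * Trans(z, dz)."""
    c, sn = math.cos(phi0), math.sin(phi0)
    return [
        [c, -sn, 0.0, s * c],
        [sn, c, 0.0, s * sn],
        [0.0, 0.0, 1.0, dz],
        [0.0, 0.0, 0.0, 1.0],
    ]


def matmul4(a, b):
    """Product of two 4x4 matrices stored as nested lists."""
    return [[sum(a[i][k] * b[k][j] for k in range(4)) for j in range(4)]
            for i in range(4)]


def predict_pose(pose, phi0, s, dz, steps):
    """Apply the per-sample motion `steps` times to a 4x4 object pose."""
    step = delta_transform(phi0, s, dz)
    for _ in range(steps):
        pose = matmul4(pose, step)
    return pose
```

As a sanity check, a bending of 90 degrees per step with unit arc length traces a closed square, so four steps return the object to its starting position.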

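4.2 Probability Distributions of $s$
and $\phi$

In this section, we motivate the choice of
model used to recover the parameter values $s_0$
and $\phi_0$ given the estimate of the arclength $s$. Let
$s = \|Q_{k+1} - P_k\|$ be the distance between the object
and the next position returned by the vision
algorithm. According to figure 2 we have

$$ s = \sqrt{(s_0 + n_t)^2 + n_n^2} \qquad (4) $$

where $n_t$ and $n_n$ are Gaussian with dispersion
$\sigma$. According to the definition of the probability
distribution, we can write the distribution $F(s)$
as

$$ F(s) = \frac{1}{2\pi\sigma^2} \iint_D \exp\!\left(-\frac{(t - s_0)^2 + n^2}{2\sigma^2}\right) dt\, dn \qquad (5) $$

where $D$ is a disk of radius $s$.

Now by introducing the substitution $t = r\cos\theta$,
$n = r\sin\theta$ we get

$$ F(s) = \frac{1}{2\pi\sigma^2} \int_0^{s}\!\!\int_0^{2\pi} r \exp\!\left(-\frac{r^2 - 2 r s_0 \cos\theta + s_0^2}{2\sigma^2}\right) d\theta\, dr $$

The distribution density is given by $f(s) = dF(s)/ds$, or
after differentiation

$$ f(s) = \frac{s}{2\pi\sigma^2} \exp\!\left(-\frac{s^2 + s_0^2}{2\sigma^2}\right) \int_0^{2\pi} \exp\!\left(\frac{s\, s_0 \cos\theta}{\sigma^2}\right) d\theta $$

The last integral can be expressed by a modified
Bessel function $I_0(z)$:

$$ f(s) = \frac{s}{\sigma^2} \exp\!\left(-\frac{s^2 + s_0^2}{2\sigma^2}\right) I_0\!\left(\frac{s\, s_0}{\sigma^2}\right) \qquad (7) $$

A graph of $f(s)$ is given in figure 3. Here $s_0$ is
fixed to 1 and $\sigma$ varies from 0.4 to 1.0. Our job
is to recover $s_0$ given $f(s)$.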

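It is apparent from figure 3 that the peak
value of $f(s)$ depends on $\sigma$, and drifts towards
higher values as $\sigma$ grows. The expectation of $s$
also depends on $\sigma$. In particular, we have

$$ s_1 = E(s) = \int_0^{\infty} s\, f(s)\, ds \qquad (8) $$

[the closed form on the right-hand side of (8), and the
auxiliary function in $I_0$ and $I_1$ defined in (9), are
illegible in the source]

Here $\sigma$ is the constant for the given system and
it is related to $s_0$. In order to estimate $\sigma$ we will
use the second-order moment:

$$ s_2^2 = E(s^2) = \int_0^{\infty} s^2 f(s)\, ds = s_0^2 + 2\sigma^2 \qquad (10) $$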
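
The first and second moments are enough to recover $s_0$ and $\sigma$ numerically. The sketch below is not the table-lookup scheme developed in the text. It assumes that $s$ is the length of a vector whose tip is perturbed by isotropic Gaussian noise (a Rice distribution), uses the standard closed form for the Rice mean, and solves for $y = s_0^2 / (2\sigma^2)$ by bisection. All function names are hypothetical; `m1` and `m2` stand for estimates of $E(s)$ and $E(s^2)$.

```python
import math


def bessel_i(order, x, terms=400):
    """Modified Bessel function I_0 or I_1 of x >= 0, by its power series."""
    half = x / 2.0
    term = 1.0 if order == 0 else half   # k = 0 term of the series
    total = term
    for k in range(1, terms):
        term *= half * half / (k * (k + order))
        total += term
        if term <= 1e-17 * total:
            break
    return total


def rice_mean_over_sigma(y):
    """E[s] / sigma for a Rice distribution, where y = s0**2 / (2 * sigma**2)."""
    h = y / 2.0
    return math.sqrt(math.pi / 2.0) * math.exp(-h) * (
        (1.0 + y) * bessel_i(0, h) + y * bessel_i(1, h))


def recover_s0_sigma(m1, m2, y_max=400.0):
    """Recover (s0, sigma) from m1 = E[s] and m2 = E[s**2] by bisection on y."""
    target = m1 * m1 / m2   # between pi/4 (pure noise) and 1 (no noise)

    def ratio(y):
        # E[s]**2 / E[s**2] as a function of y; E[s**2] = 2 * sigma**2 * (1 + y).
        return rice_mean_over_sigma(y) ** 2 / (2.0 * (1.0 + y))

    lo, hi = 0.0, y_max
    if target <= ratio(lo):
        y = lo
    elif target >= ratio(hi):
        y = hi               # clamp: noise too small to resolve
    else:
        for _ in range(100):
            mid = 0.5 * (lo + hi)
            if ratio(mid) < target:
                lo = mid
            else:
                hi = mid
        y = 0.5 * (lo + hi)
    sigma = math.sqrt(m2 / (2.0 * (1.0 + y)))
    return math.sqrt(2.0 * y) * sigma, sigma
```

Bisection is used here only because it is short and robust; a precomputed interpolation table, as the text goes on to describe, trades this on-line iteration for a single lookup.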


Now by eliminating $s_0$ from (8) and (10) we
have

[equation (11) is illegible in the source]

where $p = s_2/s_1$ and $z = \sigma/s_1$. Now by setting
$x$ = [definition illegible in the source] we end up with an equation

[equation (12) is illegible in the source]

Equation 12 relates our known control inputs
($p = s_2/s_1$) to $x$. We can create a table
of values for this function offline, and then by
interpolation calculate a value of $x$ given $p$.

Let $x_0(p)$ be the solution of (12). Now we
can express $s_0$ and $\sigma$ as functions of $s_1$ and $s_2$ as
follows:

[equation (13), giving $s_0$, is illegible in the source]

$$ \sigma = s_2\,\frac{1}{\sqrt{2 + x_0^2(p)}} \qquad (14) $$

This method requires little on-line computation:
an interpolation table of values of $x_0(p)$ is
all we need to recover the arclength parameter
$s_0$. Figure 5 is the experimentally measured density
of $s$ taken from the triangulated optic-flow
fields. This distribution's resemblance to figure 3
(the theoretical density) is clear.

To find the bending parameter $\phi_0$, we use
the same technique as for the distribution of $s$,
and we get the following formula:

[the density $f(\phi)$ is illegible in the source]

where $k = \sigma/s_0$ and $\phi - \phi_0 \in (-\pi/2, \pi/2)$. It
is obvious that $f$ is symmetric around $\phi_0$, which
also means that the expectation $E\phi = \phi_0$. Hence,
we do not need to perform non-linear filtering
to recover $\phi_0$.

The graph of $f$ for $k = 0.1$ to $0.9$ and $\phi_0 = 0$
is given in figure 6.

4.3 Smoothing of the Control Inputs
In the previous section, we showed how to extract
the parameters $s_0$ and $\phi_0$ from the updated positions
determined by the vision system. The signals
$s_1$ and $s_2^2$ described in equations 8 and 10 are in fact
smoothed versions (expectations) of the control
signals $s$ and $s^2$, which are the arclength and the
arclength squared. The smoothing filter we use to
compute these signals is a moving-average (MA)
filter using a Kaiser window [34]. This filter provides
the largest ratio of signal energy in the main
lobe to that in the side lobes, which usually results in a
filter of lower order. The windowing function is
given by

$$ w_K(n) = \frac{I_0\!\left(\beta\sqrt{1 - (1 - 2n/M)^2}\right)}{I_0(\beta)} $$

where $I_0$ is the modified zeroth-order Bessel function,
$\beta$ is the shape parameter which defines the
width of the main lobe and $M$ is the order of the
filter. According to [34], $\beta$ and $M$ are given by

$$ M \approx \frac{A - 7.95}{14.36\,\Delta\omega} $$

$$ \beta = \begin{cases}
0.1102\,(A - 8.7), & A > 50 \\
0.5842\,(A - 21)^{0.4} + 0.07886\,(A - 21), & 21 \le A \le 50
\end{cases} $$

where $A$ is the stopband attenuation and $\Delta\omega =
(\omega_r - \omega_c)/\omega_s$, $\omega_r$ is the stopband frequency, $\omega_c$ is
the passband frequency and $\omega_s$ is the sampling
frequency.

We have adopted $A = 30$ and $\Delta\omega = 0.05$,
which results in $M = 30$. Since the frequency of
the vision algorithm is about 60 Hz, the overall
length of the window is about 0.5 seconds. We
also apply this MA filter to the bending parameter
$\phi$.

The implementation of the MA filter is straightforward:
once the weights are computed off-line,
a window of $M$ measurements is retained
and each sample is multiplied by the appropriate
weight in every sampling period, which
requires $M$ multiplications and $M - 1$ additions.
This allows reasonably wide windows (even up to
several hundred entries) to be used in computing
the smoothed signal.
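
The design formulas above translate directly into code. The following sketch computes the shape parameter and order from the attenuation and transition width, builds the Kaiser weights, and applies them as a moving average. The function names are hypothetical, the window is taken over taps n = 0..M (M + 1 samples, the usual textbook convention, which may differ by one tap from the implementation described here), the weights are normalised to sum to one so that a constant signal passes unchanged, and the branch for attenuations below 21 dB comes from the standard Kaiser formula rather than from the text.

```python
import math


def bessel_i0(x, terms=200):
    """Modified zeroth-order Bessel function by its power series."""
    half = x / 2.0
    term, total = 1.0, 1.0
    for k in range(1, terms):
        term *= (half / k) ** 2
        total += term
        if term <= 1e-17 * total:
            break
    return total


def kaiser_design(atten_db, delta_w):
    """Return (order M, shape beta) for attenuation A and normalised width."""
    if atten_db > 50.0:
        beta = 0.1102 * (atten_db - 8.7)
    elif atten_db >= 21.0:
        beta = 0.5842 * (atten_db - 21.0) ** 0.4 + 0.07886 * (atten_db - 21.0)
    else:
        beta = 0.0   # rectangular window (standard formula, not in the text)
    order = int((atten_db - 7.95) / (14.36 * delta_w))
    return order, beta


def kaiser_weights(order, beta):
    """Kaiser window over n = 0..order, scaled so the weights sum to one."""
    raw = []
    for n in range(order + 1):
        arg = max(0.0, 1.0 - (1.0 - 2.0 * n / order) ** 2)
        raw.append(bessel_i0(beta * math.sqrt(arg)) / bessel_i0(beta))
    total = sum(raw)
    return [w / total for w in raw]


def moving_average(samples, weights):
    """Smoothed value computed from the most recent len(weights) samples."""
    if len(samples) < len(weights):
        raise ValueError("not enough samples to fill the window")
    recent = samples[-len(weights):]
    return sum(w * x for w, x in zip(weights, recent))
```

With $A = 30$ and $\Delta\omega = 0.05$ the design function returns an order of 30, matching the value quoted above.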


4.4 Prediction and Synchronization

The host computer controls the initial vision processing
and subsequent computation of control
parameters described above. The host computer
is able to predict ahead the trajectory using the
derivation of velocity and curvature in equations
(1) and (2). These updated predictions are sent
to the trajectory generator that is actually controlling
the robot arm. The trajectory generator
is a separate system that has two parallel
tasks: a low-priority task which reads the serial
line receiving updated control signals and a
high-priority task which calculates the transformation
equation and moves the manipulator. Those two
tasks communicate via shared memory. The job
of the robot controlling program is to synchronize
its two tasks (i.e., to obtain mutual exclusion in
accessing shared data), to unpack input packets
read from the serial line, and to update the joint
servos every 30 msec.

The asynchronous nature of the communication
between the host computer and the trajectory
generator can result in missed or delayed
communications between the two systems. Since
the updating of the robotic arm parameters needs
to be done at very tightly specified servo rates (30
msec), it is imperative that the trajectory generator
can provide updated control parameters at
these rates, regardless of whether it has received
a new control input from the host. Therefore, we
have implemented a fixed-gain α-β-γ filter as
part of the trajectory generator [18]. This filter
provides a small amount of prediction to the trajectory
parameters if the control signals from the
host are delayed.

We are using RCCL [35] to control the
robotic arm (a PUMA 560). RCCL (Robot Control
C Language) allows the use of C programming
constructs to control the robot as well as
defining transformation equations (as described
in [36]). The transformation equations permit
dynamic updating of arm position by generating
the 4x4 transform of the moving object's position
from the vision system and sending this
information to the arm control algorithm.


5 GRASPING


The  remaining  part  of  our  system  is  the  intercep¬ 
tion  and  grasping  of  the  object.  We  have  exam¬ 
ined  the  human  psychological  literature  in  order 
to  find  useful  paradigms  for  robotic  visual-motor 
coordination  strategies  that  include  arm  move- 


ment  and  grasping  from  visual  inputs.  In  this 
section  we  briefly  describe  some  relevant  theories 
and  their  relation  to  our  own  work. 

There  are  several  theories  on  the  organiza¬ 
tion  of  skilled  human  motor  control.  Richard 
Schmidt  [37]  has  proposed  a  theory  of  gener¬ 
alized  motor  programs,  or  movement  schemas. 
In  this  view,  a  skilled  action  is  composed  of  an 
ordered  set  of  parametrized  motor  control  pro¬ 
grams  of  short  duration  (less  than  200  msec), 
each  of  which  accomplishes  one  part  of  the  task. 
As one program is completed, the next one is executed.
Generalized motor programs specify
several things: (1) which muscles
to move in a given motion; (2) the order of contraction
of the muscles; (3) the phasing within the
sequence, i.e., the temporal relationships among
the contractions; and (4) the relative force of each
element. At the initiation of a skilled task, the
parameters of the motor control program are
determined by sensory input and task demands, and
then  the  programs  are  executed  to  completion. 
If  the  wrong  program  is  selected  for  some  rea¬ 
son,  the  program  cannot  be  stopped  by  use  of 
sensory  information.  Similarly,  in  playing  table 
tennis,  the  motion  of  the  racket  is  determined  be¬ 
fore  the  beginning  of  the  swing  and  visual  input 
has  little  effect  after  the  initiation  of  motion.  As 
an  example  of  Schmidt’s  theory,  the  skilled  task 
of  grasping  a  moving  object  could  be  partitioned 
into  two  motor  control  schemas:  one  to  position 
the  arm  and  a  second  one  to  control  the  grasping 
action. 

The schema concept maps into Von Hofsten’s
ideas about the development of grasping
skills in children [38]. He believes there are
two separate sensorimotor systems responsible
for reaching: one for approaching the target and
one for grasping it. During early childhood, the
precise timing between these two systems develops
as the child learns how to catch. The reaching
system develops first, before a child is capable
of grasping. But even before he is capable of
closing his hand at precisely the right moment,
he has begun to develop the ability to move his
hand toward a moving object and predict the location
at which his hand will intercept the object.
With growth, a child learns to control the timing
between reaching and grasping, that is, to
close his hand at the correct moment. Experimental
evidence has shown that there is a window
of approximately 14 msec during which the
hand must begin closing. Unlike Schmidt, however,
Von Hofsten does not consider vision and
grasping to be two mutually exclusive tasks [39].
Visual tracking is used to guide the reaching arm
during its motion, not only before motion. A coordinated
motion is a combination of perceptual
schemas and motor schemas (see Iberall and Arbib [40]).

Vision is used during the reaching phase of
the task for what psychologists call “prospective
control”. Prospective control corresponds to
predictive filtering, as used by control theorists. In
grasping  a  moving  object,  it  is  necessary  for  the 
hand  to  move  not  to  the  current  position  of  the 
object,  but  to  plan  ahead  to  where  it  will  be 
shortly.  Vision,  rather  than  haptics,  provides  the 
basis  of  prospective  control  because  touch  can¬ 
not  provide  the  anticipatory  information  required 
to  predict  the  course  of  a  moving  object.  There 
are  two  predominant  theories  about  what  visual 
schema  is  used  to  track  a  moving  object  and 
aid  in  predicting  the  intersection  of  the  reaching 
hand  and  that  object.  Lee  [41]  proposes  the  use 
of  vision  to  measure  the  expansion  of  the  image 
on  the  retina  in  order  to  estimate  the  time  un¬ 
til  contact.  The  attraction  of  this  theory  is  that 
humans  would  not  need  to  compute  the  veloc¬ 
ity  and  location  of  the  moving  object,  but  would 
calculate  the  more  useful  time-until-contact  in¬ 
formation.  A  person  catching  an  object  uses  this 
image  to  compute  when  to  begin  the  correct  mo¬ 
tion  commands  (usually  at  about  300  msec  be¬ 
fore  the  actual  grasp).  Von  Hofsten  disputes  the 
use  of  retinal  expansion  information  because  it 
is  clear  that  people  are  able  to  track  targets  in 
which  there  is  no  such  expansion,  such  as  ob¬ 
jects  that  are  circling  or  passing  across  the  field 
of  view.  He  suggested  an  alternative  schema  in 
which  people  calculate  the  distance  to  a  moving 
object  by  using  the  vergence  angle  to  the  object. 
Vision  seems  to  be  used  predominantly  to  track 
the  moving  object,  but  the  catcher  also  tracks  his 
hand  during  reaching  to  aid  his  nonvisual  propri¬ 
oceptive  senses,  that  is,  to  help  judge  the  position 
of  his  hand  in  relation  to  the  environment.  Fi¬ 
nally,  vision  must  be  used  during  the  reaching 
phase  to  orient  the  hand  correctly  in  relation  to 
the  object  that  is  being  caught. 
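
The time-until-contact idea can be made concrete with a small calculation. For an object of fixed size approaching at constant speed, the ratio of its angular size to the rate at which that angular size grows approximates the time remaining before contact, without requiring distance or velocity separately. The sketch below assumes a small-angle approximation and a simple finite difference between two observations; the function name and sampling scheme are illustrative and are not drawn from [41].

```python
def time_to_contact(theta_prev, theta_curr, dt):
    """Estimate time until contact from two angular sizes taken `dt` apart.

    theta_prev, theta_curr: angular size of the object's image (radians).
    Returns infinity when the image is not expanding.
    """
    expansion_rate = (theta_curr - theta_prev) / dt
    if expansion_rate <= 0.0:
        return float("inf")
    return theta_curr / expansion_rate
```

An object that circles the observer or crosses the field of view produces little or no expansion, so this estimate becomes unbounded; that is the case Von Hofsten raises against relying on expansion alone.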

We also note a relevant fact for human contact
and grasping of objects. The central factor
in the final grasp is the time of the onset of
hand closure. In early childhood (up to about 5
months), closing the hand is triggered primarily
by touch. Children tend to begin grasping only
when they are already in contact with the object.
By the time a child is 13 months old, however,
the hand closes before touch, on average as early
as for adults. We take the view below that our
robotic system is past early childhood: we will
close the hand before actual contact is made.

The  initial  strategy  we  have  adopted  in  pick¬ 
ing  up  the  object  is  an  open  loop  strategy,  simi¬ 
lar  in  spirit  to  the  pre-programmed  motor  control 
schemas  described  in  the  psychological  literature. 
Schmidt’s  schema  theory  holds  that  for  tasks  of 
short  duration,  perception  is  used  to  find  a  set 
of  parameters  to  pass  to  a  motor  control  pro¬ 
gram.  It  is  not  used  during  the  execution  of  a 
task. When grasping a moving object, for example,
once vision has determined the trajectory of
the object, the reach and grasping motor schemas
take over with no interference from vision.

In  our  implementation  of  this  strategy,  vi¬ 
sion  is  not  used  to  continually  monitor  the  grasp¬ 
ing,  but  only  to  provide  a  final  position  and  veloc¬ 
ity  from  which  the  arm  is  directed  to  very  quickly 
move  to  the  object.  This  automatic  movement  is 
done  by  establishing  coordinate  frames  of  action 
for  each  of  the  components  of  the  system  and 
solving  transformation  equations. 

The transformation equations permit dynamic
updating of the arm position by generating
the 4 x 4 transform of the moving object’s
position from the vision system and sending this
information to the arm control algorithm. Because
the movement of the hand requires a small
amount of time during which the object may have
moved, the object’s trajectory is predicted ahead
during the movement using the α-β-γ predictor.
By keeping the fingers of the hand spread
during this maneuver, no actual contact takes
place until the gripper reaches the position of the
moving object.


6 EXPERIMENTAL RESULTS

We have implemented the system described
above in order to demonstrate the capability of
the methods. The goal was to track a moving
model train, intercept it, stably grasp it and pick
it up. The train was moving in an oval trajectory;
however, the system had no a priori knowledge of
this particular trajectory. The setup of our system
is presented in figure 1. The velocity of the
train was 10-20 cm/s. In this section we present
some results obtained by experiments. First, in
figure 7 we have the actual measured arclength
signal s_i (black) and the filtered signal s_o (gray).
It is noticeable that s_o is somewhat below the
expected value of s_i. The nature of s_i is quite
noisy; however, the analysis described in section 4
was able to accurately extract the correct control
signal. The arm control is particularly smooth
and jerk free, as well as being accurate enough
to intercept and grasp the object between the
jaws of the gripper. Figure 8 shows the moving
object’s trajectory points computed by the vision
algorithm (black) and the commanded control
signals after filtering (gray). As can be seen,
the control system is able to accomplish its task
of both smoothing for noise and extracting an
accurate position of the moving object.

Because we are using a parallel jaw gripper,
the jaws must remain aligned with the tangent to
the actual trajectory of the moving object. The
system controls the gripper direction (joint 6 on
the robot) to be parallel to this tangential direction,
allowing grasping to occur at any point in
the trajectory.
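
One way to picture the tangent alignment is to estimate the tangent direction from two nearby filtered trajectory points and turn it into a rotation angle for the gripper. The sketch below is an assumption-laden illustration: it works in the plane of the track, and it treats the parallel jaws as symmetric under a half turn so that the angle can be folded into a half-open interval of width π. Neither detail is stated in the text.

```python
import math


def tangent_angle(p_prev, p_next):
    """Direction of travel (radians) estimated from two trajectory points."""
    return math.atan2(p_next[1] - p_prev[1], p_next[0] - p_prev[0])


def jaw_angle(p_prev, p_next):
    """Gripper rotation parallel to the tangent, folded into [-pi/2, pi/2).

    Assumes a parallel-jaw gripper looks the same after a half turn, so
    the smaller of the two equivalent rotations is returned.
    """
    angle = tangent_angle(p_prev, p_next)
    return (angle + math.pi / 2.0) % math.pi - math.pi / 2.0
```

Folding the angle avoids commanding a needless half turn of the last joint when the object reverses its heading along a closed track.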

Figure  9  shows  3  frames  taken  from  a  video 
tape  of  the  system  intercepting,  grasping  and 
picking  up  the  object  (this  video  tape  has  been 
submitted  to  the  video  portion  of  the  IEEE 
Robotics  and  Automation  conference).  The  sys¬ 
tem  is  quite  repeatable,  and  is  able  to  track 
other  arbitrary  trajectories  in  addition  to  the  one 
shown. 


7 FUTURE WORK AND CONCLUSIONS

We have developed a robust system for tracking
and grasping moving objects. The system
relies on real-time stereo triangulation of optic-flow
fields and is able to cope with the inherent
noise and inaccuracy of visual sensors by applying
parameterized filters that smooth the measurements
and can predict ahead the moving object’s position. Once
this tracking is achieved, a grasping strategy is
applied that performs an analog of human arm
movement schemas.

Our  future  work  is  concerned  with  imple¬ 
menting  other  possible  grasping  strategies.  One 
strategy  we  are  currently  exploring  is  to  visually 
monitor  the  interception  of  the  hand  and  object 
and  use  this  visual  information  to  update  the 
Drive  transform  at  video  update  rates.  This  ap¬ 
proach  is  computationally  more  demanding,  re¬ 
quiring  multiple  moving  object  tracking  capabil¬ 
ity.  The  initial  vision  tracking  described  above  is 
capable  of  single  object  tracking  only.  If  we  at¬ 
tempt  to  visually  servo  the  moving  robotic  arm 
with  the  moving  object,  we  have  introduced  mul¬ 
tiple  moving  objects  into  the  scene. 

We have identified two possible approaches to
tracking these multiple objects visually. The first
is to use the PIPE’s region of interest operator
that can effectively “window” the visual field and
compute different motion energies in each window
concurrently. Each region can be assigned
to a different stage of the PIPE and compute
its result independently. This approach assumes
that the moving objects can be segmented. This
is possible since the motion of the hand in 3-D
is known: we have commanded it ourselves.
Therefore, since we know the camera parameters
and 3-D position of the hand, it will be possible
to find the relevant image-space coordinates
that correspond to the 3-D position of the hand.
Once these are known, we can form a window centered
on this position in the PIPE, and concurrently
compute motion energy of the moving object
and the moving hand in each camera. Each
of these motion centroids can then be triangulated
to find the effective positions of both the
hand and object and compute the new Drive
transform. Both computations must, however,
compete for the hardware histogramming capability
needed for centroid computation, and this
will effectively reduce the bandwidth of position
updating by a factor of 2.
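
The step from a known 3-D hand position to an image window can be sketched with an ideal pinhole camera model. Everything named here (the focal length in pixels, the principal point, the window half-size and the image dimensions) is a hypothetical parameter for illustration; the actual camera calibration and the PIPE region-of-interest interface are not described in the text.

```python
def project(point_cam, focal_px, cx, cy):
    """Project a 3-D point given in the camera frame to pixel coordinates."""
    x, y, z = point_cam
    if z <= 0.0:
        raise ValueError("point must be in front of the camera")
    return (focal_px * x / z + cx, focal_px * y / z + cy)


def window_around(u, v, half_size, width, height):
    """Return (left, top, right, bottom) of a window clipped to the image."""
    col, row = int(round(u)), int(round(v))
    left = max(0, col - half_size)
    top = max(0, row - half_size)
    right = min(width - 1, col + half_size)
    bottom = min(height - 1, row + half_size)
    return (left, top, right, bottom)
```

Motion energy computed inside this window would be attributed to the hand, and motion energy elsewhere to the object, which is the segmentation assumption the approach depends on.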

Another approach is to use a coarse-fine hierarchical
control system that uses a multi-sensor
approach. As we approach the object for grasping,
we can shift the visual attention from the
static cameras used in 3-D triangulation to a single
camera mounted on the wrist of the robotic
hand. Once we have determined that the moving
object is in the field of view of this camera, we
can use its estimates of motion via optic-flow to
keep the object to be grasped in the center of the
wrist camera’s field of view. This control information
will be used to compute the Drive transform
to correctly move the hand to intercept the
object. We have implemented such a tracking
system with a different robotic system [42] and
can adapt this method to this particular task.

Figure 1: Tracking Grasping System


Figure 3: Distribution density f(s), s_0 = 1, σ = 0.4-1.0, increment = 0.1

Figure 7: Input signal s_i (black) and filtered signal s_o (gray)


Overcoming  the  Barriers  to 
Architecture-Independent  Image  Processing 

Jon A. Webb
Pittsburgh, PA 15213-3890


By applying simple ideas from parallel programming
research, it is possible to overcome
the serious barriers in software development to
widespread application of parallel computers
to image processing. This is being done in the
Adapt language [22] project, which exploits
these ideas: complete architecture independence;
the little language approach; and the
simultaneous development of language, parallel
implementations, applications, and a significant
library of image processing programs,
based on the emerging ANSI/ISO standard
Programmer’s Imaging Kernel System.


1.  Introduction 

Image processing is ripe for the application
of parallel computers. Powerful, reliable parallel
computers have been usable for problems
in image processing research for years; and now,
image processing computers based on parallelism
rather than fixed-function hardware pipelining
are becoming available. Indeed parallelism is the only
available source of the power that will be
required to process images of significant size
without inordinate investment in costly,
difficult-to-maintain hardware.

For  years  people  have  searched  for  a  solu¬ 
tion  to  the  problem  of  programming  these  par¬ 
allel  computers.  Some  have  implicitly  assumed 
that  a  powerful  parallel  computer  would  offer 
significant  enough  advantage  to  lure  everyone 
to  adopt  that  machine  as  a  standard,  and  have 
introduced  architecture-specific  languages.  But 
this  approach  has  proved  incapable  of  keeping 
up  with  advances  in  parallel  computer  architec¬ 
ture;  any  architecture-specific  approach  goes 
out  of  date  rapidly  as  new  parallel  computer 
features  are  introduced. 


Others have introduced architecture independence
at various levels, for example by hiding data
distribution from the programmer or by managing the
distribution of tasks automatically. These
approaches do introduce some flexibility, but
they do not help the image processing programmer
solve problems; they merely reduce the
number of new features that must be understood
in order to program an architecture, or
class of architectures. It is the thesis of this
paper that nothing less than complete hiding of the
underlying parallel computer will prove sufficient
for successfully replacing serial with parallel
computers for development of new image
processing algorithms. Anything less creates
unacceptable cost to the programmer, and
unnecessarily restricts the programs developed.
It is possible to achieve this goal through the
use of specialized languages for image processing.

Within the design of such languages, the traditional
approach has been what we shall call
here the big language approach. In big languages,
objects such as images are manipulated
as elementary data objects. This approach has
led to a difficult programming style, proliferation
of primitive operators, loss of understanding
by the programmer of efficiency issues, and
difficulty of implementation on MIMD computers.

We reject this approach in favor of what we
shall call the little languages approach. (The
term “little language” comes from an article in
which specialized, limited-function languages
were first described as a language class [3].) In
this method, a simple language is defined with
a straightforward model of mapping a program
onto the images to be processed, and then ordinary
language constructs are used to describe
the operations to be performed on the image
elements. This approach leads to a natural programming
style (the programmer is essentially
writing the inner loop of a computation iterated
over the image), introduces no new primitive
operators (except those that may be convenient
due to the specialized nature of the language),
gives the programmer a straightforward understanding
of efficiency issues, and can easily be
implemented on MIMD computers.

Finally,  previous  approaches  in  parallel  lan¬ 
guage  design  for  computer  vision  have  often 
focussed  too  strongly  on  the  capabilities  of  par¬ 
allel  computers  while  ignoring  what  is  needed 
for  successful  computer  vision  programming, 
or  have  concentrated  too  strongly  on  the  needs 
of  computer  vision  while  ignoring  the  restric¬ 
tions  that  parallel  computers  impose  on  the  lan¬ 
guage  designer.  Only  by  integrating  language 
design,  parallel  implementations,  applications 
development,  and  program  library  construction 
can  this  problem  be  overcome. 


2.  Architecture  Independence 

For the purposes of this paper, we limit the
class of machines on which we want to achieve
architecture independence to MIMD computers.
The reason for this is perfectly straightforward:
advances in processor architecture are
being driven at a very high rate for serial processors,
because of the large market for successful
serial processors. These advances
contribute directly to increased performance of
MIMD computers, both in hardware and software;
for example, future iWarp systems will
use standard microprocessor designs together
with an additional communications component [6].
The same is not true for SIMD
designs; advances there are driven only by the
much smaller investment in SIMD computers.

Now  we  can  consider  various  levels  of 
architecture  independence: 

1.  Processor  Architecture.  This  is  the  degree 
of  architecture  independence  provided  by 
conventional  serial  processor  languages.  For 
example,  the  number  of  registers,  the 
presence  or  absence  of  a  math  coprocessor, 
etc.  are  hidden. 

2. Array Size. The size of the processor array
is hidden, but the array topology and the
intercommunication scheme must remain the
same. This kind of independence is
extremely useful (it allows the
development of code on a small array of
nodes, and production runs on large arrays,
as well as upgrades of processor arrays to
larger systems without changing software),
but it does not allow the movement of code
from one architecture to another.

3.  Array  Topology.  This  level  supports 
changes  in  the  shape  of  the  underlying 
processor  array  as  well  as  its  size.  Processes 
can  be  placed  on  whatever  processors  are 
convenient,  and  the  programming  system 
takes  care  of  managing  communication 
using  whatever  array  topology  is  available. 

4. Interprocessor Communication. This
level completely hides interprocessor
communication. For example, Linda [9]
manages interprocessor communication by
supporting a “tuple” model in which
processes place tuples in a common
database and read them from there. The
Uniform System [15] also provides this
level of architecture independence.

5.  Process  Management.  This  is  the  highest 
level  of  architecture  independence.  The 
existence  of  multiple  processes  is  hidden 
from  the  programmer,  as  well  as  all 
communication.  The  program  uses  a  model 
that  does  not  make  any  mention  of  multiple 
processes,  and  the  compiler  takes  care  of 
splitting  the  work  into  multiple  processes 
and allocation of data. This potentially
allows MIMD, serial, and even SIMD
architectures to be programmed efficiently
with the same program.

It should be clear that the image processing
programmer wants the highest degree of architecture
independence described above; anything
less leads to a reduction of the effort
applied to solving image processing problems,
and an increase in the effort in solving the problems
of parallel processing. Solving these problems
may be interesting in its own right, but
to the image processing programmer they are of
no more concern than figuring out how to
manipulate images on a machine whose physical
memory is too small for an image to fit.
When a general solution is found for such a
problem, such as virtual memory, this leads to
increased productivity and a decrease in
machine-specific solutions.

Now,  in  the  case  of  a  general  programming 
language  such  as  C,  it  may  well  prove  impossi¬ 
ble  to  provide  the  programmer  with  a  high 
degree  of  architecture  independence.  But  with 
specialized  languages,  particularly  collection- 
oriented  languages  that  exploit  data  parallel¬ 
ism,  complete  architecture  independence  is  not 
only  achievable,  but  not  difficult  to  achieve. 


Experience  with  the  Apply  language  shows 
just  how  easy  it  can  be  to  achieve  this  degree  of 
architecture  independence.  Apply  is  special¬ 
ized  for  local  image  processing  operations, 
such  as  edge  detection,  convolution,  smooth¬ 
ing,  point  operations,  and  so  on.  Apply  pro¬ 
vides  the  highest  degree  of  architecture 
independence  above.  Now,  over  a  period  of  a 
few  years  Apply  compilers  were  developed  for 
all  of  these  machines: 

• Warp (10 processor one-dimensional
systolic array) [17], iWarp (4-1024
processor two-dimensional systolic array)
[2], FT Warp (two-dimensional systolic
array with fault-tolerance) [11].

• Sun and other UNIX architectures (serial
architecture) [17].

• Carnegie Mellon SLAP (a SIMD machine with
one processor per column in an image, e.g.,
512 processors for a 512 x 512 image) [8].

• University of Massachusetts Image
Understanding Architecture (one processor
per pixel in the largest implementation) [14].

• Meiko Computing Surface (transputer-based,
reconfigurable) [18].

• Hughes HBA, a video bus-based machine
[16, 17].¹

None of these efforts involved more than
two people actively programming the compiler,
or more than a few months of programming
time to get the compiler working efficiently.
The implementations span the gamut of computers
being used for parallel image processing
today, including even SIMD machines. Yet
programs can be run efficiently on any of these
machines, simply by recompiling the Apply
code.

The reason these implementation efforts
have been so successful with so little effort is
the use of the little languages model, as will
now be explained.


3.  Little  Languages 

There are two kinds of specialized languages
for image processing: those that manipulate
the images as a whole, which we call big
languages, and those that map over the image a
serial program that manipulates image elements,
which we call little languages. The term
“big” comes from the large data structures
manipulated by such languages; “little” comes
from the contrast to big, and an article [3] in
which such specialized languages were first
described as a distinct language class.

A  trivial  example  will  illustrate  the  distinc¬ 
tion  between  the  two  approaches.  Suppose  a 
2x2  local  averaging  operation  is  to  be  pro¬ 
grammed.  In  a  big  language,  the  program 
would  perform  these  steps:  shift  the  image  one 
pixel  to  the  right;  add  the  shifted  image  to  the 
original;  shift  this  result  one  pixel  down;  add 
this  to  the  previous  result;  divide  by  four.  In  a 
little  language,  such  as  Apply,  the  programmer 
would  state  in  the  procedure  header  that  a  2x2 
window  was  to  be  taken  on  the  input  image. 
The  program  would  assign  to  the  output  image 
the  sum  of  the  four  pixels  in  this  window, 
divided  by  four. 
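
The contrast can be sketched in ordinary Python, used here only as a stand-in for both language styles. The helper names, the zero border and the nested-list image representation are assumptions for this illustration; neither function reproduces Apply or any particular big language.

```python
def shift(img, dr, dc, fill=0):
    """Shift a whole image down by dr rows and right by dc columns."""
    h, w = len(img), len(img[0])
    return [[img[r - dr][c - dc] if 0 <= r - dr < h and 0 <= c - dc < w else fill
             for c in range(w)] for r in range(h)]


def add(a, b):
    """Add two whole images element by element."""
    return [[x + y for x, y in zip(ra, rb)] for ra, rb in zip(a, b)]


def divide(a, k):
    """Divide every element of an image by k."""
    return [[x / k for x in row] for row in a]


def average_big(img):
    """Big-language style: every step manipulates a complete image."""
    step = add(img, shift(img, 0, 1))     # pixel plus its left neighbour
    step = add(step, shift(step, 1, 0))   # plus the same pair from the row above
    return divide(step, 4)


def average_kernel(window):
    """Little-language style: the body written for one 2 x 2 window."""
    return (window[0][0] + window[0][1] + window[1][0] + window[1][1]) / 4


def map_2x2(img, kernel, border=0):
    """Stand-in for the language runtime: apply `kernel` at every pixel."""
    h, w = len(img), len(img[0])

    def pixel(r, c):
        return img[r][c] if 0 <= r < h and 0 <= c < w else border

    return [[kernel([[pixel(r - 1, c - 1), pixel(r - 1, c)],
                     [pixel(r, c - 1), pixel(r, c)]])
             for c in range(w)] for r in range(h)]
```

Both routes compute the same image. In the second, the programmer writes only `average_kernel`; the iteration in `map_2x2` is the part a little-language compiler is free to distribute across processors.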

Now, big languages are the more popular of
the two approaches. Many big languages for
image processing have been constructed, deriving
their design from image algebra [13], for
example. But in terms of taking advantage of
the capabilities of MIMD parallel computers,
these languages suffer from several defects:

• The programmer is forced to write in an
unfamiliar and awkward programming
style. This is best illustrated by the difficulty
of writing highly conditional local image
processing operators in such languages: for
example, the median filter. The median filter
requires calculating the median locally at
every pixel in the image, an operation that
involves comparison of image values with
neighbors and the use of some local storage.
In a big language, the images must be
shifted and manipulated as a whole, making
it difficult to write code that must take the
separate paths through conditional
expressions that are necessary to calculate
the median.

• Big languages include a lot of primitive
operators, which must be learned. This
imposes a heavy start-up cost for the
programmer. Applications must be
completely reprogrammed as well. Little
language programs are written using
familiar operators, so that learning is easy,
and it is even possible to reuse code from
serial implementations with only minor
changes.

• Big languages can easily be compiled for
SIMD machines, because the tight coupling
between processors makes it possible to
issue each primitive operator as an
instruction sequence to the entire array. This
is not so on MIMD machines; since the
processors there are more loosely coupled, a
similar process would be woefully
inefficient. Instead, a process known as
clustering must be employed [4]. Clustering
combines a sequence of primitive operators
into a single loop. This process requires
careful compile time analysis, including
estimates of the sizes of the objects to which
the primitives are applied. Little languages
can be compiled efficiently through the use
of simple template-based code generation
techniques.

• Clustering requires a lot of information
about the primitives being clustered
together, making it difficult to deal with
procedure call in big languages, especially if
the procedure being called has been
compiled separately. Procedure call is trivial
to deal with in little languages, even to
separately compiled procedures.

procedure egsbl(imagein : in image array (-1..1, -1..1) of byte border 0,
                type : integer,
                imageout : out image byte)
is next
    horz, vert : integer;
begin
    horz := imagein(-1,-1) + 2*imagein(-1, 0) + imagein(-1, 1) -
            imagein( 1,-1) - 2*imagein( 1, 0) - imagein( 1, 1);
    vert := imagein(-1,-1) + 2*imagein( 0,-1) + imagein( 1,-1) -
            imagein(-1, 1) - 2*imagein( 0, 1) - imagein( 1, 1);
    if type = 1 then imageout := sqrt(horz*horz + vert*vert);
    else imageout := abs(horz) + abs(vert);
    end if;
end next;
end egsbl;

Figure 1. Adapt Sobel Operator

procedure histogram(im : in image byte, hist : out array(0..255) of float)
is
    count : array(0..255) of integer;
first begin
    for i in 0..255 loop count(i) := 0; end loop;
end first;
next begin
    count(im) := count(im) + 1;
end next;
combine begin
    for i in 0..255 loop count(i) := count(i) + _count(i); end loop;
end combine;
last
    pixels : integer;
begin
    pixels := 0;
    for i in 0..255 loop pixels := pixels + count(i); end loop;
    for i in 0..255 loop hist(i) := float(count(i))/pixels; end loop;
end last;
end histogram;

Figure 2. Adapt Histogram

•  Little  languages  can  support  programming 
models  especially  designed  for  the 
applications  domain,  while  big  languages 
tend  to  support  only  a  few  uniform  models. 
For  example,  Adapt’s  fundamental 
programming  model  is  that  an  operation  is 
applied  once  per  pixel  in  parallel.  But, 
within  a  row,  previous  results  can  be  reused, 
making  it  easy  to  program  operators  that 
take  advantage  of  locality. 

• The optimizations the big language
compiler must go through in generating
efficient MIMD code make the resulting
program very different from the original big
language program. As a result, the
programmer may not have a clear model of
the efficiency issues involved in the
program; subtle changes in the big language
code can make a big difference in
performance on the MIMD machine. In
contrast, template-based code generation in
little languages gives a program that looks
very much like the original, making the
efficiency issues much easier for the
programmer to deal with.

• For the same reason, designing a debugger
for big language programs on MIMD
computers is quite hard. A little language
debugger would only have to undo some
simple source transformations to present the
programmer with a model of the code being
executed that corresponds exactly to the
little language program originally written.

Let  us  examine  the  process  of  compilation 
of  Adapt  programs  to  understand  how  this  can 
be  done.  An  Adapt  procedure  consists  of  four 
sections: 

First  An  initialization  function,  which 
may  be  run  at  the  beginning  of  any  row  of 
the  image. 

Next  This  section  is  applied  in  raster 
order  across  the  image  (wrapped  around 
at  the  borders  of  the  image).  Each 
execution  of  Next  is  guaranteed  to  be 
preceded  either  by  another  execution  of 
Next  (for  the  previous  pixel),  or  by  First 
at  the  beginning  of  a  row.  Pixels  in  the 
Next  section  are  referenced  relative  to  the 
current  pixel  position,  as  illustrated  by 
the  Sobel  operator  program  in  Figure  1. 

Combine  A merging function, which
combines the outputs of any two image
regions to produce an output for the
concatenation of the two regions. To
make programming easier, Combine will
be applied only to adjacent groups of
consecutive rows of the image. Combine
can reference any variables used
elsewhere in the program, which retain
their values. Its output is also expressed
in terms of these variables. Within the
Combine section, variables preceded by
an underscore refer to the values of
variables in the lower image region, and
variables not preceded by an underscore
refer to values from the upper image
region. The Combine section must
assign to the variables not preceded by an
underscore the correct values for the
merged region.


Last  A  termination  function,  which  is 
applied  once  after  the  output  of  the  entire 
image  is  computed. 

An  Adapt  program  for  image  histogram  is 
shown  in  Figure  2.  The  First  section  (lines  17 
through  19)  zeroes  the  histogram,  the  Next  sec¬ 
tion  (lines  20  through  22)  increments  the  histo¬ 
gram  element  for  the  current  pixel,  the 
Combine  section  (lines  23  through  25)  adds  two 
histograms  together,  and  the  Last  section  (lines 
26  through  32)  divides  the  histogram  by  the 
total  pixel  count  to  create  a  pixel  frequency 
array. 
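The four-section structure of the histogram program can be mirrored outside Adapt. The following is a minimal Python sketch, not Adapt code, assuming 8-bit pixels and images stored as lists of rows. The function names (`first`, `next_pixel`, `combine`, `last`, `region_histogram`, `histogram`) and the two-region split are hypothetical stand-ins for what an Adapt compiler would arrange.

```python
from typing import List

LEVELS = 256  # assumes 8-bit pixels, as in Figure 2


def first() -> List[int]:
    """First: zero the per-region histogram."""
    return [0] * LEVELS


def next_pixel(count: List[int], pixel: int) -> None:
    """Next: increment the bin for the current pixel."""
    count[pixel] += 1


def combine(upper: List[int], lower: List[int]) -> List[int]:
    """Combine: add the histograms of two adjacent row groups."""
    return [u + l for u, l in zip(upper, lower)]


def last(count: List[int]) -> List[float]:
    """Last: divide by the total pixel count to get frequencies."""
    pixels = sum(count)
    return [c / pixels for c in count]


def region_histogram(rows: List[List[int]]) -> List[int]:
    """Run First once, then Next over every pixel of one region."""
    count = first()
    for row in rows:
        for pixel in row:
            next_pixel(count, pixel)
    return count


def histogram(image: List[List[int]], split: int) -> List[float]:
    """Process two row groups independently, then merge and normalize."""
    upper = region_histogram(image[:split])
    lower = region_histogram(image[split:])
    return last(combine(upper, lower))
```

Because Combine only adds counts, the result does not depend on where the image is split, which is the property that lets the regions be processed on separate processors.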

In  order  to  compile  an  Adapt  program  for  a 
MIMD  computer,  the  compiler  must  go 
through  these  steps: 

1.  Distribute  images.  Images  are  divided 
among  processors  in  some  regular  pattern, 
usually  as  swaths  of  rows.  The  images  must 
be  divided  and  distributed  to  processors.  At 
this  level  of  vision,  processing  is  unlikely  to 
vary  significantly  across  the  image,  so  it  is 
usually  sufficient  to  simply  give  each 
processor  an  equal-sized  swath.  But  load 
balancing  techniques  can  be  used;  for 
example,  an  initial  set  of  swaths  can  be  dealt 
out  to  processors,  and  then  new  swaths  can 
be  provided  as  processors  finish  their  work. 
It  is  also  possible  to  distribute  a  few  rows  of 
the  images  to  processors  at  a  time,  which 
will  allow  processors  to  overlap  input, 
output,  and  computation  on  machines  with 
that  capability,  for  example  systolic  arrays. 
Note  that  it  is  also  possible  to  leave  the 
images  distributed  over  the  processor  array, 
so  that  all  that  is  necessary  for  a  processor  to 
get  its  input  images  is  for  that  processor  to 
exchange  a  few  rows  with  its  neighbors,  to 
cover  the  window  of  the  image  being 
processed. 

2.  Distribute  other  inputs.  These  can  be 
distributed  to  processors  via  a  simple 
broadcast  mechanism. 

3.  Map relative addressing onto image buffers.
The  First  and  Next  sections  use  addressing 
relative  to  the  “current”  pixel.  The  compiler 
must  replace  these  relative  addresses  with 
whatever  array  reference  is  needed  to  access 
the  proper  pixels  in  the  image  buffers.  This 
is  essentially  a  macro-expansion. 

4.  Generate  First  section.  The  First  section 
can  be  executed  immediately  once  the 
inputs  for  a  processor  are  available. 

5.  Embed  Next  section.  The  Next  section  must 
be  embedded  within  appropriate  for  loops 
that  will  cause  it  to  be  iterated  over  the 
section  of  the  image  allocated  to  a  processor. 

6.  Collect  images.  The  output  images  are 
distributed  in  the  same  way  as  the  input 
images,  and  must  be  collected  similarly. 

7.  Extract  Combine  variables.  The  compiler 
must  scan  the  Combine  section  to  determine 
which  variables  must  actually  be  passed 
among  processors.  In  theory,  dataflow 
analysis  would  be  required  to  determine 
which  variables  are  live  within  the  Combine 
section,  but  in  practice  this  yields  little 
improvement  in  performance,  since  usually 
the  only  variables  that  are  eliminated  by  this 
technique  are  scalars,  and  scalars  take  up 
little  space  in  a  message  in  any  case.  The 
extracted Combine variables must be packed
within a message, and references to them
must be replaced with references to the
appropriate message component, again via
macro-expansion.

8.  Generate  Combining  tree.  The  pattern  by 
which  a  processor  combines  its  results — 
serially,  in  a  two-dimensional  pattern,  or  in  a 
binary  tree — depends  on  the  message¬ 
passing  capabilities  of  the  underlying 
processor.  This  pattern  is  determined  for  the 
particular  machine  by  the  Adapt  compiler 
developer,  and  the  compiler  simply 


generates  the  appropriate  code  to  pass  the 
necessary  Combine  variables  among 
processors,  then  generates  the  Combine 
section  applied  to  the  two  sets  of  Combine 
variables.  At  the  top  level  of  this  tree,  the 
Combine  variables  are  passed  back  to  the 
calling  processor. 


9.  Generate  Last  section.  The  code  on  the 
calling  processor  executes  the  Last  section, 
applied  to  the  Combine  variables  returned 
by  the  combining  tree. 
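Steps 1 and 8 can be sketched as follows. This is an illustrative Python sketch under stated assumptions, not the output of any Adapt compiler: `split_into_swaths` and `tree_combine` are hypothetical names, the processors are simulated by list entries, and the binary tree is only one of the combining patterns the text mentions.

```python
from typing import Callable, List, Sequence, TypeVar

T = TypeVar("T")


def split_into_swaths(rows: Sequence, n_procs: int) -> List[Sequence]:
    """Step 1: give each processor a nearly equal-sized swath of rows."""
    base, extra = divmod(len(rows), n_procs)
    swaths, start = [], 0
    for p in range(n_procs):
        size = base + (1 if p < extra else 0)
        swaths.append(rows[start:start + size])
        start += size
    return swaths


def tree_combine(partials: List[T], combine: Callable[[T, T], T]) -> T:
    """Step 8: merge per-processor results pairwise in a binary tree.

    Only neighbouring partial results are merged, matching the rule that
    Combine is applied to adjacent groups of consecutive rows.
    """
    level = list(partials)
    while len(level) > 1:
        merged = [combine(level[i], level[i + 1])
                  for i in range(0, len(level) - 1, 2)]
        if len(level) % 2:  # the odd one out moves up a level unchanged
            merged.append(level[-1])
        level = merged
    return level[0]
```

For example, `tree_combine([sum(s) for s in split_into_swaths(list(range(10)), 4)], lambda a, b: a + b)` sums ten values across four simulated processors. The final value returned by `tree_combine` corresponds to the Combine variables handed to the Last section in step 9.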


Thus,  the  process  of  compiling  an  Adapt 
program  onto  a  target  machine  comes  down  to 
choosing  an  appropriate  I/O  model,  and  mak¬ 
ing  the  appropriate  macro-substitutions  in 
Adapt  programs  to  translate  the  code  into  the 
target  language  with  image  and  combine  vari¬ 
ables  appropriately  changed.  As  a  result,  given 
an  Adapt  parser,  generating  an  Adapt  compiler 
is  usually  a  matter  of  a  few  weeks  of  program¬ 
ming. 


The  resulting  programs  are  efficient, 
because  the  I/O  model  is  designed  for  the  par¬ 
ticular  target  machine,  and  because  the  com¬ 
piler  makes  so  few  modifications  to  the  user’s 
program.  Experience  with  serial  languages  has 
given  the  user  a  good  idea  of  what  is  necessary 
to  design  code  efficiently,  and  this  experience  is 
directly  useful  in  Adapt. 


Nothing  in  this  simple  compilation  step 
requires  any  special  processing  for  procedure 
call  within  Adapt  programs;  even  procedures  of 
foreign  languages  can  be  called,  so  long  as  their 
argument  passing  conventions  are  compatible 
with  Adapt.  For  example,  the  iWarp  Adapt 
implementation  uses  the  standard  Unix  math 
library  without  change,  simply  by  providing  an 
Adapt  procedure  header  for  each  Unix  math 
routine  in  an  Adapt  header  file. 

4.  Integrated Language Development

The  Warp  project  showed  the  value  of  inte¬ 
grated  architecture,  applications,  and  software 
development  [7]  .  Integrated  development 
helps  ensure  that  the  system  developed  solves 
problems  of  importance  to  the  application, 
exposes  new  and  interesting  problems  to  be 
solved,  and  helps  ensure  that  the  system  devel¬ 
oped  will  actually  be  used  in  the  targeted  appli¬ 
cation. 

We have followed this model in the development
of the Adapt language, by simultaneously
addressing the issues of algorithm, architecture,
applications, and libraries. From the earliest
days of design of the language, we have
studied important algorithms such as connected
components, Hough transform, image warping,
and run-length encoding [21, 20, 10] to make
sure that the Adapt model can be used to implement
such algorithms. In cases where important
image processing algorithms cannot be
implemented in the Adapt model, such as
error-diffusion halftoning, we have devised new
algorithms [19] that can be implemented in
Adapt.

In  addressing  architectural  considerations, 
we  have  implemented  the  language  on  a  variety 
of  computers.  Adapt  implementations  exist  for 
Unix/C architectures, the Carnegie Mellon
Nectar computer [1], the Carnegie Mellon
Warp computer, and the Intel-Carnegie Mellon
iWarp computer. Each of these designs has
exploited  features  specialized  to  the  particular 
architecture.  For  example,  the  Nectar  imple¬ 
mentation  does  automatic  load  balancing  by 
allocating  a  small  slice  to  each  processor,  then 
sending  more  as  processors  complete.  Lightly 
loaded  processors  will  end  up  doing  more  work 
in  this  model,  and  heavily  loaded  processors 
will  do  only  the  initial  small  slice.  On  Warp, 
there  are  two  separate  implementations:  one 
which  partitioned  the  image  by  rows,  and 
another  which  partitioned  the  image  by  col¬ 
umns.  We  found  that  the  row-partitioned 
method  was  much  faster,  because  of  reduced 
coupling  between  processors  and  reduced  I/O 


when  sending  intermediate  results.  The  iWarp 
implementation  exploits  iWarp’s  logical  path¬ 
ways  when  constructing  the  combining  net¬ 
work;  even  though  iWarp  is  physically  a  two- 
dimensional  mesh,  intermediate  results  are 
combined  in  a  binary  tree. 
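The load balancing scheme described for Nectar can be illustrated with a small simulation. This is a hedged Python sketch, not the Nectar implementation: `deal_slices` and the per-processor `cost_per_slice` values are hypothetical, and processor load is modelled simply as a fixed time per slice.

```python
import heapq
from typing import List


def deal_slices(n_slices: int, cost_per_slice: List[float]) -> List[int]:
    """Simulate dealing slices to processors as they finish.

    Each processor receives one slice up front and a new one whenever it
    completes its current slice. cost_per_slice[p] is the assumed time
    processor p needs per slice (larger means more heavily loaded).
    Returns the number of slices completed by each processor.
    """
    done = [0] * len(cost_per_slice)
    busy = []  # heap of (finish_time, processor) for slices in progress
    remaining = n_slices
    for p, cost in enumerate(cost_per_slice):
        if remaining == 0:
            break
        heapq.heappush(busy, (cost, p))
        remaining -= 1
    while busy:
        finish, p = heapq.heappop(busy)
        done[p] += 1
        if remaining > 0:  # hand the next slice to whoever just finished
            heapq.heappush(busy, (finish + cost_per_slice[p], p))
            remaining -= 1
    return done
```

With one lightly loaded and one heavily loaded processor, for instance `deal_slices(10, [1.0, 100.0])`, the lightly loaded processor ends up with nine slices and the heavily loaded one completes only its initial slice, which is the behaviour the text describes.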

We also have an implementation on the Intel
Touchstone computer underway. This will take
advantage of the Mach operating system on
the Touchstone, which is being implemented at
Carnegie Mellon. Images will be stored distributed
among processors, and processors will get
the data they need for computing their local
result by automatic message passing generated
through access to memory locations actually
stored in other processors.

The Sun implementation of Adapt has
proved particularly successful for code development.
The Adapt Sun implementation is not
an Adapt simulation; Adapt produces good
code, comparable to the best hand-written code
for serial image processing. Adapt programs on
the Sun can be debugged completely, then only
a simple recompilation step is necessary to port
the code to iWarp. This greatly speeds up code
development, as well as reducing the load
on the iWarp for debugging. Moreover, there is
a “parallel simulation” mode in the Sun Adapt
compiler that allows the user to run the program
as if it were being run on a parallel processor
(at a small loss of efficiency). This allows
the programmer to uncover bugs in the Combine
section of the Adapt code on the Sun.

Applications have also been studied in the
Adapt language from the beginning. The parallel
halftoning algorithm mentioned earlier
yields high-quality halftoned images (higher
quality than error-diffusion methods), and this
has been exploited in a program for halftoning
portrait images for display on monochrome
bit-mapped displays. We have done work on
image spectrogram computation and feature
extraction modules for an object recognition
system. Currently in progress are implementations
of the second DARPA image understanding
benchmark, research on magnetic
resonance image reconstruction, development
of a stereo vision algorithm, and programming
the Joint Photographic Experts Group (JPEG)
image compression algorithm in Adapt.

The Adapt library development has proceeded
in two stages. First, the existing WEB
library (a library of local image processing
operations for the Apply language) has been
translated to Adapt and has been verified to
work with Adapt. Second, a more ambitious
library development effort is underway, based
on the ANSI/ISO Programmer’s Imaging Kernel
System (PIKS).

The  PIKS  library  design  is  a  result  of  meet¬ 
ings  of  the  ANSI  X3H3.8  Imaging  Applica¬ 
tions  Program  Interface  task  group,  a  group  of 
engineers  and  researchers  largely  from  indus¬ 
try,  with  representation  from  Sun,  Mitre,  Wang, 
Kodak,  Datacube,  and  so  on.  The  PIKS  library 
has  been  proposed  for  ISO  status,  and  current 
plans  are  for  the  library  to  be  in  nearly  final 
form  at  the  beginning  of  1992. 

The library includes about 200 different
operators, covering these application areas [5]:

•  primitive  image  manipulation 

•  image  enhancement 

•  image  restoration 

•  image  analysis 

•  image  classification  (basic) 

•  image  visualization  (basic) 

•  standard  color  models 

•  image  transport 

•  image  compression  and  decompression 

The  library  will  be  implemented  in  two 
stages.  Initially,  we  will  concentrate  on  the 
PIKS  Core  routines,  which  are  the  routines  that 
cover  most  of  the  functionality  in  PIKS.  These 
are  the  following  [12] : 


•  Analysis  operators:  accumulator, 
amplitude  projection,  extrema,  distance, 
feature  list,  line  profile,  and  window 
statistics. 

•  Color  processing  operators:  additive  color 
conversion,  color  lookup  table,  subtractive 
color  conversion,  trichromatic  conversion. 

•  Complex  image  operators:  complex 
conjugate,  complex  magnitude,  complex-to- 
polar,  and  polar-to-complex. 

•  Ensemble  operators:  dyadic. 

•  Filtering  operators:  convolve,  two- 
dimensional,  linear  filtering,  max,  min,  and 
median  filtering. 

•  Geometric  operators:  flip,  spin,  transpose, 
resize,  rotate,  subsample,  translate,  warp 
(lookup  table  and  polynomial). 

•  Histogram  operators:  histogram,  one¬ 
dimensional. 

•  Morphological  operators:  erosion  and 
dilation  (conditional  and  unconditional),  fill 
contour,  hit  or  miss  transformation,  and 
morphic. 

•  Point  operators:  bit  shift,  lookup  table, 
monadic,  threshold,  unary,  and  window- 
level. 

•  Presentation  operators:  dither. 

•  Unitary  transform  operators:  Fourier 
transform. 

•  3D  specific  operators:  3D  slice. 

Following the implementation of the PIKS
core, we will move on to implement the complete
PIKS library.

The  library  implementation  effort  serves 
three  goals  in  our  language  development  effort: 

1.  Completeness.  By implementing a large
library  of  routines  that  covers  so  much  of 
image  processing,  we  will  ensure  that  the 
Adapt  language  can  actually  be  used  for 
development  of  practical  image  processing 
systems. 

2.  Standardization.  The  PIKS  library 
promises  to  be  developed  and  used  by  a 
wide  variety  of  image  processing  software 
and  hardware  development  companies.  By 
implementing  it,  we  ensure 
interchangeability  of  code  with  applications 
developed  on  other  platforms. 


3.  Software base.  The PIKS library will serve
as a basic set of routines for the Adapt
programmer to use. Code from the library
can be used as a model for development of
new routines of similar function.


5.  Summary

Reliable parallel hardware is available for
image processing, but the full potential of this
hardware has not yet been realized. Software
has been a bottleneck. We trace this problem to
three sources:

1.  Lack of complete architecture
independence. Anything less than complete
hiding of the underlying hardware leads to
the introduction of programming issues that
are not relevant to the concerns of the image
processing programmer, and which limit
the program to running on a particular
architecture.

2.  Big languages. Big languages, which
manipulate images as a whole, are
appropriate for SIMD computers. But they
impose an unnatural style on the image
processing program, especially when
dealing with conditional neighborhood
operations such as median filter. And it is
difficult to compile these languages for
MIMD computers, since multiple primitive
operations must be clustered together for
efficiency.

3.  Non-integrated language development.
Language development should benefit from
the beginning from algorithm, architecture,
application, and library experience. Without
this, some important algorithms may not be
implementable in the language, or may be
inefficient on an important architecture
class, or the language may miss important
functionality for the development of working systems.


The Adapt language and its associated PIKS
library seek to overcome these problems. Adapt
offers complete architectural independence. It
is a little language, which provides a natural
model to the programmer that is easy to compile
for MIMD machines. And the language
development effort is integrated, combining
algorithm, architecture, application, and library
development.

Abstract

In  this  paper  we  describe  a  parallel  implemen¬ 
tation  of  the  linear  feature  extraction  process. 
Linear  feature  extraction  is  a  fundamental  step 
in  many  computer  vision  systems.  It  is  the 
step  that  “bridges  the  gap”  between  iconic  and 
symbolic  processing.  The  task  is  inherently 
heterogeneous  in  that  it  comprises  several  pro¬ 
cessing  steps  with  varying  degrees  of  algorithm 
complexity and multiple data structures. Such
tasks are common among computer vision systems
but, unfortunately, do not lend themselves
to straightforward parallel implementations.

We  show  that  implementation  designs  must  not 
only  address  the  algorithmic  requirements,  but 
also  the  algorithm-to-algorithm  interfaces. 

1  Introduction 

Linear feature extraction (extraction of curves and possible
approximation with linear line segments) is a basic
operation  in  most  computer  vision  systems.  This  is  nor¬ 
mally  thought  of  as  a  “low-level”  operation,  easily  im¬ 
plemented  in  parallel  on  virtually  any  parallel  machine. 
However,  this  is  not  entirely  true.  Linear  feature  extrac¬ 
tion  is  more  appropriately  viewed  as  a  bridge  between 
low  and  mid-level  processing.  The  input  to  this  pro¬ 
cess  is  an  image,  but  the  output  is  a  symbolic  structure. 
We were surprised by the paucity of parallel implementation
studies that actually aim at outputting the symbolic structure.
This paper is an effort to correct this
situation.

The process of linear feature extraction requires several
steps. Typical steps are edge detection, edge linking
to yield contours, and approximation of the contour by
straight lines (or splines). For some of these steps, the
processing is iconic, i.e. the input and output are both
in the image form. Further, the processing is rather local.
In these steps, parallelization is easily achieved on
almost  any  parallel  machine.  However,  at  some  stage, 
the  information  is  converted  to  a  list  form,  such  as  the 
list  of  edges  forming  the  contours,  or  the  list  of  straight 
lines  that  approximate  the  contours.  Here,  we  have  a  ba¬ 
sic  transformation  of  data  structures  from  the  iconic  to 
the  symbolic.  It  is  this  transformation  that  is  the  source 
of  difficulty  in  parallelizing  linear  feature  extraction  and 
is  explored  in  detail  in  this  paper. 


We consider a specific algorithm in our study [Nevatia
and Babu, 1980]; however, our method should apply to
any other algorithm that follows similar steps. That
is, various algorithms have been developed to perform
the constituent processing steps. For example, Canny
[Canny, 1986] and Marr/Hildreth [Marr and Hildreth,
1980] describe alternative edge detection schemes and
Canny [Canny, 1983] describes an alternative approach
to contour extraction, but these approaches do not alter
the overall structure of the linear feature extraction task.
Furthermore, the basic operations employed by the alternative
approaches are similar, e.g. convolutional processing
for edge detection. Therefore, we claim that our
results are generalizable.

Various researchers have presented parallel implementations
of stand-alone edge detection algorithms [Little
et al., 1987] [Lee and Aggarwal, 1987] [Weems et al.,
1991] while others have concerned themselves with “line
finders” that identify the presence of a line, via a Hough
transform, for instance, but do not actually extract the
line [Weems et al., 1991] [Guerra and Hambrusch, 1989].
Few have investigated parallel implementation of the entire
linear feature extraction process. In [Shu et al., 1990]
the authors present a fine grain parallel implementation
of the entire process on the Image Understanding Architecture
[Weems and Levitan, 1987]. The implementation
utilizes custom designed hardware and, therefore, is
unintuitive and somewhat difficult to understand. Thus,
it will most likely be difficult to implement (code and
debug) and maintain (modify).

In [Vaillant et al., 1989] the authors present a coarse
grain parallel implementation using a small number of
powerful MIMD processing elements. The algorithm is
of a general nature and does not include extensive details
specific to the architecture except for interprocess
communication. That is, it agrees with one’s intuition.
A drawback of the implementation is that it does not
achieve the full potential algorithm speedup available in
the low-level operations, which are inherently fine grain
and map well to a massively parallel SIMD machine.

Our  approach  to  developing  a  parallel  implementation 
of  the  linear  feature  extraction  process  is  to  utilize  a 
heterogeneous  pyramid  architecture  that  comprises  both 
a  SIMD  (fine  grain)  mesh  connected  section  and  a  MIMD 
(coarse  grain)  ring  connected  section.  This  is  similar 
to  the  scheme  incorporated  in  the  Image  Understanding 
Architecture  with  two  exceptions:  1)  we  do  not  use  the 
custom  gated-connection  network;  and  2)  we  use  only 
a ring topology in the MIMD level. The heterogeneous
architecture is depicted in figure 1. It is, in a sense, a
subset of the IUA.


Figure  1:  Heterogeneous  architecture  for  the  linear  fea¬ 
ture  extraction  algorithm. 

We  justify  the  use  of  this  architecture  by  analyzing 
the  system  in  terms  of  the  methodology  described  in 
[Reinhart  and  Nevatia,  1990]  which  allows  a  designer  to 
match parallel architectures to the inherent parallelism
of  the  algorithm. 

In  the  following  sections  we  describe  the  linear  feature 
extraction  process.  We  briefly  describe  the  algorithm 
driven  methodology  used  to  analyze  the  process  and  jus¬ 
tify  the  heterogeneous  architecture.  We  analyze  each 
of  the  individual  algorithms  that  constitute  the  process 
and  discuss  how  those  individual  results  fit  into  the  pro¬ 
posed  parallel  architecture.  Finally,  we  present  results 
of  simulation  of  our  implementation  and  summarize  the 
work. 

3  Process  Description 

The  objective  of  the  linear  feature  extraction  process  is 
to  extract  linear  segments  from  the  input  image.  Input 
to  the  process  is  the  2-D  image  array  of  pixels  and  output 
is  a  list  of  linear  segments  with  attributes  of  end-point  lo¬ 
cations,  length,  orientation,  and  contrast.  The  approach 
employed  is  one  of  multiple  (heterogeneous)  steps  includ¬ 
ing  edge  detection,  edge  thinning,  edge  linking,  contour 
extraction,  and  piecewise  linear  approximation.  Details 
of a typical entire process can be found in [Nevatia and
Babu, 1980]. Briefly stated, the following steps are performed:


•  Edge  detection  -  convolve  the  input  image  with  a 
kernel  or  set  of  kernel  masks. 

•  Edge thinning - compare each edge to its neighbors
(in  the  directions  orthogonal  to  the  edge’s  orienta¬ 
tion)  and  retain  the  edge  if  its  magnitude  is  greater 
than  the  magnitudes  of  its  neighbors  and  greater 
than  a  fixed  threshold. 

•  Edge  linking  -  compare  each  edge  to  its  neighbors 
(in  the  direction  of  the  edge’s  orientation)  and  form 
a  link  to  the  neighbors  if  they  are  of  similar  orien¬ 
tation. 

•  Contour extraction - extract (from the 2-D image
array) the edge locations ((x, y) coordinates) and
save them in an ordered (contour) list.

•  Linear approximation - join the end points of the
contour list with a single line approximation, mark
the point of maximum error for this approximation
(creating two approximating line segments), iterate
the process on the two new segments until the maximum
error is within an acceptable bound.

The  processing  steps  and  resultant  data  representations 
are  depicted  in  figure  2. 
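The linear approximation step lends itself to a short worked example. The following Python sketch is one possible reading of the splitting procedure described above, written recursively rather than iteratively; the names `point_line_distance` and `approximate`, the tuple representation of points, and the use of perpendicular distance as the error measure are assumptions, not details taken from [Nevatia and Babu, 1980].

```python
import math
from typing import List, Tuple

Point = Tuple[float, float]


def point_line_distance(p: Point, a: Point, b: Point) -> float:
    """Perpendicular distance from p to the line through a and b."""
    (px, py), (ax, ay), (bx, by) = p, a, b
    length = math.hypot(bx - ax, by - ay)
    if length == 0.0:  # degenerate chord: fall back to point distance
        return math.hypot(px - ax, py - ay)
    return abs((bx - ax) * (ay - py) - (ax - px) * (by - ay)) / length


def approximate(contour: List[Point], max_error: float) -> List[Point]:
    """Return the break points of a piecewise linear fit to the contour."""
    if len(contour) < 3:
        return list(contour)
    a, b = contour[0], contour[-1]
    # Error of every interior point against the single-line approximation.
    errors = [point_line_distance(p, a, b) for p in contour[1:-1]]
    worst = max(range(len(errors)), key=errors.__getitem__) + 1
    if errors[worst - 1] <= max_error:
        return [a, b]
    # Split at the point of maximum error and repeat on both halves.
    left = approximate(contour[:worst + 1], max_error)
    right = approximate(contour[worst:], max_error)
    return left[:-1] + right  # drop the duplicated split point
```

For an L-shaped contour such as `[(0, 0), (1, 0), (2, 0), (2, 1), (2, 2)]` with a small error bound, the sketch splits once at the corner and returns the two end points plus the corner.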

Figure 2: Processing steps and data representations of
the  linear  feature  extraction  process. 

Of  primary  interest  is  the  contour  extraction  step.  It  is 
this  step  that  actually  bridges  the  gap  between  the  iconic 
(pixel)  representation  and  the  symbolic  (linear  segment) 
representation.  Prior  to  this  step,  data  is  represented 
as  numbers  in  a  2-D  array.  Following  this  step,  data 
is  represented  as  (x,y)  coordinates  in  a  linked-list.  We 
show,  in  the  following  sections,  that  this  is  also  the  step 
that  poses  the  greatest  challenge  in  developing  a  parallel 
implementation. 

4  The  Parallel  Implementation 

4.1  Overview 

The methodology we employ for justifying our choice of
a heterogeneous pyramid architecture for the linear feature
extraction process is algorithm driven [Reinhart and
Nevatia, 1990]. We analyze each algorithm, identifying
its inherent parallel characteristics, then show how those
characteristics map onto the pyramid. Our analysis proceeds
in four steps:

•  Control  Structure  Analysis  -  identify  the  indepen¬ 
dent  processes  that  constitute  the  algorithm. 

•  Data  Structure  Analysis  -  determine  the  data  re¬ 
quirements  for  each  of  the  independent  processes. 

•  Communication  Analysis  -  determine  the  communi¬ 
cation  requirements  between  the  independent  pro¬ 
cesses. 

•  Architecture Specification - describe a parallel processor
architecture that is well suited to the processing
requirements determined via the previous steps.

Algorithm              Complexity   Comments
Edge Detection         O(i^2)       i: i x i pixel image
Edge Thinning          O(i^2)       i: i x i pixel image
Edge Linking                        i: i x i pixel image
Contour Extraction                  i: i x i pixel image; l: average contour length
Linear Approximation                l: average contour length; n: number of contours

Table 1: Time complexities of the steps that constitute
the linear feature extraction process.

In  using  such  an  approach  our  goal  is  to  show,  analyt¬ 
ically,  that  the  implementation  will  achieve  significant 
algorithm  speedup  and  processor  efficiency  through  al¬ 
gorithms  and  software  that  reflect  the  structure  of  the 
“typical”  solution  to  the  problem  of  linear  feature  ex¬ 
traction.  We  want  to  avoid  the  use  of  custom  hardware 
and  unintuitive  algorithm  designs  as  these  tend  to  be 
costly  over  the  life  time  of  the  system. 

4.2  Control  Structure  Analysis 

In  the  control  structure  analysis  step  our  objective  is  to 
identify  the  primary  sources  of  time  complexity  within 
the  algorithm  and  determine  which  processes  that  con¬ 
stitute  those  sources  can  be  performed  in  parallel. 

Table  1  shows  the  time  complexities  of  each  of  the 
constituent  algorithms  for  the  linear  feature  extraction 
process.  The  time  complexities  are  predicated  on  the  fol¬ 
lowing  concepts.  The  edge  detection,  edge  thinning,  and 
edge  linking  steps  require  every  pixel  in  the  image  to  be 
processed  in  a  uniform  manner.  Edge  detection  requires 
a  convolution  with  a  fixed  size  kernel,  edge  thinning  and 
linking  require  comparisons  of  an  edge  with  each  of  its 
neighbors.  The  contour  extraction  step  requires  a  scan 
of  the  image  plane  to  detect  contour  start  points  followed 
by  a  traversal  of  each  contour  from  its  start  to  its  end 
point.  The  linear  approximation  step  requires  traversing 
the  length  of  each  contour  detecting  points  of  inflection. 

The  primary  control  structures  for  each  processing 
step  are  presented  in  the  following  pseudo  code. 

(edge detection)
for each pixel
    convolve with kernel

(edge thinning)
for each edge
    compare neighbor edges and retain
    if it is of greater magnitude

(edge linking)
for each edge
    link to neighbor edges of similar
    orientation to form contours

(contour extraction)
for each contour
    extract constituent edge coord
    to form contour lists

(linear approximation)
for each contour list
    fit piece-wise linear segments

In  the  edge  detection,  edge  thinning,  and  edge  linking 
steps  each  image  pixel  location  can  be  processed  inde¬ 
pendently  and  synchronously,  that  is,  the  same  process¬ 
ing  steps  are  applied  to  each  location.  The  nature  of 
the  parallelism  is  fine  grain.  In  the  contour  extraction 
and  linear  approximation  steps,  each  contour  can  be  pro¬ 
cessed  independently  but  processing  is  data  dependent 
and  therefore,  processed  asynchronously.  Here,  the  par¬ 
allelism  is  coarse  grain. 

4.3  Data  Structure  Analysis 

In  the  data  structure  analysis  step  our  objective  is  to 
identify  the  input  data  requirements  of  each  of  the  in¬ 
dependent  processes  identified  by  the  control  structure 
analysis  and  to  define  an  appropriate  partitioning  of  the 
primary  data  structures. 

The  input  data  structure  to  the  linear  feature  extrac¬ 
tion  algorithm  is  the  2-D  image  array.  This  is  the  pri¬ 
mary  data  structure  of  the  edge  detection,  edge  thinning, 
and  edge  linking  steps.  If  each  pixel  location  is  to  be  pro¬ 
cessed  independently,  then  each  process  requires  access 
to  the  values  of  its  neighboring  locations. 

The  input  data  structure  to  the  contour  extraction 
step  is  the  2-D  array  containing  the  linked  edges.  The 
resultant data structure is multiple linked-lists of (x, y)
coordinates,  one  linked-list  for  each  contour.  If  each  con¬ 
tour  is  to  be  processed  (extracted)  independently,  then 
each  process  requires  access  to  the  entire  2-D  array  since 
a  single  contour  may  traverse  any  part  of  the  array. 
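
The iconic-to-symbolic step described above can be sketched in Python as follows. The encoding of the linked-edge array is an assumption made for illustration: each cell is `None` (no edge), the coordinate of its successor edge, or the hypothetical marker `END` for the last edge of a contour. Python lists stand in for the linked-lists, and contours are assumed to be open, well-formed chains.

```python
END = (-1, -1)  # hypothetical marker: an edge pixel with no successor


def extract_contours(linked):
    """Scan the 2-D array for start points and traverse each contour.

    linked[y][x] is None, END, or the (x, y) coordinate of the next edge.
    Returns one list of (x, y) coordinates per contour.
    """
    rows, cols = len(linked), len(linked[0])
    # An edge pixel is a start point if no other edge links to it.
    has_pred = {cell for row in linked for cell in row
                if cell is not None and cell != END}
    contours = []
    for y in range(rows):
        for x in range(cols):
            if linked[y][x] is None or (x, y) in has_pred:
                continue
            contour, cur = [], (x, y)
            while cur != END:
                contour.append(cur)
                cx, cy = cur
                cur = linked[cy][cx]  # may reach any part of the array
            contours.append(contour)
    return contours
```

The traversal follows successors wherever they lead, which is why a process extracting one contour needs access to the whole array rather than to a fixed neighborhood.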

The linear approximation step accepts, as input, the
linked-lists of (x, y) coordinates, the contours. To process
each contour independently, a process needs access only
to its own contour. Output from the linear approximation
step is a list of linear segments represented
by their end points, orientation, contrast, and length.
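
As an illustration of a piece-wise linear fit that needs only its own contour, the sketch below uses a recursive split on the point farthest from the chord (often called iterative end-point fitting). This is a substitute technique chosen for brevity; it is not claimed to be the inflection-point method referred to in the text. The names `fit_segments`, `point_line_distance`, and the tolerance `tol` are assumptions.

```python
import math


def point_line_distance(p, a, b):
    """Perpendicular distance from point p to the line through a and b."""
    (px, py), (ax, ay), (bx, by) = p, a, b
    dx, dy = bx - ax, by - ay
    norm = math.hypot(dx, dy)
    if norm == 0:
        return math.hypot(px - ax, py - ay)
    return abs(dx * (py - ay) - dy * (px - ax)) / norm


def fit_segments(contour, tol=1.0):
    """Return the breakpoints of a piece-wise linear fit within `tol` pixels."""
    if len(contour) < 3:
        return list(contour)
    first, last = contour[0], contour[-1]
    dists = [point_line_distance(p, first, last) for p in contour[1:-1]]
    worst = max(range(len(dists)), key=dists.__getitem__)
    if dists[worst] <= tol:
        return [first, last]          # one segment is good enough
    split = worst + 1                 # index of the farthest point in contour
    left = fit_segments(contour[: split + 1], tol)
    right = fit_segments(contour[split:], tol)
    return left[:-1] + right          # drop the duplicated split point
```

Consecutive breakpoints define the linear segments; attributes such as orientation and length follow directly from each pair of end points.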

Figure  3  depicts  the  two  primary  data  structures  used 
by  the  linear  feature  extraction  process. 




Figure  3:  Iconic  and  symbolic  data  representations 
(structures)  used  in  the  linear  feature  extraction  process. 


4.4  Communication  Analysis 

In  the  communication  analysis  step  our  objective  is  to 
identify the communication requirements among the independent
processes determined by the control structure
analysis. We attempt to generalize the requirements into
classes such as point-to-point send/receive, broadcast,
reduction, shuffle, local, global ...

Communication  among  independent  processes  of  the 
edge  detection,  edge  thinning,  and  edge  linking  steps  is 
restricted  to  local  neighborhoods.  Each  process  must  ob¬ 
tain  the  pixel  value  of  its  neighbors  in  order  to  complete 
its  computation.  For  the  edge  detection  step  the  size  of 
the  neighborhood  is  dependent  on  the  size  of  the  con¬ 
volution  kernel,  e.g.  5  x  5  for  the  Nevatia-Babu  linear 
feature  extractor  [Nevatia  and  Babu,  1980],  selectable 
for  the  Canny  edge  detector  [Canny,  1986].  For  the  edge 
thinning  and  edge  linking  steps  the  neighborhood  is  3x3. 
Messages  are  single  valued  and  communication  can  pro¬ 
ceed  synchronously  among  all  processes. 

For  the  contour  extraction  step,  no  communication 
among  processes  is  required  but,  recall  that  access  to 
the  entire  2-D  array  is  required  since  contours  are  of 
“unlimited”  extent  within  the  image  plane.  As  we  shall 
see  later,  when  we  specify  the  parallel  architecture  for 
the  entire  process,  this  situation  will  change.  Partition¬ 
ing  of  the  low-level  algorithms  (edge  detection,  thinning, 
and  linking)  will  have  adverse  effects  on  the  parallel  im¬ 
plementation  of  the  contour  extraction  algorithm. 

For  the  linear  approximation  step,  no  communication 
among  processes  is  required.  Each  contour  can  be  pro¬ 
cessed  independently  of  all  others. 

4.5  Architecture  Specification 

In  the  architecture  specification  step  our  objective  is  to 
map  each  of  the  processing  steps  to  a  portion  of  the  het¬ 
erogeneous  architecture.  To  perform  this  mapping,  we 
use  the  parameters  of  each  of  the  sections  of  the  archi¬ 
tecture  such  as  processing  protocol,  processing  element 
type,  processing  element  coupling,  processor  homogene¬ 
ity,  processor  synchronicity,  and  communication  network 
topology.  At  this  point  the  mappings  for  each  of  the 
individual  algorithms  that  constitute  the  linear  feature 
extraction  process  will  be  independent.  Later  we  will 
tackle  the  job  of  interfacing  the  various  algorithms. 

Given  the  fine  granularity,  the  local  neighborhood 
communication  requirements,  the  synchronous  nature  of 
the  communications,  the  simplicity  of  the  required  pro¬ 
cessing,  and  the  data  independence  of  the  algorithms, 
the  edge  detection,  edge  thinning,  and  edge  linking  steps 
map well to the SIMD mesh connected section of the architecture.
This concurs with the findings of other researchers
[Lee and Aggarwal, 1987] [Little et al., 1987].
Contrary to this approach is the one used in [Vaillant
et al., 1989] where the algorithms are mapped onto a
MIMD based architecture. The authors found that the
edge linking process presented a processing bottleneck.

For  the  contour  extraction  and  linear  approximation 
steps,  the  parallelism  is  coarse  grain,  the  algorithms 
are  data  dependent,  comprise  relatively  simple  opera¬ 
tions,  and  do  not  require  inter-process  communication 
and  therefore,  map  well  to  the  MIMD  section.  As  there 
is  no  required  interprocess  communication,  any  topology 
will  suffice.  The  specification  of  a  ring  topology  will  be 
justified  later.  This  concurs  with  the  implementation 


described in [Vaillant et al., 1989] in which the authors
achieved good performance with an intuitive implementation.
Contrary to this approach is that of [Shu et al.,
1990] where the authors map these algorithms onto the
SIMD section of the IUA. Good performance is achieved
but only at the cost of custom hardware and complex
software.

5  Process  Interfacing 

In  utilizing  a  heterogeneous  approach  to  the  parallel  im¬ 
plementation  of  a  process  one  must  be  cognizant  of  the 
interface  between  the  two  distinct  architectures.  For  the 
linear  feature  extraction  process  it  is  at  this  interface 
where  data  is  transitioned  from  the  iconic  representation, 
the  2-D  array  of  pixel  values,  to  the  symbolic  representa¬ 
tion,  the  linked-lists  of  contours.  The  contour  extraction 
algorithm  performs  this  transition. 

The  number  of  contours  detected  in  a  scene  will  be  sig¬ 
nificantly  less  than  the  number  of  edges  detected.  This 
implies  that  the  number  of  processing  elements  that  can 
be  effectively  utilized  by  the  contour  extraction  and  lin¬ 
ear  approximation  algorithms  will  be  significantly  less 
than  that  used  by  the  edge  detection,  edge  thinning,  and 
edge  linking  algorithms.  Therefore,  a  pyramid  structure 
between  the  two  architectures  is  appropriate.  This  is 
also  the  scheme  used  in  the  Image  Understanding  Ar¬ 
chitecture  to  interface  the  CAAPP  (SIMD)  and  ICAP 
(MIMD)  layers.  Each  MIMD  processing  element  is  di¬ 
rectly  connected  to  a  group  of  SIMD  PEs,  and  the  groups 
are  non-overlapping. 
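
The non-overlapping grouping can be sketched as a simple index mapping. The following Python fragment assumes a square SIMD mesh and a square arrangement of MIMD PEs (for example 8 x 8 = 64), with the mesh side divisible by the number of MIMD PEs per side; the function name `mimd_owner` and this layout are illustrative assumptions, not a description of the IUA hardware.

```python
def mimd_owner(x, y, mesh_size, mimd_per_side):
    """Index of the MIMD PE whose group contains SIMD PE (x, y).

    Assumes mesh_size is a multiple of mimd_per_side, so that the groups
    are equal-sized, non-overlapping square blocks.
    """
    block = mesh_size // mimd_per_side
    return (y // block) * mimd_per_side + (x // block)
```

With a 256 x 256 mesh and 8 MIMD PEs per side, every MIMD PE is associated with one 32 x 32 block of SIMD PEs, and no SIMD PE belongs to two groups.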

But,  with  this  arrangement  a  single  contour  may  be 
distributed  across  many  groups  of  SIMD  PEs,  or  equiv¬ 
alently,  across  many  MIMD  PEs.  Therefore,  after  each 
MIMD  processing  element  performs  the  contour  extrac¬ 
tion  algorithm  on  its  partition  (data  residing  in  its  as¬ 
sociated  SIMD  PEs),  and  before  each  performs  the  lin¬ 
ear  approximation  algorithm  on  its  set  of  contours,  an 
additional  step,  contour  merging,  must  be  performed. 
Since the distribution of partial contours is arbitrary, the
MIMD  processing  elements  must  perform  a  complete  ex¬ 
change  operation  in  order  to  gather  all  of  their  partial 
contours. 

This  is  the  justification  for  the  ring  topology  of  the 
MIMD  section  of  the  architecture.  The  complete  ex¬ 
change  operation  is  performed  efficiently  on  a  ring  topol¬ 
ogy  by  “circulating”  data  values  through  the  ring  systoli- 
cally.  Each  processing  element  is  made  responsible  for  a 
set  of  contours  and  as  sections  of  those  contours  are  re¬ 
ceived,  they  are  saved  in  local  memory.  Sections  of  con¬ 
tours  that  do  not  belong  to  the  set  are  passed  on.  Upon 
receipt  of  all  sections  of  its  set  of  contours,  each  process¬ 
ing  element  must  reconstruct  the  complete  contours  by 
performing  the  contour  merging  step.  Upon  completion, 
linear  approximation  can  proceed  independently  for  each 
contour. The contour merging step requires O(N) communication
steps for a system consisting of N processing
elements.  The  additional  execution  time  complexity  in¬ 
troduced  by  the  contour  merging  step  is  discussed  in  the 
following  section. 
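
The circulation of contour sections around the ring can be simulated with a short Python sketch. The representation is an assumption made for illustration: `fragments[p]` is the list of sections initially held by processing element p, and the hypothetical function `owner(section)` returns the index (between 0 and N - 1) of the processing element responsible for that section.

```python
def ring_exchange(fragments, owner):
    """Simulate a complete exchange on a unidirectional ring of N PEs.

    Returns (kept, steps): the sections each PE ends up with and the
    number of synchronous shift steps that were needed.
    """
    n = len(fragments)
    kept = [[] for _ in range(n)]
    in_flight = []
    for p in range(n):
        # Sections already at their responsible PE are saved locally.
        kept[p].extend(s for s in fragments[p] if owner(s) == p)
        in_flight.append([s for s in fragments[p] if owner(s) != p])
    steps = 0
    while any(in_flight):
        # Every PE passes its buffer to its right-hand neighbor at once.
        in_flight = [in_flight[(p - 1) % n] for p in range(n)]
        steps += 1
        for p in range(n):
            kept[p].extend(s for s in in_flight[p] if owner(s) == p)
            in_flight[p] = [s for s in in_flight[p] if owner(s) != p]
    return kept, steps
```

Since a section can be at most N - 1 hops away from its responsible processing element, the loop ends after at most N - 1 shifts, which agrees with the O(N) communication steps stated above.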
(a) Algorithm speedup.


(b)  Processor  efficiency. 


Figure  4:  Estimated  performance  curves  for  the  contour 
extraction,  contour  merging,  and  linear  approximation 
algorithm  suite. 

6 Performance Analysis

6.1 Complexity Analysis

As shown previously (in tabular form) the time complexity
of the entire linear feature extraction process is

LFE_{seq} = O(i^2) + O(i^2) + O(i^2) + O(i^2 l) + O(n l)

where the terms correspond to the edge detection, edge
thinning, edge linking, contour extraction, and linear
approximation algorithm complexities respectively.

For synchronous, data independent (SIMD) operations
such as those performed by the edge detection, edge
thinning, and edge linking algorithms, researchers have
shown that high degrees of algorithm speedup and processor
efficiency are readily achieved [Little et al., 1987]
[Rosenfeld, 1987] [Prasanna-Kumar and Reisis, 1988].
The determining factor for speedup and efficiency is the
“degree of match” between the pattern of communication
among processes and the interconnect topology of
the architecture (solution of the mapping problem). Our
use of the 2-D mesh for the local communication patterns
of the linear feature extraction algorithm has been shown
by others to perform very well [Lee and Aggarwal, 1987]
[Weems et al., 1991]. High degrees of algorithm speedup
and processor efficiency are readily achieved.

The operations performed by the contour extraction
and linear approximation algorithms do not suffer any
overhead due to process communication and, therefore,
will achieve high degrees of algorithm speedup and processor
efficiency so long as the number of processors is
kept in a range bounded by the problem size. Thus, our
estimate for the time complexity of the parallel implementation
of the five constituent algorithms is

LFE_{par} = O(1) + O(1) + O(1)
            + O(i^2 l / N) + O(n l / N)

(near linear speedup) given an i x i SIMD mesh and N
MIMD processing elements.

The open issue regarding the performance of the parallel
implementation of the linear feature extraction algorithm
is the time complexity introduced by the inclusion
of the contour merging step and its associated communication
requirements. This is where we focus our attention.

To form complete contours from its set of partial contours,
each processing element must perform the following
algorithm:

(contour merging)
for each sub-contour, i
    change ← TRUE
    while change
        change ← FALSE
        for each sub-contour, j ≠ i
            if head(i) = tail(j)
                connect(i, j)
                change ← TRUE
            else if tail(i) = head(j)
                connect(j, i)
                change ← TRUE

The algorithm traverses the list of partial contours (sub-contours),
combining spatially adjacent sub-contours and reducing
the length of the list by one every iteration. Given
that a contour is partitioned into n sub-contours, the
time complexity of the contour merging algorithm is

MERGING = O(\sum_{i=1}^{n} i) = O(n(n+1)/2).
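
The contour merging step can be sketched in Python as follows. The sketch assumes that each sub-contour is a non-empty list of (x, y) coordinates and that two spatially adjacent sub-contours share their boundary point, so that "tail of one equals head of the other" identifies a join; both the representation and the name `merge_subcontours` are illustrative assumptions.

```python
def merge_subcontours(subs):
    """Join sub-contours whose end points coincide.

    Each successful join shortens the list of pieces by one, as in the
    merging loop described in the text.
    """
    pieces = [list(s) for s in subs]
    changed = True
    while changed:
        changed = False
        for i in range(len(pieces)):
            for j in range(len(pieces)):
                if i == j:
                    continue
                if pieces[i][-1] == pieces[j][0]:
                    # tail(i) meets head(j): append j, dropping the shared point.
                    pieces[i] = pieces[i] + pieces[j][1:]
                    del pieces[j]
                    changed = True
                    break
            if changed:
                break  # the list changed; restart the scan
    return pieces
```

Because every ordered pair (i, j) is examined, both tests of the pseudo code (head of one against tail of the other, and the reverse) are covered by the single comparison.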

Then  the  estimate  made  above  (near  linear  speedup)  for 
the  time  complexity  of  the  entire  parallel  implementation 
of  the  linear  feature  extraction  process  is  amended  with 
this  new  term  to 

LFE_{par} = O(1) + O(1) + O(1)
            + O(i^2 l / N) + O(n(n+1)/2) + O(n l / N).

The open issue is thus reduced to one of comparing

O(i^2 l / N) + O(n(n+1)/2) + O(n l / N)

to

O(i^2 l) + O(n l)

for  various  numbers  of  processing  elements,  N,  and  con¬ 
tours,  n.  Although  the  length  and  number  of  sub- 
contours  that  a  single  contour  is  partitioned  into  are  also 
critical, they are bounded by the image size, i, and the
number  of  MIMD  processing  elements.  Figure  4  shows 
the  comparison  of  these  terms  along  with  the  estimated 
algorithm  speedup  and  processor  efficiency  for  the  paral¬ 
lel  implementation.  We  have  assumed  that  all  contours 
are  the  same  length  and  that  each  contour  is  “maxi¬ 
mally”  fragmented  (partitioned  into  N,  the  number  of 
MIMD  processing  elements)  which  is  the  worst  case.  The 
shape  of  these  curves  indicates  that  for  a  256  x  256  pro¬ 
cessing  element  SIMD  mesh,  64  MIMD  PEs  can  be  ef¬ 
fectively  utilized  before  performance  begins  to  degrade. 
We  use  these  curves  for  comparison  to  simulation  data 
in  the  next  section. 

6.2  Observed  Performance 

Figure 5 shows the partitioning of an aerial image into
areas covered by the 64 individual MIMD processing elements.
From this figure one can imagine how increasing
the  number  of  MIMD  processing  elements  would  frag¬ 
ment  long  contours  thus  affecting  the  performance  of  the 
contour  merging  step.  Also,  by  decreasing  the  number 
of  MIMD  PEs  each  would  be  overworked. 

Figure  6  shows  the  observed  performance  curves  (from 
simulation  of  the  parallel  implementation)  for  the  aerial 
image.  The  simulation  reflects  time  required  by  the  al¬ 
gorithm  as  well  as  overhead  due  to  communication  and 
data dependencies. Since this overhead is included in
the simulation, the shape of the curves differs from that
of the estimated curves, but the same trends exist. That
is,  performance  degrades  when  the  number  of  MIMD 
processing  elements  is  increased  beyond  64  processing 
elements. 

6.3  Algorithm  Structure 

Recall  that  one  of  our  goals  is  to  design  a  parallel  imple¬ 
mentation  that  agrees  with  one’s  intuition.  That  is,  one 
that  is  not  complicated  by  the  underlying  architecture. 
We  feel  that  this  is  important  because  of  the  modularity 
of  the  linear  feature  extraction  process  and  the  possibil¬ 
ity  of  changing  constituent  algorithms  at  a  later  date. 

With  regard  to  the  structure  of  the  algorithm  and  our 
resultant  implementation,  our  primary  concern  for  the 
linear  feature  extraction  process  is  the  addition  of  the 
contour merging step. This addition is somewhat contrary
to our desire to retain the basic structure of the
algorithm. But a trade-off is made that is beneficial
to the run-time performance and not too detrimental
to life cycle issues. The addition of the contour merging
step is such that it allows the original algorithms
(edge detection, edge thinning, edge linking, contour extraction,
and linear approximation) to achieve significant
algorithm speedup and processor efficiency through intuitive
and modular implementations. The additional
step is strictly a “mutually exclusive” addition in that
it does not affect the implementations of the original
steps. The structure of the parallel implementation of
the entire process still resembles the structure of an intuitive
implementation with the addition of the contour
merging step. Thus, the effort (cost) required to realize
and maintain the parallel implementation is only slightly
greater than that of the sequential implementation.


Figure 5: Image plane partitioning for 64 MIMD processing
elements.




(a)  Algorithm  speedup. 


(b)  Processor  efficiency. 


Figure  6:  Performance  curves  from  simulation  of  the 
parallel  linear  feature  extraction  implementation  on  the 
aerial  image. 
7 Summary

In this paper we have described a parallel implementation
of the linear feature extraction process. The
problem has not been extensively studied in that most
work stops short at parallel line “finders”, utilizing techniques
such as the Hough Transform [Weems et al., 1991]
[Guerra and Hambrusch, 1989], or simply edge detectors
[Little et al., 1987] [Lee and Aggarwal, 1987] [Weems et
al., 1991]. In these studies the heterogeneous, fine/coarse
grain nature of the problem is not highlighted. Studies
that have considered the entire problem utilized
a single parallel processing architecture that is suited to
either the fine or the coarse grain algorithms and designed
mapping schemes of both types of algorithms onto
that one architecture. In doing so, they either had to
develop hardware specific implementations [Shu et al.,
1990] or settle for reduced performance in some of the
constituent algorithms [Vaillant et al., 1989].

In  our  implementation  we  have  attempted  to  combine 
the  “best  of  both  worlds”  through  the  use  of  a  heteroge¬ 
neous  architecture.  We  achieve  good  algorithm  speedup 
and  processor  efficiency  for  all  of  the  constituent  algo¬ 
rithms  and  the  process  as  a  whole.  Furthermore,  the 
only deviation from the process specification is the addition
of an algorithm to “bridge” the fine and coarse grain
processes (architectures). The architecture we utilize is a
heterogeneous pyramid with a 64-to-1 reduction from the
fine grain (SIMD) to the coarse grain (MIMD) sections,
which concurs with the findings of the Image Understanding
Architecture (IUA) designers in bridging low
and mid-level parallel vision processing. The difference
between our design and one for the IUA described in [Shu
et al., 1990] is our independence from custom “functionality”
within the SIMD level. We were able to achieve this
by  mapping  the  constituent  algorithms  onto  appropriate 
parallel  processor  architectures. 

References 

[Canny,  1983]  J.F.  Canny.  Finding  edges  and  lines  in 
images.  Technical  report,  Massachusetts  Institute  of 
Technology,  June  1983.  Artificial  Intelligence  Labora¬ 
tory. 

[Canny,  1986]  J.F.  Canny.  A  computational  approach 
to  edge  detection.  IEEE  Transactions  on  Pattern 
Analysis  and  Machine  Intelligence,  PAMI-8(6):679- 
698,  November  1986. 

[Guerra  and  Hambrusch,  1989]  C.  Guerra  and  S.  Ham¬ 
brusch.  Parallel  algorithms  for  line  detection  on  a 
mesh.  Journal  of  Parallel  and  Distributed  Computing, 
6(2):1-19,  1989. 

[Lee and Aggarwal, 1987] S.Y. Lee and J.K. Aggarwal.
Parallel 2-D convolution on a mesh connected array
processor. IEEE Transactions on Pattern Analysis
and Machine Intelligence, PAMI-9(9):590-594, July
1987.

[Little et al., 1987] J. J. Little, G. Blelloch, and T. Cass.
Parallel algorithms for computer vision on the Connection
Machine. In Proceedings of the DARPA Image
Understanding Workshop, pages 628-638, February
1987.

[Marr and Hildreth, 1980] D. Marr and E. Hildreth.
Theory of edge detection. Proceedings of the Royal
Society of London, B, 207:187-217, 1980.

[Nevatia and Babu, 1980] R. Nevatia and K. R. Babu.
Linear feature extraction and description. Computer
Vision, Graphics, and Image Processing, 13:257-269,
1980.

[Prasanna-Kumar and Reisis, 1988] V.K. Prasanna-Kumar
and D. Reisis. Parallel architectures for image
processing and vision. In Proceedings of the DARPA
Image Understanding Workshop, pages 609-619, April
1988.

[Reinhart and Nevatia, 1990] C. Reinhart and R. Nevatia.
Efficient parallel processing in high level vision.
In Proceedings of the DARPA Image Understanding
Workshop, pages 829-839, September 1990.

[Rosenfeld, 1987] A. Rosenfeld. A report on the DARPA
image understanding architectures workshop. In Proceedings
of the DARPA Image Understanding Workshop,
pages 298-302, February 1987.

[Shu et al., 1990] D.B. Shu, J.G. Nash, M.M. Eshaghian,
and K. Kim. Straight-line detection on a
gated-connection VLSI network. In Proceedings of the
Tenth International Conference on Pattern Recognition,
pages 456-461, June 1990.

[Vaillant et al., 1989] R. Vaillant, R. Deriche, and
O. Faugeras. 3D vision on the parallel machine CAPITAN.
In International Workshop on Industrial Applications
of Machine Intelligence and Vision, pages
326-331, April 1989.

[Weems and Levitan, 1987] C. C. Weems and S. P. Levitan.
The Image Understanding Architecture. In Proceedings
of the DARPA Image Understanding Workshop,
pages 483-496, February 1987.

[Weems et al., 1991] C. Weems, E. Riseman, and
A. Hanson. The DARPA image understanding benchmark
for parallel computers. Journal of Parallel and
Distributed Computing, 11(1):1-24, 1991.




Parallel  Algorithms  for  Stereo  and  Image  Matching 


Ashfaq  Khokhar  and  Viktor  K.  Prasanna  ' 
Department  of  EE-Systems,  EEB  244 
University  of  Southern  California 
Los  Angeles,  CA  90089-2562 
email: {ashfaq + prasanna}@halcyon.usc.edu


Abstract 

In this paper, we summarize our progress in parallelizing
two high level vision tasks: stereo matching and image
matching. We show processor-time optimal algorithms
for stereo and image matching with linear features as
primitives, based on the matching techniques developed by
the vision group at USC. These algorithms are designed
to execute on fixed size arrays.

1  Introduction 

Parallel  processing  has  been  used  in  computer  vision  over 
the  past  two  decades.  However,  most  of  these  solutions 
have  addressed  problems  in  low-level  and  mid-level  vision 
[12].  This  paper  presents  a  summary  of  our  research  in 
parallelizing two high-level vision tasks: stereo and image
matching using linear features as primitives. Stereo
matching  is  one  of  the  well  known  methods  for  extraction 
of  depth  information.  Depth  recovery  is  a  crucial  prob¬ 
lem  in  image  understanding  with  applications  in  robotics 
and  navigation.  Also,  image  matching  is  a  fundamental 
operation  in  machine  vision  and  is  a  key  step  in  object 
recognition. 

For stereo matching, we propose an O(Nn^3/P) time algorithm
on a P processor linear array, where N is the
number of line segments in one image, n is the number
of line segments in a window determined by the object
size, and P < n. This algorithm is extended to a
mesh array of P x P processors to run in O(Nn^3/P^2) time.
The sequential algorithm takes O(Nn^3) time. For image
matching,  we  first  propose  a  fast  sequential  algorithm 
which  runs  in  0(n^m^)  time,  where  n  is  the  number  of 
line  segments  in  the  image  and  m  is  the  number  of  line 
segments  in  the  model.  Previously  known  approaches 
to  the  image  matching  problem  take  O(n^m^)  time.  A 
parallel  algorithm  is  developed  using  the  proposed  se¬ 
quential  algorithm.  0((^  -h  P)nm)  time  performance 
is  achieved  on  a  P  processor  fixed  size  linear  array,  where 
P  <  nm.  This  leads  to  a  processor-time  optimal  solu¬ 
tion  for  P  <  y/nm.  Also,  a  mesh  algorithm  for  image 

’This  research  was  supported  by  the  Defense  Advanced 
Research  Projects  Agency  under  contract  F49620-90-C-0078, 
monitored  by  the  Air  Force  Office  of  Scientific  Research.  The 
United  States  Government  is  authorized  to  reproduce  and 
distribute  reprints  for  governmental  purposes  notwithstand¬ 
ing  any  copyright  notation  hereon. 


matching  is  derived  which  runs  in  0((^  -i-P)nT7i)  time 
performance  is  achieved  on  a  P  x  P  processor  array.  All 
the proposed parallel algorithms achieve linear speed-up
compared with the corresponding sequential algorithms.

2  Fixed  Size  Arrays 

Two  parallel  architectures  are  used  in  our  algorithms: 


A: Fixed Size Mesh Array: A fixed size mesh array is
a two dimensional array of P x P processors, where P^2
is less than or equal to the problem size. Each processor
PE_{i,j} is connected to PE_{i+1,j}, PE_{i-1,j}, PE_{i,j-1}, and
PE_{i,j+1}, if they exist. A memory plane of P x P memory
modules (MMs) is provided. Each PE_{i,j} is attached
to memory module MM_{i,j}. The architecture is shown in
Fig. 1.


Figure 1: A Fixed Size Mesh Array
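
The connectivity of the mesh model can be written down directly. The helper below is a small illustrative sketch (the name `mesh_neighbors` is an assumption); it lists the processors adjacent to PE_{i,j} in a P x P mesh and omits those that fall outside the array, matching the "if they exist" condition.

```python
def mesh_neighbors(i, j, P):
    """Indices of the PEs directly linked to PE(i, j) in a P x P mesh."""
    candidates = [(i + 1, j), (i - 1, j), (i, j - 1), (i, j + 1)]
    # Border and corner PEs have fewer than four neighbors.
    return [(a, b) for a, b in candidates if 0 <= a < P and 0 <= b < P]
```

A corner processor has two neighbors, a border processor three, and an interior processor four.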


B: Fixed Size Linear Array: A fixed size linear array
is a one dimensional array of P processors, where
P is less than or equal to the problem size. Processor
i is denoted as PE_i, where 0 ≤ i ≤ P - 1. Each PE_i
is attached to an external memory module (MM_i). The
architecture is shown in Fig. 2.

Figure 2: A Fixed Size Linear Array.

Figure 3: Determining a Stereo-window

In both the models, processors are connected through
bidirectional local links and the arrays operate in SIMD
mode. The following assumptions are made regarding the
computations in these models:

• Each arithmetic/logic operation performed in a PE
takes O(1) time.

• Each access by PE_i to memory module MM_i, 0 ≤
i ≤ P - 1, takes O(1) time.

• Each access by PE_{i,j} to memory module MM_{i,j}, 0 ≤
i, j ≤ P - 1, takes O(1) time.

• A unit data transfer between adjacent PEs takes
O(1) time.

• Each PE has indirect addressing capability.

3  Stereo  Matching  on  Fixed  Size  Arrays 

In stereo matching, two images, a left and a right image,
captured at the same time but at different angles, are
matched. Various stereo matching algorithms differ with
respect  to  the  primitives  used  for  matching  [3].  Each 
technique  has  its  own  advantages  and  disadvantages. 
Stereo  matching  using  linear  features  is  capable  of  han¬ 
dling  more  complex  scenes  (such  as  those  containing 
repetitive  structures)  [9].  In  this  section  we  provide  a 
fast  parallel  implementation  of  the  stereo  matching  al¬ 
gorithm  (also  called  the  Minimum  Disparity  Algorithm) 
described  in  [9].  For  the  sake  of  completeness,  the  main 
ideas  of  this  algorithm  are  presented  in  Section  3.1. 

3.1  Minimum  Disparity  Algorithm 

The technique attempts to match overlapping segments
detected along the same epipolar line, having similar
contrast and orientation. Following the terminology in
[9], for each segment a_i in the left image, a match is found
in a window w(a_i) defined in the right image. Similarly,
for each segment b_j in the right image, a match is found
in w(b_j) defined in the left image. The shape of the window
is a parallelogram. This is shown in Fig. 3 for the left
to right match: one side corresponds to a_i, and the other
side is a horizontal vector of length 2d_max, where d_max
is the maximum disparity. The number of segments in
each window is assumed to be at most n and both the
images are assumed to have N segments each.

For each a_i, a set S_p(a_i) of possible matches in window
w(a_i) is defined based on the contrast, overlap, and
orientation. Similarly for each b_j, a set S_p(b_j) is defined.
To assign unambiguous matches, a set of matches
is considered together for each segment in the image. For
each possible element j in S_p(a_i), an evaluation function
v(i, j) is computed, which is dependent on how well the
disparities of the other line segment matches in w(b_j)
agree with the average disparity of the matching pair.
A set of preferred matches Q^t(a_i) is constructed for each
i during iteration t: b_j is placed in Q^t(a_i) if the following holds:

∀k ∈ S_p(a_i) such that b_k ↔ b_j, v^t(i, j) < v^t(i, k)   (1)

and

∀h ∈ S_p(b_j) such that a_h ↔ a_i, v^t(i, j) < v^t(h, j).   (2)

The relation b_k ↔ b_j is true if b_k overlaps b_j.
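
The selection of possible matches and the test for a preferred match can be sketched in Python. Everything about the representation is an assumption made for illustration: epipolar lines are taken to be image rows, a segment is reduced to a midpoint column `x`, a row range `y0..y1`, a contrast and an orientation, the window is approximated by a horizontal displacement of at most `d_max`, and the tolerances `c_tol` and `theta_tol` are hypothetical. The comparison of a pair with itself is skipped, since a strict inequality could never hold there.

```python
from dataclasses import dataclass


@dataclass
class Segment:
    x: float         # column of the segment midpoint
    y0: float        # first row covered by the segment
    y1: float        # last row covered (y1 >= y0)
    contrast: float
    theta: float     # orientation in degrees


def overlap(a, b):
    """Length of the row range shared by two segments (0 if none)."""
    return max(0.0, min(a.y1, b.y1) - max(a.y0, b.y0))


def possible_matches(a, others, d_max, c_tol=0.2, theta_tol=20.0):
    """Indices of segments in `others` that are candidates for `a`."""
    out = []
    for j, b in enumerate(others):
        in_window = abs(b.x - a.x) <= d_max and overlap(a, b) > 0
        similar_contrast = abs(b.contrast - a.contrast) <= c_tol * abs(a.contrast)
        similar_theta = abs(b.theta - a.theta) <= theta_tol
        if in_window and similar_contrast and similar_theta:
            out.append(j)
    return out


def is_preferred(i, j, v, sp_a, sp_b, left, right):
    """Conditions (1) and (2): (i, j) beats every competing overlapping pair.

    v maps (left index, right index) to the current evaluation value.
    """
    cond1 = all(v[i, j] < v[i, k] for k in sp_a[i]
                if k != j and overlap(right[k], right[j]) > 0)
    cond2 = all(v[i, j] < v[h, j] for h in sp_b[j]
                if h != i and overlap(left[h], left[i]) > 0)
    return cond1 and cond2
```

In this sketch a smaller evaluation value is better, which is consistent with the direction of the inequalities in conditions (1) and (2).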


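v(i, j) is defined as follows:

v^{t+1}(i, j) = \sum_{a_h in w(b_j)} \min_{b_k verifies C_1(a_h)} \lambda_{ijhk} |d_{hk} - d_{ij}| / card(b_j)
              + \sum_{b_k in w(a_i)} \min_{a_h verifies C_2(b_k)} \lambda_{ijhk} |d_{hk} - d_{ij}| / card(a_i)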

In the above equation, t + 1 indicates the iteration
number, \lambda_{ijhk} = min(overlap(i, j), overlap(h, k)), and
card(a_i) is the number of segments in w(a_i). The relations
C_1 and C_2 are defined as follows. We say b_k verifies
C_1(a_h) if:

1. If Q^t(a_h) ≠ ∅, b_k is in Q^t(a_h), else b_k is in S_p(a_h),

2. Either b_k ≠ b_j, or a_h and a_i do not overlap.

The algorithm terminates after a constant number
of iterations [9] and each iteration takes O(Nn^3) time.
Each time unit corresponds to a simple arithmetic/logic
operation. A formal description of the algorithm is given
in Fig. 4. Also, procedure i-pref(i) mentioned in Fig. 4
is shown in Fig. 5. A similar procedure for j-pref(j) can be
developed.

3.2  Parallel  Stereo  Matching 

In  this  section,  we  first  give  a  parallel  version  of  the  al¬ 
gorithm  described  in  Fig.  4  and  then  provide  parallel 
implementations  of  procedure  i-pref(i)  on  fixed  size  ar¬ 
rays.  We  have  also  devised  an  efficient  data  partitioning 
strategy  for  the  stereo  matching  problem. 

In Fig. 6, procedures Parallel-i-pref(i) and Parallel-j-pref(j)
determine the partial preferred matches, QT(a_i)
and QT(b_j), for a_i and b_j respectively. The third procedure
Parallel-Q-update is used to determine the new sets
of preferred matches, Q(a_i) and Q(b_j), for a_i and b_j, respectively.
In the following sections, | denotes a mod b.

1.  repeat
2.      change ← 0;
3.      for i = 1 to N do
4.          for each j such that j ∈ w(a_i) do
5.              i-pref(i, j, QT(a_i))
6.          end
7.      end;
8.      for j = 1 to N do
9.          for each i such that i ∈ w(b_j) do
10.             j-pref(j, i, QT(b_j))
11.         end
12.     end;
13.     for i = 1 to N do
14.         for each j such that j ∈ w(a_i) do
15.             Q-update(i, j)
16.         end
17.     end
18.     for i = 1 to N do Q(a_i) ← Q'(a_i);
19.     for j = 1 to N do Q(b_j) ← Q'(b_j);
20. until (change = 0)

Figure 4: Sequential Algorithm for Stereo Matching

3.2.1  Data  Partitioning 

In stereo matching, the input to the algorithm is a set of
N segments from the right image and a set of N segments
from the left image. As described in Section 3.1,
each segment is represented by its length, contrast and
orientation. With each pair (a_i, b_j) such that a_i and b_j
overlap and have similar contrast and similar orientation,
an average disparity d_ij is associated. In order to find a
possible match for each segment a_i in the left image, a
window w(a_i) is defined in the right image. Similarly,
for each segment b_j in the right image, a window w(b_j)
is defined in the left image. Each window is assumed
to contain at most n segments. An efficient partitioning
algorithm is devised in [5]. The running time of the
algorithm is O(Nn log P).

3.2.2  Partitioned  Implementation  on  a  Fixed 
Size  Mesh  Array 

Based on the algorithm shown in Fig. 5, a parallel
algorithm on a P × P mesh array is developed. A similar
procedure can be developed for Parallel-j-pref(j).

In each i-loop, PE_{h mod P, j mod P}, 0 ≤ h, j ≤ n − 1
(or PE_{k mod P, j mod P} for the second part of the loop),
is used for the computation of min(i, j, h) (or min(i, j, k)).
The information required to accomplish the computation in
PE_{h mod P, j mod P} (or PE_{k mod P, j mod P}) includes:

1. if t = 0, then Sp(a_h) (or Sp(b_k)), else Q^t(a_h) (or
Q^t(b_k)),

2. d_hk, 0 ≤ k ≤ n − 1 (or d_hk, 0 ≤ h ≤ n − 1), and

3. λ_ijhk, 0 ≤ k ≤ n − 1 (or λ_ijhk, 0 ≤ h ≤ n − 1).


procedure i-pref(i, j, QT(a_i))

    for each h such that h ∈ w(b_j) do
        for each k such that b_k verifies C_1(a_h) do
            min_h ← min(min_h, λ_ijhk |d_hk − d_ij|);
        end;
        sum1(i, j) ← sum1(i, j) + min_h;
    end;
    ave1(i, j) ← sum1(i, j) / card(b_j);
    for each k such that k ∈ w(a_i) do
        for each h such that a_h verifies C_2(b_k) do
            min_k ← min(min_k, λ_ijhk |d_hk − d_ij|);
        end;
        sum2(i, j) ← sum2(i, j) + min_k;
    end;
    ave2(i, j) ← sum2(i, j) / card(a_i);
    sum(i, j) ← ave1(i, j) + ave2(i, j);
    case:
        sum(i, j) < Min(i):
            QT(a_i) ← {j};
            Min(i) ← sum(i, j);
        sum(i, j) = Min(i):
            QT(a_i) ← QT(a_i) ∪ {j};
    end

Figure 5: Finding Partially Preferred Matches for the
Left Image
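The procedure in Fig. 5 can be sketched in Python as follows. This is a minimal illustration, not the authors' implementation: the function names `pair_cost` and `partial_preferred`, the callables `c1`, `c2` and `lam`, and the dictionary `d` of disparities are all hypothetical. It also assumes that a neighbour with no verifying partner contributes nothing to the sum, which the figure does not specify.

```python
def pair_cost(i, j, w_bj, w_ai, c1, c2, lam, d):
    """Return sum(i, j) = ave1(i, j) + ave2(i, j) for the candidate pair (a_i, b_j).

    w_bj : indices h of left segments a_h in the window w(b_j)   (assumed non-empty)
    w_ai : indices k of right segments b_k in the window w(a_i)  (assumed non-empty)
    c1(h): iterable of k such that b_k verifies C_1(a_h)
    c2(k): iterable of h such that a_h verifies C_2(b_k)
    lam  : lam(i, j, h, k) gives the overlap weight lambda_ijhk
    d    : dict mapping (h, k) to the average disparity d_hk
    """
    d_ij = d[i, j]

    def term(h, k):
        return lam(i, j, h, k) * abs(d[h, k] - d_ij)

    sum1 = 0.0
    for h in w_bj:
        terms = [term(h, k) for k in c1(h)]
        if terms:  # assumption: skip neighbours with no verifying partner
            sum1 += min(terms)
    ave1 = sum1 / len(w_bj)

    sum2 = 0.0
    for k in w_ai:
        terms = [term(h, k) for h in c2(k)]
        if terms:
            sum2 += min(terms)
    ave2 = sum2 / len(w_ai)
    return ave1 + ave2


def partial_preferred(i, w_ai, cost):
    """Return (QT(a_i), Min(i)): all j in w(a_i) that attain the minimum cost."""
    best = float("inf")
    qt = set()
    for j in w_ai:
        c = cost(i, j)
        if c < best:
            best, qt = c, {j}
        elif c == best:
            qt.add(j)
    return qt, best
```

Keeping ties in the returned set mirrors the two branches of the `case` statement in the figure, where an equal cost extends QT(a_i) instead of replacing it.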

repeat
    change ← 0;
    for i = 1 to N do
        Parallel-i-pref(i);
    for j = 1 to N do
        Parallel-j-pref(j);
    for i = 1 to N do
        Parallel-Q-update(i);
until (change = 0)

Figure 6: Parallel Algorithm for Stereo Matching

The main steps of procedure Parallel-i-pref(i) are
briefly discussed in the following, with the corresponding
execution time indicated within parentheses. Factors lost
in the source are marked [?].

1. ∀ j ∈ w(a_i), load d_hk and λ_ijhk, O(n) data to
PE_{h mod P, j mod P}, O(n^[?]) data in all. (O([?]))

2. PE_{0,0} broadcasts d_ij to all the processors in the
array. (O(P))

3. Perform the min operation over all k to determine
min(i, j, h) in PE_{h mod P, j mod P}. (O(([?])n))

4. Along each column of processors, i.e. ∀j, all the
min(i, j, h) values are summed up and saved in
PE_{P−1, j mod P}. (O(([?])P))

5. In each PE_{P−1, j mod P}, compute the average.

6. ∀ j ∈ w(a_i), load d_hk and λ_ijhk, O(n) data to
PE_{k mod P, j mod P}, O(n^[?]) data in total. (O([?]))

7. Perform the min operation over all h to determine
min(i, j, k) in PE_{k mod P, j mod P}. (O(([?])n))

8. Along each column of processors, i.e. ∀j, all the
min(i, j, k) values are summed up and saved in the
last processor PE_{P−1, j mod P}. (O(([?])P))

9. In each PE_{P−1, j mod P}, ∀ j ∈ w(a_i), compute the
average of the sum obtained in step 8 and add it to the
average obtained in step 5. (O(1))

10. Along the last row of processors, find the minimum
of all the values obtained in step 9, and store
the corresponding b_j back in the memory, which is
QT(a_i). (O(n))

Similar steps can be designed for the procedure
Parallel-j-pref(j).

It can be easily verified that for each a_i the procedure
Parallel-i-pref(i) runs in O(n^3/P^2) time, with each
time unit corresponding to a simple arithmetic/logic
operation. The resulting QT(a_i)'s and QT(b_j)'s can
then be combined in constant time by using the procedure
Parallel-Q-update. Therefore, each iteration takes
O(Nn^3/P^2) time.
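The load balance behind this bound can be illustrated with a small sketch. It assumes, as a reading of the "a mod b" notation rather than a confirmed detail, that the work for the index pair (h, j) is assigned to the processor in row h mod P and column j mod P. The names `pe_of` and `load_per_pe` are hypothetical.

```python
from collections import Counter


def pe_of(h, j, P):
    """Mesh coordinates of the PE assumed to handle the index pair (h, j)."""
    return (h % P, j % P)


def load_per_pe(n, P):
    """Count how many of the n*n (h, j) pairs land on each PE of a P x P mesh."""
    return Counter(pe_of(h, j, P) for h in range(n) for j in range(n))


# Each PE receives about (n/P)^2 pairs; with O(n) work per pair this gives
# roughly n^3 / P^2 operations per PE.
load = load_per_pe(8, 4)
print(max(load.values()))
```

Under this assumed mapping, no PE holds more than ceil(n/P)^2 pairs, so the per-PE work stays within a constant factor of n^3/P^2 when each pair costs O(n) operations.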

3.2.3  Partitioned  Implementation  on  a  Fixed 
Size  Linear  Array 

The main steps of procedure Parallel-i-pref(i) on a
fixed size linear array of P processors are briefly
discussed in the following, with the corresponding execution
time indicated within the parentheses. Factors lost in the
source are marked [?].

1. ∀ h ∈ w(b_j), load d_hk and λ_ijhk, O(n) data to
PE_{h mod P}, O(n^2) data in all. (O([?]))

2. PE_0 broadcasts d_ij to all the PEs, 0 ≤ i, j ≤ n − 1.
(O(P))

3. Perform the min operation over all k to determine
min(i, j, h) in PE_{h mod P}. (O(([?])n))

4. Along the linear array, i.e. ∀h, all min_h values are
summed up and saved in PE_0. (O(([?])P))

5. Compute the average in PE_0.

6. ∀ k ∈ w(a_i), load d_hk and λ_ijhk, O(n) data to
PE_{k mod P}, O(n^2) data in all. (O([?]))

7. ∀ k ∈ w(a_i), perform the min operation over all h
to determine min(i, j, k) in PE_{k mod P}. (O(([?])n))

8. Along the linear array, i.e. ∀k, all min_k values are
summed up and saved in PE_0. (O(([?])P))

9. Take the average of the sum obtained in step 8 and
add it to the average obtained in step 5.

10. ∀j, find the minimum of all the values obtained in
step 9, which is QT(a_i), and store it back in the
memory. (O(n))

Similar steps can be designed for the corresponding
j-pref procedure on the linear array.

For each i, PE_{i mod P} is assigned to store the value
QT(a_i) obtained at the end of the procedure. It can be
easily verified that for each i the procedure
Parallel-i-pref(i) runs in O(n^3/P) time.

All the resulting QT(a_i)'s and QT(b_j)'s can then
be combined in constant time by using the procedure
Parallel-Q-update. Therefore, each iteration takes
O(Nn^3/P) time, which represents optimal speedup.

For details of these implementations, refer to [5; 6].
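The speedup claim can be checked with a short operation count. The sketch below assumes the loop bounds read off Figs. 4 and 5 (N segments per image, at most n segments per window, and two doubly nested loops inside i-pref). The names T_seq, T_lin and T_mesh are introduced here only for illustration.

```latex
\begin{align*}
T_{\mathrm{seq}}
  &= \underbrace{N}_{\text{segments } a_i}
     \cdot \underbrace{n}_{j \in w(a_i)}
     \cdot \underbrace{O(n^2)}_{\text{one call of i-pref}}
   = O(Nn^3), \\
P \cdot T_{\mathrm{lin}}(P)
  &= P \cdot O\!\left(\frac{Nn^3}{P}\right) = O(Nn^3), \\
P^2 \cdot T_{\mathrm{mesh}}(P)
  &= P^2 \cdot O\!\left(\frac{Nn^3}{P^2}\right) = O(Nn^3).
\end{align*}
```

In both cases the processor-time product matches the sequential operation count, which is the sense in which the speedup is optimal.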

4  Image  Matching  on  Fixed  Size  Arrays 

The image matching problem plays a key role in object
recognition. In the past, several approaches have been
proposed for this problem [13; 15; 2], which, in general,
differ with respect to the primitives used for matching. In
this section, we consider image matching using linear
features [10] for parallel implementation. Readers can
refer to [10] for additional details of the matching
technique. We begin with the basic idea of this approach and
then present processor-time optimal parallel algorithms
on fixed size arrays.

4.1  Matching  Technique 

In general, in the image matching problem, we have
n objects, {o_1, o_2, ..., o_n}, in the scene and m labels,
{l_1, l_2, ..., l_m}, in the model. Here, the objects are
segments in the scene derived from edge detectors and are
described by the coordinates of their end points,
orientation and average contrast. The matching technique
computes the quantity v_ip, in {0, 1}, which is the
possibility of assigning label l_p to object o_i.

The method [10] relies on geometric constraints,
which means that when a label l_p is assigned to object
o_i, we expect to find an object o_j with assigned label
l_q in an area depending on i, p, q. The match-window
W(i, p, q) denotes the area described by the parameters
i, p, q. By representing the object o_i with a vector A_iB_i,
the label l_p with C_pD_p and the label l_q with C_qD_q, we
can determine the four extreme points, W_1, W_2, W_3, W_4, of
the induced match-window W(i, p, q) using the following
relations (μ denotes the scaling factor known beforehand):

• A_iW_1 = μ · C_pC_q,  W_1W_2 = μ · C_qD_q,

• B_iW_3 = μ · D_pC_q,  W_3W_4 = μ · C_qD_q.

Fig. 7 shows the relationship between the window and
the segments.
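As an illustration, the four extreme points can be computed with plain vector arithmetic. The sketch below assumes the relations read as A_iW_1 = μ·C_pC_q, W_1W_2 = μ·C_qD_q, B_iW_3 = μ·D_pC_q and W_3W_4 = μ·C_qD_q, with each pair of letters denoting the vector from the first point to the second. The function names and the 2-D tuple representation are hypothetical.

```python
def add(point, vec):
    """Translate a 2-D point by a 2-D vector."""
    return (point[0] + vec[0], point[1] + vec[1])


def scaled_vec(src, dst, mu):
    """Vector from src to dst, scaled by mu."""
    return (mu * (dst[0] - src[0]), mu * (dst[1] - src[1]))


def match_window(A, B, Cp, Dp, Cq, Dq, mu):
    """Extreme points W1..W4 of W(i, p, q) for object AB and labels CpDp, CqDq."""
    W1 = add(A, scaled_vec(Cp, Cq, mu))   # A W1  = mu * Cp Cq
    W2 = add(W1, scaled_vec(Cq, Dq, mu))  # W1 W2 = mu * Cq Dq
    W3 = add(B, scaled_vec(Dp, Cq, mu))   # B W3  = mu * Dp Cq
    W4 = add(W3, scaled_vec(Cq, Dq, mu))  # W3 W4 = mu * Cq Dq
    return W1, W2, W3, W4


# Model labels one unit apart vertically, scene at twice the model scale.
print(match_window((0, 0), (2, 0), (0, 0), (1, 0), (0, 1), (1, 1), 2.0))
```

When the object is an exact scaled copy of label l_p, as in the usage line, the two candidate placements coincide; when the object is only a fragment of the label, they differ and together bound the window.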
Figure 7: Determining a match-window

The meaning of compatibility is defined as follows:

    <i, p> is compatible with <j, q> iff o_i is in
    W(j, q, p) and o_j is in W(i, p, q).

Let Ω_ij[p, q] denote the compatibility of assigning label
l_p to object o_i and label l_q to object o_j. A weak notion of
consistency is used to determine whether an assignment
is feasible. A predetermined confidence factor, δ < m,
is used to decide the feasibility of <i, p> as in the
following update statement during an iteration:*

    For every i, p, v'_ip ← v_ip AND 'condition A',
    where 'condition A' is true if (∃ S ⊆ {1, 2, ..., m}
    with ||S|| = δ, such that for every q ∈ S,
    ∃ j ∈ {1, 2, ..., n} such that v_jq = 1 and
    Ω_ij[p, q] = 1) and is false otherwise.

The algorithm stops when for all i, p, v'_ip = v_ip. We can
rewrite the above update statement as

\[
v'_{ip} = v_{ip} * \bigwedge_{q=1}^{\delta,\,m} \left[ \sum_{j=1}^{n} \left( v_{jq} * \Omega_{ij}[p,q] \right) \right], \tag{3}
\]

where

\[
\bigwedge_{q=1}^{\delta,\,m} x_q =
\begin{cases}
1 & \text{if } \sum_{q=1}^{m} x_q \ge \delta, \\
0 & \text{otherwise.}
\end{cases} \tag{4}
\]

Note that the operation \sum_{j=1}^{n} in equation 3 is a
logical OR operation, while the operation \sum_{q=1}^{m} in
equation 4 is an arithmetic ADD operation.

With the modified update statement given by equation 4,
we have designed a faster sequential algorithm
which is easier to parallelize compared with the one
proposed in [10]. This algorithm is an extension of the
discrete relaxation algorithm developed in [8] and it takes
O(n^2 m^2) time. Each time unit corresponds to a simple
arithmetic/logic operation. The original algorithm [10]
runs in O(n^[?] m^[?] d w) time (exponents lost in the
source), where d is the density of the segments and w is
the window size. In the worst case, d and w can be n and m
respectively. For more details refer to [5].
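The update statement can be made concrete with a direct, unoptimized sketch. It is not the faster algorithm of [5]; it simply re-evaluates the condition for every pair until nothing changes. The assumptions are that the compatibility is supplied as a callable `omega(i, j, p, q)` returning 0 or 1, that `v` is an n-by-m list of 0/1 values, and that `delta` is the confidence factor. All three names are hypothetical.

```python
def update_once(v, omega, delta):
    """One pass of the update: keep <i, p> only if at least delta labels q
    each have some object j with v[j][q] = 1 and omega(i, j, p, q) = 1."""
    n, m = len(v), len(v[0])
    new_v = [[0] * m for _ in range(n)]
    for i in range(n):
        for p in range(m):
            if v[i][p] == 0:
                continue  # an assignment never becomes feasible again
            supported = 0
            for q in range(m):
                # logical OR over j, then arithmetic count over q
                if any(v[j][q] and omega(i, j, p, q) for j in range(n)):
                    supported += 1
            new_v[i][p] = 1 if supported >= delta else 0
    return new_v


def relax(v, omega, delta):
    """Iterate update_once until the assignment matrix stops changing."""
    while True:
        new_v = update_once(v, omega, delta)
        if new_v == v:
            return v
        v = new_v
```

Each pass costs O(n^2 m^2) operations in this naive form, and several passes may be needed, which is the cost the counter-based formulation is designed to avoid.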


{ Initialization }

Initialize all Ω_ij[p, q]'s in parallel;
parallel do (in PE_ip, 1 ≤ i ≤ n, 1 ≤ p ≤ m)
    v_ip ← 1;
    T_ip ← 0;
    for q = 1 to m do
        N_ip[q] ← 0;
        for j = 1 to n do
            if (Ω_ij[p, q] = 1) then N_ip[q] ← N_ip[q] + 1;
        end;
        if (N_ip[q] ≠ 0) then T_ip ← T_ip + 1;
    end;
    if (T_ip < δ) then do
        Send_ip ← 1;
        v_ip ← 0;
    end;
parallel end;

{ Iteration }

repeat
    parallel do (in PE_ip, 1 ≤ i ≤ n, 1 ≤ p ≤ m)
        if (Send_ip = 1) then
            send Id <i, p> to all the PEs;
        { if (no broadcast Id acknowledged)
              then stop;
          else a broadcast Id, say <j, q>, is
              acknowledged by all PEs; }*
        if (<i, p> = <j, q>) then Send_ip ← 0;
        if ((v_ip = 1) AND (Ω_ij[p, q] = 1)) then do
            N_ip[q] ← N_ip[q] − 1;
            if (N_ip[q] = 0) then T_ip ← T_ip − 1;
            if (T_ip < δ) then do
                Send_ip ← 1;
                v_ip ← 0;
            end
        end
    parallel end
forever

*: the code inside braces is the broadcast operation.

Figure 8: Parallel Algorithm for Image Matching


4.2  Parallel  Image  Matching 

In this section, we present parallel algorithms for the
image matching problem. Since the size of a match-window
determined by any object and two labels is much smaller
than the size of the complete image, the number of
initially assignable segments in any of the match-windows
for each object is much smaller than the total number
of objects in the image. This allows us to obtain a
partitioned implementation in which each processor is
responsible for more than one <object, label> pair.

A parallel algorithm is shown in Fig. 8. Each v_ip
is associated with m + 1 counter variables. These are
N_ip[q], 0 ≤ q ≤ m − 1, and T_ip. These counter variables
have the following definitions:

• N_ip[q] denotes the number of 1's in the n entries of
Ω_ij[p, q], 0 ≤ j ≤ n − 1, and

• T_ip denotes the number of nonzero N_ip[q]'s,
0 ≤ q ≤ m − 1.

For v_ip, each of the m N_ip[q]'s is used to determine
whether there is an object for label l_q, i.e., whether we
can find any object to be labelled with l_q when o_i is
labelled with l_p. T_ip is used to determine if there are at
least δ such compatible labelings when o_i is labelled with
l_p. Each infeasible pair <i, p> (having T_ip < δ) is
broadcast to all PEs. Each PE checks if it holds any pair
<j, q> that is compatible with <i, p> and, if so, decrements
the corresponding N counter of <j, q>. Whenever such a
counter becomes zero, the T counter of <j, q> is
decremented. As a result, if that T counter falls below δ,
pair <j, q> is marked infeasible.
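The counter scheme can be simulated sequentially to see how infeasibility propagates. The sketch below is an illustration under stated assumptions, not the parallel algorithm itself: a queue stands in for the broadcast mechanism, the compatibility is a hypothetical callable `omega(i, j, p, q)` returning 0 or 1, and `relax_with_counters` is a name introduced here.

```python
from collections import deque


def relax_with_counters(n, m, omega, delta):
    """Sequential simulation of the counter-based relaxation of Fig. 8."""
    v = [[1] * m for _ in range(n)]
    # N[i][p][q]: number of objects j with omega(i, j, p, q) = 1
    N = [[[sum(omega(i, j, p, q) for j in range(n)) for q in range(m)]
          for p in range(m)] for i in range(n)]
    # T[i][p]: number of labels q with a nonzero counter N[i][p][q]
    T = [[sum(1 for q in range(m) if N[i][p][q] > 0) for p in range(m)]
         for i in range(n)]

    pending = deque()  # stands in for the Ids waiting to be broadcast
    for i in range(n):
        for p in range(m):
            if T[i][p] < delta:
                v[i][p] = 0
                pending.append((i, p))

    while pending:
        j, q = pending.popleft()  # "broadcast" the infeasible pair <j, q>
        for i in range(n):
            for p in range(m):
                if v[i][p] == 1 and omega(i, j, p, q):
                    N[i][p][q] -= 1
                    if N[i][p][q] == 0:
                        T[i][p] -= 1
                        if T[i][p] < delta:
                            v[i][p] = 0
                            pending.append((i, p))
    return v
```

Each infeasible pair is queued once and triggers one scan of the nm pairs, which is consistent with an O(n^2 m^2) sequential bound once the counters are initialized.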

4.2.1  Partitioned  Algorithms  on  Fixed  Size 
Arrays 

Based on the algorithm shown in Fig. 8, partitioned
algorithms on fixed size mesh and linear arrays are
obtained. These algorithms differ with respect to data
routing schemes.

In a mesh array of P × P processors, each of the P^2
PEs processes nm/P^2 distinct v_ip values (nm/P values in
the case of a linear array). The data stored in each memory
module (MM) include the nm/P^2 v_ip values (nm/P values in
the linear array), the corresponding nm Ω_ij[p, q] values,
m N_ip[q] counter values and the T_ip variable. Also, a flag
is stored in each MM for each v_ip to indicate whether the
infeasibility has been acknowledged by all the PEs. Such an
acknowledgment triggers the necessary update of the
corresponding counters in each PE. Also, in each PE, an
extra flag is used to indicate if at least one such
infeasible assignment is yet to be acknowledged.

* ||S|| denotes the cardinality of S

An initialization procedure is executed in each PE
to initialize the m counter variables for each of its
v_ip values. Based on the condition defined in Section
4.1, each PE sets the corresponding flag for an infeasible
assignment and retains the Id of one such assignment
for later broadcast. This can be performed in
O(n^2 m^2 / P^2) time on a P × P mesh (O(n^2 m^2 / P) time
on the linear array). During each iteration, a 'collect'
operation is first executed. The purpose of this operation
is to gather the Ids retained in all the PEs at the
end of the previous iteration. The 'collect' operation
can be executed in O(P) time on both arrays. Details
of the data routing algorithms can be found in [5; 6].

The execution time of the mesh algorithm is
O((nm/P^2 + P)nm). This implementation leads to a
processor-time optimal solution when P ≤ (nm)^{1/3}.
Additional details can be found in [6].

The total execution time of the linear array algorithm
is O((nm/P + P)nm). This implementation leads to a
processor-time optimal solution when P ≤ √(nm). Further
details can be found in [5].
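The two optimality conditions follow from comparing the processor-time product with the sequential bound. The derivation below is a sketch that assumes the execution times read as O((nm/P^2 + P)nm) for the mesh and O((nm/P + P)nm) for the linear array, and a sequential time of O(n^2 m^2).

```latex
\begin{align*}
\text{Mesh } (P^2 \text{ PEs}):\quad
  P^2 \cdot \left(\frac{nm}{P^2} + P\right) nm
  &= n^2 m^2 + P^3\, nm
   = O(n^2 m^2)
   \iff P^3 = O(nm)
   \iff P = O\!\left((nm)^{1/3}\right), \\
\text{Linear array } (P \text{ PEs}):\quad
  P \cdot \left(\frac{nm}{P} + P\right) nm
  &= n^2 m^2 + P^2\, nm
   = O(n^2 m^2)
   \iff P^2 = O(nm)
   \iff P = O\!\left(\sqrt{nm}\right).
\end{align*}
```

In both cases the additive P term is the cost of the 'collect' and broadcast steps, and the bound on P is the point at which that communication cost begins to dominate the computation.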

5  Conclusions 

Image matching and stereo matching are key problems
in image understanding. In this paper we have summarized
our work [5; 6] in parallelizing a well known matching
technique developed by the vision group at USC.
Several other sequential approaches to these problems
have been proposed. These solutions vary mainly with
respect to the primitives used for matching. Additional
work is needed to consolidate these approaches and
provide a uniform framework for parallel stereo and image
matching.